4. Programming for reconfigurable systems

4.1 Introduction

Creating circuits has always been a rather different process from creating algorithms for Von Neumann style sequential processors. Traditionally circuits have been designed using paper and pencil and by drawing large circuit diagrams. Computer programming on the other hand was done with an assembler or a high-level computer language. Circuits were designed to be completely static in hardware, with all circuit elements operating simultaneously. Programs were transient, with only one instruction operating at a time. Clearly the methods used in creating software were very different from those used to create hardware.

Programmable hardware leads us to re-examine this dichotomy. We would like to use the high-level programming languages which make writing software so easy, yet they are strongly oriented towards single-execution, state-driven environments. High-level hardware development languages such as VHDL are available, but they are not well suited to generic description of algorithms.

A great deal of work has been done in recent years to adapt conventional programming languages to output hardware descriptions. These are successful in as much as they work, yet they leave a great deal to be desired in terms of the efficiency of the hardware they create. Early attempts [PageLuk91] converted a program into a large state machine. Effectively this transmuted the states of the program counter into a series of latches, creating hardware which had no higher degree of parallelism than the sequential code it was based on. More recent work has used dataflow analysis to create smaller, more efficient state machines which make better use of the inherently parallel nature of circuitry. For more detail see section 2.4.

One problem with any system based on an existing programming language is that information which is fundamentally important to efficient hardware implementation is often simply not available in the source language. Conventional programming languages assume that most arithmetic is done on ``a word'' - usually 32 or 64 bits. This is reasonable in a conventional architecture since the ALU will have this many bits available in any case and there is little to be gained from performing an operation on fewer bits. In hardware however all possible operations in the program may be expressed in silicon simultaneously, so for instance using a 64-bit register to store a single-bit flag is going to result in horribly inefficient use of hardware. Some projects modify conventional languages to attempt to patch these holes (Occam [PageLuk91] [CockshottShaw92] [ShawMilne92], Ruby [LukLok93] [Luk95] [GuoLuk95], and Handel [SpiveyPage93]), but there is still a long way to go before code generated by a programming language will use logic array space efficiently.

One aspect of computing which has been brought out by the advent of reconfigurable logic is that current programming techniques are very strongly influenced by the underlying architectures they're targeted at. Just as computer languages such as BASIC which are intended to be interpreted are very different from Algol-style languages which are intended for compilation, there is also a great difference between languages intended for sequential execution on conventional processors and languages capable of efficiently expressing algorithms for massively parallel fine-grained computation on logic arrays. Finding common ground between the two approaches is a daunting task.

Even more daunting is the prospect of finding good ways of expressing features such as run-time reconfiguration - in a conventional language this could be akin to the old-time evil of self-modifying code! Further reflection reveals that the similarities between run-time reconfiguration and self-modifying code are not so great. Self-modifying code acquired a bad reputation back in the days when it transformed machine code which was already difficult to understand into something completely incomprehensible. Generalised high level language constructs to support self-modifying code in a comprehensible way have never become widely available. The final straw for self-modifying code is that modern processors support self-modifying code very poorly due to their heavy use of caches.

To some extent modern compiler techniques redress the differences between conventional programming techniques and those required for logic arrays. Using data flow analysis a sequentially expressed algorithm can be converted into a graph of operations of which some may be executed in parallel. In future it seems likely that programming languages which express algorithms in a more universal form will be developed. It is unclear at this stage what these languages might look like.

In the interim a number of approaches have been taken to develop programming and circuit synthesis systems for current reconfigurable logic. The simplest approach is that used by SPLASH - partitioning is by hand with sequentially executed code being written in C and circuits for the logic array being written in VHDL [GokhaleHolmes91] [Lopresti91]. Perle takes a slightly different approach, converting cleverly formulated circuit descriptions written in C++ to netlists for synthesis [VuilleminBertin96]. Others have created even more sophisticated systems, translating programs partly into sequential code and partly circuitry, with automatic partitioning between the two [Athanas93] [ThomaeVandenbout92] [LukLok93] [WoForward94] [GokhaleMarks95]. See section 2.4.3 for more details.

4.1.1 Partitioning

An important stumbling block in creating code for logic arrays is that logic circuits are inherently space consuming and an entire program may not fit on a logic array. Usually only the most time-critical part of the program is converted into logic while the more algorithmically complex parts of the program are run on a conventional processor which operates in tandem with the logic array. This leads to the partitioning problem - which parts of a program to put in logic and which parts to run as sequential code?

As automatic programming tools for logic arrays are developed more and more work is being done on automatically partitioning programs into sections which can be run in logic and sections which are better run sequentially (see section 2.4.3). This problem alone is very complex and it is easy to see that the optimum result may be impossible to determine unless the data set and run profile is known, on the same basis that the ``halting problem'' prevents us from knowing information about a program's execution until it has actually executed.

The partitioning problem becomes even more complex when we cease to be dealing with program-and-forget logic arrays and begin to dynamically reconfigure them at runtime. In this case the partitioning must be performed over domains of both space and time, and details of the reprogrammability and context switching characteristics of the device must be taken into consideration.

4.1.2 Virtual logic

The advent of dynamic reconfigurability offers the scope to choose at any given time throughout execution what modules are best to load. A number of researchers have proposed ``virtual logic'' which operates in much the same way as virtual memory - sections of logic are loaded into the logic array when required (see section 2.5) [LingAmano93] [Casselman93] [AlbaharnaCheung94] [AlbaharnaCheung94b] [Gokhale94] [SinghBellec94] [VuilleminBertin96]. This concept while initially attractive leads to some significant problems - circuits normally communicate by means of logic lines and it is not practical to simply ``chop off'' the connections to a section of circuit. Nor do circuits inherently ``know'' when they're blocked waiting on data and that they can be switched out of logic. A higher-level mechanism must be used to communicate between the modules in this case, and the flow of data between the modules must be carefully controlled to prevent ``thrashing'' of logic in and out of the configuration store. Current systems rely heavily on the application programmer to mediate reconfiguration and communications.

This kind of programmer-optimised communication mechanism is a long way from the ideal solution of platform-independent, automatic compilation from high-level source code. What's more the exact form of the communication is dependent in hardware implementation details such as the speed of reconfiguration and the size of the logic array. Ideally code would be compiled from a generic algorithm to operate optimally on whatever platform was available.

4.2 Techniques for programming dynamic logic arrays

The purpose of this chapter is to determine what programming styles are most useful in a computer system based entirely on reconfigurable logic. This results of this analysis will be used in subsequent chapters as a guide for the development of architectural features for the Switchblade architecture. In the process a number of features useful as performance enhancements are also investigated.

Figure4.1 shows a plot of various architectures against axes of ``communication switching time'' and ``parallelism''. ``Communication switching time'' here refers to the amount of time taken for one section of code to communicate a result to another. In the case of conventional sequential processors this is the time taken for a subroutine call and in the case of logic array processors it is the time taken to get a result from one section of the circuit to another. The graph is not to scale - it is only intended to convey an impression of the performance advantages of various architectures.

Figure 4.1 - Effects of communication and parallelism on speed

The overall performance of a processor is a combination of its ability to perform computation at high speed (here quantified only as ``parallelism'') and its ability to communicate those results to the places where they are needed. The slowest combination is on the top left of the graph and the fastest is on the lower right. Conventional processors have a fairly low level of parallelism - boosted mainly by the fact that a single instruction can cause many logic gates to be used simultaneously - and a moderately high function call overhead. A logic array system where the entire program can fit on the array at once has the highest level of parallelism and very little communication overhead. It is also very likely to be prohibitively expensive.

The more economical way to use logic arrays is some form of run-time reconfiguration - array space is saved by configuring in only the sections of the circuit which are needed at any given time. Unfortunately reconfiguration itself is inherently expensive, requiring a large amount of data to be transferred in swapping a single configuration from memory. A solution based around multiple configuration stores was developed as part of the early work for this thesis and was published first in [Saleeba93]. Multiple configuration stores exploit the fact that most applications use only a relatively small number of configurations, so offering multiple configurations stored within the logic array itself drastically reduces the need to load configurations from off chip. This means that most reconfigurations can occur instantly by switching configurations stores, with only a small percentage requiring transfers from off-chip.

4.2.1 Example of control flow

Figure 4.2 - A program as a hierarchy of communicating parts

Figure4.2 shows a program as a hierarchy of communicating parts. In hardware these would be interconnected circuits, and in software they would be groups of subroutines communicating by means of subroutine calls. At the top of the hierarchy are the fundamental routines which call the lower level routines. Most of the processing is in the lower level routines and more of the algorithmic control is in the higher routines. In general the partitioning process will move the more heavily used leaf routines into hardware and leave the less frequently used upper routines in software. The figure shows an actual partitioning of a data analysis program, without yet assigning modules to a final medium of hardware or software.

4.2.2 Impact of architectural features

Just as having a massively parallel fine-grained computing architecture affects the way we program it, so too do other architectural features the array may have. In this thesis five architectural features in particular will have an influence on the algorithms we can devise -

  • Run-time reconfiguration
  • Partial reconfiguration
  • Multiple configuration stores
  • Self reconfiguration
  • Lack of a conventional coprocessor

Run-time reconfiguration allows us to change the form of the circuit while the system is operating. This can be used to optimise the system to operate faster according to runtime-determined factors, or to swap sections of circuit in and out of the array as they are needed. See section 2.5 for a survey of research into run-time reconfiguration and section 3.1.1 for a more detailed description.

Partial reconfiguration allows us to reconfigure just part of the array while leaving the rest intact and still operating. This can be useful for swapping between configurations - usually some of the circuit remains constant while other parts may change. See section 3.1.2 for a more detailed description.

Multiple configuration stores permit very fast switches between configurations. This has a strong effect on the viability of run-time reconfiguration - if multiple configuration stores are available much greater use can be made of rapid changes between configurations. See section 3.1.3 for a more detailed description.

The use of multiple configuration stores in logic arrays is relatively new - first described by Saleeba in [Saleeba93]. Hydra was a system designed with logic emulation in mind [Maliniak94]. It provided sixteen contexts as did DPGA, a logic array intended for computing applications [BolotskiDehon94]. VEGA took a rather different approach with a very large number of contexts - 1024 [BolotskiDehon94]. DPGA results demonstrated that a 40%-50% reduction in array space could be achieved using an eight context version of the architecture, given hand-coded examples capable of exploiting the feature. The CHiP architecture is an earlier example of a system using multiple configuration stores, applied not to a logic array but to a programmable interconnect system [Snyder82].

Self reconfiguration is another architectural feature. It implies that the device has control over its own reconfiguration. This opens new vistas of possibility in algorithm-generating algorithms, variable bit width arithmetic and data folding using self-modifying circuitry. First described by Saleeba in [Saleeba93], this feature is vital for a self-contained reconfigurable processor such as that described in this thesis. See section 3.1.4 for a lengthier description of self reconfiguration.

The architecture presented in this thesis is unusual in that it lacks a conventional processor acting as a host and controlling the logic array. This makes some aspects easier - the partitioning problem is less immediate for instance since everything must run on the logic array in some way. On the other hand this approach leads to new questions on how to control an autonomous logic array.

4.2.3 Programming techniques

A great deal of research remains to be done in devising programming techniques to exploit these architectural features. In this section a small number of programming techniques are outlined which exploit these architectural features.

These techniques are not necessarily mutually exclusive - several may be applicable to a given algorithm but offer different speed/space tradeoff advantages. Section 4.3 demonstrates this by applying most of these techniques to an example prime number generation algorithm.

It is not the purpose of this thesis to investigate programming techniques in detail - only to investigate them in as much as they impact the architecture.

4.2.3.1 Enumerative approach

One example of a programming style which is attractive in hardware but not in software is an enumerative approach. Rather than taking a mathematical approach to a problem - developing formulas and having a general purpose processor calculate the results - the nature of reconfigurable hardware favours an approach where a small amount of clocked logic comes to the same result using an iterative approach. Since the logic may be clocked at a very high rate it will often exceed the speed of the formula-based approach and require less logic resources. Logic arrays are very good at implementing counters and adders in small amounts of space but given how expensive in die area transcendental functions can be they are inherently less attractive in this medium.

An example of this type of algorithm is the classic ``sieve'' prime number generator. Compared to other methods of finding prime numbers (discussed in section 4.3.1) it is very well suited to efficient hardware implementation.

This programming approach is common to all hardware implementations - it is not specifically related to configurable logic. The following programming approaches on the other hand are all specifically related to configurable hardware in some way.

4.2.3.2 Overlay approach

Logic arrays tend to be capable of a high degree of parallelism but also suffer some drawbacks from being better adapted to parallel than sequential operations. Logic arrays are limited in size, and one way this limitation shows is in the amount of code which can be on the array at any one time. Overlays are a simplistic answer to this problem. They allow the programmer to nominate to replace one section of code with another at some point during execution. This is a primitive form of run-time reconfiguration. Depending on the architectural features of the target logic array it may or may not be possible to load an overlay to only a limited section of the logic array - it may be necessary to reconfigure the entire array at once. Similarly the reconfiguration process may vary greatly in speed depending on the reconfiguration interface used and whether multiple configuration stores are available.

An overlay approach is used with most of the current generation of run-time reconfigurable systems. The programmer must explicitly nominate where one region of execution ends and the next overlay begins. Methods such as automatic partitioning are attempts to automate this process and make it less intrusive on the programmer.

This approach is by no means limited to logic arrays. Overlays were used extensively in the early days of microcomputers when memory was at a premium. Large programs were loaded one section at a time and when each section was completed the application would ``chain'' along to the next overlay, loading it from secondary storage before execution.

4.2.3.3 Double-buffered approach

This approach is an optimisation of the overlay approach which uses the benefits of multiple configuration stores and/or partial reconfiguration. While one configuration is operating, an unused part of the configuration store is being reconfigured with a new configuration preparatory to it being required later. Given the slow speed of reconfiguration, overlapping reconfiguration with computation can result in good speedups for algorithms which reconfigure often.

4.2.3.4 Limited word length arithmetic

Conventional processors have a single internal data word width and all operations are capable of operating on this width. Modern RISC processors aggressively optimise the ALU for this word width to obtain the results of most operations in a minimum time quantum. As a result there is no motivation for the programmer to produce programs which use data items containing less than the full word width. Logic arrays on the other hand can be considered massively parallel single-bit computers. Every bit counts, so using a C ``int'' to represent a flag is likely to waste a large number of bits and a large amount of array area.

Current programming languages are not well designed to provide this kind of control over final word length. What's more word length may vary along a computational path as accruing results change in size. Automatic methods of determining word length are limited by the inability to determine in advance what the size of a result will be, in much the same way that the ``halting problem'' prevents foreknowledge of whether a program will terminate.

In the short term it is possible to add some simple extensions to conventional languages to specify word widths. This allows data paths and arithmetic to operate on a programmer-specified word length, saving a great deal of logic array area and allowing more functionality to be crammed into the logic array. Some work on variable word length arithmetic on reconfigurable processors was done on Perle [VuilleminBertin96].

4.2.3.5 Variable word lengths - temporally partitioned

Another approach is to allow variable word lengths. This relies on run-time reconfigurability to change the width of a data path during the execution of a program. For some algorithms a compiler can determine that a data path width will be different when a module is used at various different points. Separate configurations can be generated for each of these data path widths and each regarded as a separate entity - much as some programming languages allow function overloading by data type.

This approach is one example of the use of special cases over the time domain, which is discussed in more detail in section 4.2.3.7.1.

4.2.3.6 Variable word lengths - on demand

Programmers are often troubled by the same problem as compilers - sometimes it's hard to know in advance what word length will be required. Run-time reconfiguration allows us another option - to reconfigure data path widths if necessary. Conceptually a circuit could route overflow bits to an ``interrupt'' equivalent which would cause the system to reconfigure the specific data path with more bits. In reality this would be very expensive, requiring that the very expensive circuit synthesis processes be repeated for the new word width. It may however prevent catastrophic program failure and data loss so it would almost certainly be preferable to the alternative.

4.2.3.7 Constructive approach

Traditional programming wisdom abhors the creation of self-modifying code. Conventional programming tools also usually preclude it. In reconfigurable logic arrays we stand to gain a great deal from this technique though, particularly through the use of a constructive approach to programming.

As was mentioned earlier, the current generation of conventional processors are poorly suited to run-time reconfiguration due to their use of caches. The constructive approach and many of the other approaches described here require run-time reconfiguration so it is clear that they are poorly supported by conventional processors, and even if they were, the performance advantages which they offer by expoiting the fine-grained parallelism of logic arrays would not be available on a conventional processor anyway. These factors go a long way toward explaining why approaches such as the constructive approach are not popular in programming already.

Often a program starts with a very limited data set. As it runs it acquires more data from an I/O source and operates on it, adding it to the data it already has, often with a complexity increasing as the internal data set increases in size. This added time cost can be reduced or avoided by exploiting the logic array's inherent parallelism - processing of a new data item causes a new circuit to be generated which operates in parallel with all others when the next data item is read. The array space used increases until some physical bound is reached or the algorithm terminates.

The constructive approach is extremely powerful in that it reduces many algorithms to O(n) complexity. A disadvantage is that current architectures do not support this style of reconfiguration in any useful way so the reconfiguration overhead may be high. The Switchblade architecture described in this thesis is somewhat better adapted to this purpose, but this still remains an area for research. A second disadvantage of the constructive approach is that it is limited by the size of the logic array.

4.2.3.8 Staged approach

The ``constructive approach'' has a major limitation in its assumption that the data set will fit entirely within the logic array. The staged approach is similar in concept except that it assumes that the data set will not fit within the logic array. It relies on external memory to store intermediate results and run-time created overlays which are loaded sequentially. Each overlay contains a subset of the constructed circuitry. The system passes the input through the first stage overlay and results are stored in memory. Then the array is reconfigured and the temporary results are passed back through the second stage overlay. Once all the overlays have been used, a new set of results have been found. These new results may cause the creation of a new overlay which will be used in subsequent passes. In many cases the staged approach provides a large constant speedup over a conventional approach.

A half-way step between the constructive and staged approaches is possible if multiple configuration stores are available. Up to the limit of the number of configuration stores available, constructed overlays can be kept in alternate stores on the logic array. When a new data item is read the multiple configurations can quickly be flipped through, performing the processing for several overlays before requiring a memory write. This greatly reduces the overhead caused by memory accesses.

4.2.3.9 Data folding

Data folding is where some variables which are constant over the duration of a program run can be treated instead as constants, programming them directly into the configuration for the one run [Foulk93]. This can be considered a limited (but important) case of the constructive approach. The parameters are part of the input set but the constructive approach is applied only for those parameters where it is considered worthwhile.

This approach can significantly increase the density of configurations if the folded variables are heavily used.

4.2.3.10 Run-time data folding

Another possibility is to take data folding, but not limit it to the duration of a program. Instead, algorithms can run-time reconfigure themselves to take different fundamental values as the algorithm progresses. Changing the value of one of these programmed-in variables would be relatively expensive but will still be viable in cases where the changes of value are infrequent.

Data folding at run-time is exactly equivalent to a subset of the now spurned technique of self-modifying code. The technique fell out of favour due to the difficulty of understanding and maintaining self-modifying programs (which were mostly written in assembly language). These days programming and compiler techniques have advanced to the point where it is quite easy to conceive of compilers automatically producing self-modifying code as an optimisation without the programmer needing to be concerned about it. Given the lack of advantage to be obtained from self-modifying code on conventional processors and the caching problems that it causes it is not often used. In logic arrays however there is potentially great advantage to be obtained from this technique so a return to the old technique of self-modification may be a profitable area for future research.

4.2.3.11 Sequence/space tradeoffs

Conventional sequential processors can be considered opposite in approach to logic array processors - they achieve speed by performing complex calculations in sequence where logic arrays achieve speed by performing simple operations in parallel. Sequential processors suffer the disadvantage that highly regular operations must still be executed sequentially or with a low degree of parallelism. Logic arrays are excellent at fine-grained parallelism but poor at complex, irregular algorithms.

Each is an extreme response to the problem of executing arbitrary algorithms. Each works for all problems, but tends to offer poor performance for problems it's poorly suited to. Clearly there must be a middle ground. MATRIX [MirskyDehon96] is the result of one attempt to find this middle ground, but unfortunately it failed to perform to expectations.

Reconfigurable logic arrays give us the option of configuring a sequential processor on to the array if we desire to. Many researchers have implemented conventional processors on logic arrays [GrunbacherJaud92] [Davidson93] [SurmannUngering92] [PetersenHutchings95] [Meier95] [IseliSanchez93] [Brunvand92]. Some researchers have taken a further step by parameterising these processors so they may be adapted to the specific application [Meier95] [IseliSanchez93] [Brunvand92]. Others have investigated the possibilities of creating customised instruction set processors in reconfigurable logic [AthanasSilverman93] [LewisVanierssel93] [Witting95a] [WirthlinGilson94] [WirthlinHutchings95] [WirthlinHutchings95b].

One answer to the problem of finding a middle ground between logic and sequential computation is to implement sequential processing units on the logic array. What's more, these sequential processing units can be customised to the level of regularity expressed by the algorithm. A relatively simple algorithm might get by with a sequential controller implemented by a finite state machine in a few latches. A more complex and irregular algorithm might be best implemented by a processor configuration running specialised instructions out of memory.

The problem can also be attacked on a more local level. Some parts of an algorithm might exhibit regularity but others not. In this case a sequential processor could be configured in only when needed. Another case might contain highly regular functional units being controlled by a complex algorithm. A small sequential processor controlling an area of logic would be better suited to this. In yet another case complex independent operations with a high level of local communication might call for an array of sequential processors to be configured into the array.

Some method of automatically generating sequential processors well-suited to specific tasks within an algorithm is needed. Neither completely-sequential nor completely-logic solutions are ideal in all cases - an approach combining the two is more likely to provide an efficient solution. The balance of sequential processor size versus speed will depend on a number of factors including how overall performance is affected by its speed, how regular the process is that it controls and how its space use affects the ability of the rest of the system to perform.

Ideally a compiler would locate sections of an algorithm involving sequence and determine the best style of sequential processor for the application. In most cases this would be as simple as a few latches. In more involved cases quite a complex processor might result. These ideas are worthy of a research project and thesis in their own right.

4.2.3.12 Algorithm folding over the time domain

The ultimate extension of the approaches covered in this section is to apply multiple approaches simultaneously. In some cases the approaches are compatible and in other cases they are mutually exclusive. When mutually exclusive, one approach may be favoured over another dependent on conditions encountered at runtime. It may be desirable to offer an algorithm in multiple forms which are selected from depending on the circumstances. The prime number generator example described in section 4.3 demonstrates this using two versions of the same algorithm at different times - an initial constructive phase followed by a staged approach when array space becomes scarce.

4.2.4 A model for programming dynamic logic arrays

A number of models for programming reconfigurable systems have been developed. For the most part these models are based on expediency and the need to interact with the tools available. Bertin's concept of ``Programmable Active Memories'' is elegant but the resultant ``Perle'' system was programmed in a fairly conventional manner [VuilleminBertin96]. Some systems attempt to provide a programming model which conceals the distinction between hardware and software by having the compiler automatically partition between the two [Athanas93] [ThomaeVandenbout92] [LukLok93] [WoForward94] [GokhaleMarks95]. These are not particularly efficient at creating circuits from conventional code yet, but are improving. Some researchers conceive of ``virtual logic'' arrays where sections of code are switched in and out of the logic array at runtime as required [Gokhale94]. The methods of communication between these virtual circuits are problematic - conventional circuits are not capable of being switched in and out transparently in this way.

The Switchblade architecture described in this thesis pursues a virtual logic model. It attempts to improve the feasibility of this model by providing specific hardware support for logic virtualisation and addresses the problem of communication between virtual logic modules.

4.2.4.1 The future

Section 4.2 has described a number of approaches to algorithm design, each of which provides definite advantages under certain circumstances. Some current systems provide a division between algorithms to be implemented in hardware and those to be executed in software. Using the approaches described above it can be seen that in a purely logic array based system the distinction between sequential code and logic can be quite blurred, and in fact there are a number of other ways of expressing the same algorithms. This possibility is not merely a theoretical curiosity - significant performance improvements can be found through appropriate choice of approach.

As with the current distinction between sequential code and logic, algorithms must currently be specifically developed by hand to use whichever approach is best. The alternative is to develop a compiler which can automatically determine which of the approaches is suitable and automatically transmute the algorithm into one based on a different approach. This is not a trivial problem, and may well be an important research area of the future.

The current popular methods of expressing algorithms all tend to be biased toward the traditional sequential mode of computation. This is unsurprising since they are designed by current programmers - who are accustomed to programming using sequential languages. It is difficult to translate these algorithms into circuit definitions efficiently. It is likewise difficult to translate circuit definitions into sequential algorithms efficiently. It is even more difficult to convert algorithms into functionally equivalent but faster algorithms based on some assessment of what approach may be more suitable for given situations.

From this it seems clear that our current algorithms are too involved with their execution environment. They are based on assumptions about that environment which may not be true in a changeable reconfigurable logic array system. Some way of neatly describing the essence of an algorithm without these assumptions is needed. A meta-algorithm, if you will.

Unfortunately the solution to this problem is not clear. It's also beyond the scope of this thesis. However this must be considered a prime research area in the next few years if reconfigurable computing is to advance.

4.3 Prime number generation case study

This project was designed to demonstrate the programming approaches discussed earlier in this chapter, and to test the supporting features offered by the DRAMA architecture. The DRAMA architecture was designed in concert with the programming approaches to provide a test bed for their evaluation. As well as creating a sizeable and potentially useful application for the DRAMA architecture as proof of concept, the primes project explored the performance advantages of DRAMA's architectural features [SaleebaPose96] [SaleebaPose96a].

Prime number generation was chosen as it did not have a bias towards floating point or other computation better suited to special-purpose coprocessors. Similar algorithms could find useful application in cryptography, where prime numbers are heavily used in creating code keys [RivestShamir79]. The algorithm presented here could also be adapted to factorisation, and hence code breaking.

4.3.1 Primes

The generation of prime numbers tends to reduce to a related problem - determining if a given number has any factors. A great deal of work has been done on factorisation algorithms over the years, and Knuth gives a good survey of these [Knuth81]. An early method was devised by Fermat [Dickson52], and is the basis for many of the other methods.

Probabilistic techniques for primality testing are also possible. A Monte Carlo-type method for factorisation was described by Pollard [Pollard75], and a probabilistic primality test [SolovayStrassen77] [SolovayStrassen78], was found to be fast compared to other methods. While it did not produce deterministic results, the chance of error could be shown to be vanishingly small.

A continued fractions approach is well suited to finding large factors of large numbers [LehmerPowers31] [BrilliartMorrison75]. This method produces factors very quickly, with O(Nepsilon(N)) , where epsilon(N) » 0 as N » infinity.

Sieve techniques have the advantage of not requiring complex maths, which makes them uniquely suited to hardware implementation. A survey of sieve methods is given in [Wunderlich67]. A sieve machine is described in [Lehmer33]. A 1965 hardware delay-line implementation was capable of processing one million numbers per second. A similar technique is also adopted for this paper as sieves provide the greatest effect for the least hardware.

A simple sieve requires one bit of storage for every number tested. The algorithm used here modifies the algorithm so that the sieve values are generated on the fly rather than stored.

4.3.2 Algorithm

The algorithm used in this device uses a type of parallel-sieve technique and relies on the fact that nearly all potential primes will be eliminated very quickly by an initial, limited sieve. In fact, on average the algorithm eliminates close to two numbers as primes per clock cycle. The trickle of numbers which are not immediately eliminated must be exhaustively checked for primality at considerably greater expense.

One advantage of the algorithm is that it is able to exploit the primes already produced to check for factors in later numbers. The availability of these primes drastically reduces the search space when looking for factors.

The algorithm is split into two main stages. These stages do not go through their sequence only once - the algorithm is designed to cycle between the stages indefinitely as more and more primes are generated. In fact the algorithm has no termination condition - results will be generated until the storage space for primes is exhausted. Even then, the algorithm can conceivably continue producing results to a stream device up to the square of the last prime stored.

Stated simply, the algorithm generates primes by checking each successive integer for divisibility by each prime number less than or equal to its square root. Any integer with no prime factors is added to the list of primes.

The reason for splitting the algorithm into two stages is that the issues involved in creating the first-guess ``candidate'' primes are different from those involved with checking all the possible factors of a candidate. Since the vast majority of numbers can be eliminated in the first phase it makes sense to optimise this case as heavily as possible.

The first stage generates an initial set of candidate primes by testing each successive number and eliminating the primes from two upwards as factors. The upper limit on number of primes that can be checked is determined by the size of the available logic array. This stage produces a list of candidate primes in a buffer external to the logic array. These candidate primes have been checked for divisibility by low-valued primes, but may not have had all possible prime factors checked.

The second stage uses specially created logic array configurations to quickly check batches of higher-valued prime numbers against the lists of candidate primes. The results are stored in a second external buffer. When all the candidates have been checked against the primes which a configuration has been designed to check for, a new configuration is created to check for the next batch of possible factors. The candidates are passed back through the logic array to the first buffer, eliminating more potential primes in the process. This stage can repeat until all the potential factors have been exhausted, and the contents of the final buffer are declared prime.

This algorithm is an example of the ``staged approach'' described earlier in this chapter. The sieve algorithm is also inherently an ``enumerative approach'' algorithm.

4.3.3 Candidate generation

Candidate generation is the first stage of the algorithm. The candidate generator is capable of generating a candidate prime every few clock cycles. This candidate has already been checked for divisibility by a number of low-valued primes.

In essence the procedure is to count upwards through the integers, testing a new integer for primality on each clock cycle. As an optimisation only the odd integers are checked. These checks are carried out with a high degree of parallelism, with a number of factor checks occurring every cycle.

To qualify as a candidate prime, an integer must not have a prime factor in a limited range from 2 upwards. The upper limit is determined by the size of the logic array since each additional factor checked incurs an expense in logic elements. Given infinite circuit resources the first stage could generate all the primes itself, obviating the need for the latter stage. Unfortunately, realistic limits on the number of available array elements necessitate the use of a second stage.

Every single clock cycle another number is checked for primality. All the factors which have factor checkers in hardware are checked in parallel, simultaneously. If on any given clock cycle all the factor-checking units fail to find a factor, the number is saved on the list of candidate primes.

Ideally as many factors as possible should be checked before a number is declared a candidate and passed to the next stage. Section 3 gives simulation results describing the relationship between the number of factors checked by this first stage and the overall performance of the system.

Figure [4.3 - "Candidate generator"]

4.3.3.1 Counting factors

The factor-checking circuits are very simple. Essentially each one is a divide-by-n counter, where n is the prime factor to be tested. Each counter starts at zero and increments up to n-1, after which it returns to zero. Since the period of the counter is the same as the value of the factor, each zero output of the counter corresponds to another number divisible by that factor. Figure4.3 shows a partial circuit of the first stage.

Interestingly, the fact that each clock indicates a step by two rather than one does not affect the period of the counters, only the point at which they are initialised. In other words if the counter for prime factor seven is reset to zero on the clock cycle which corresponds to the value seven, then the next time it will indicate a factor is seven clocks later, on the cycle corresponding to the value twenty-one. The factor for fourteen never occurs as it is even and there are no clock cycles for even numbers.

By ORing the return-to-zero signals on each counter it is possible to determine if any of the factor checkers has found a factor each cycle. If none of the factor checkers has found a factor the value is a candidate prime and can be stored.

4.3.3.2 True primes

Any prime generated by the circuit is either a candidate prime or a true prime. True primes will be generated from the time that the circuit starts running until the tested value reaches the square of ``the first prime which cannot be implemented as a factor checker''. This is the first value which has a factor which is not testable by the candidate generator due to the limited logic resources.

True primes are stored directly to the result buffer in main memory using the external memory interface.

4.3.3.3 Self-generation

When the circuit is first started it knows only about the primes one and two. As it discovers more primes additional factor checkers are added for these new values. This is possible due to the self-reconfiguring nature of DRAMA. An interface is provided which permits the device to change its own configuration as it operates. In this case the feature is used to create a new factor checker containing a divide-by-n counter where n is the value just checked.

The self-generation phase is an example of the ``constructive approach'' described earlier in this chapter.

The algorithm for creating the factor checker is simple enough but expensive of logic array space. For this reason it is stored in a separate configuration, as will be described shortly. The first thing it does is find an appropriate area of free logic to configure. This is easy on DRAMA as elements can be allocated linearly. A configuration for a simple ripple counter is created, with the reset function being determined by the bit pattern of the factor being checked. The generator's main OR function is modified to include the output of the new circuit.

Figure [4.4 - "A general outline of execution"]

4.3.4 Phases

The progress of the first stage can be split into three phases - firstly the self-generation phase (which also produces the low-valued true primes), next the true prime generation phase, and finally the candidate generation phase, which can recur indefinitely. See figure4.4. It may seem overly complex to switch between phases this way, but in fact most of the circuit remains identical - the change in phase merely indicates that a small subsection of the array has switched from one configuration to another. This is an advantage of the ``multiple configuration'' feature of DRAMA, which permits arbitrary sections of circuit to change configuration instantly on receipt of circuit stimuli.

Multiple configuration is also used heavily during the self-generation phase. When a new configuration is to be created the normal clocking scheme and some of the supporting logic is switched out while the new factor checker is constructed by a different configuration. When this is complete the normal generator circuitry is switched back in and continues as if uninterrupted.

4.3.4.1 Calculating squares

Yet another configuration is used for calculating squares. The square of the first prime which can't fit in the device as a factor checker is used to determine the transition from true prime generation to candidate prime generation. On the transition from the self-generation phase to the true prime phase the discovery of the first true prime causes the system to switch configurations, perform a simple shift/add multiply to square the current number, and then modify the configuration again.

The modified configuration sets an AND gate to monitor the current value being tested and inform the system when it's time to change to candidate generation phase. In non-reconfigurable systems this function would be performed by a latch and a comparator, at greater expense in logic array elements. With self-reconfiguration we can use a simple AND gate and modify it to suit. This is an example of run-time data folding, one of the approaches described earlier.

Realistically speaking it is not strictly necessary to start with a bare-bones candidate generator and bootstrap an initial candidate generator from there. It would be easy enough to generate a list of low-valued primes and create a configuration for the initial candidate generator beforehand, however the algorithm as described vividly demonstrates the power of self-reconfiguration.

4.3.4.2 Storing candidates

Candidate primes are stored differently from true primes - they are saved to one of two external RAMs connected to the logic array. These RAMs act as a double-buffered temporary result area. At any time the logic array may be reading one buffer and writing to the other. While later stages use double-buffering, the candidate generator needs only to write results to one buffer.

The way that the values are stored is slightly novel. Rather than storing the absolute binary value of the candidate prime, the values are stored as differences from the previous value. The difference between primes is always even, so the least significant bit need not be stored. Also, simulation shows that for the first million primes on average candidate primes were 16.5 higher than the previous candidate. None were found greater than 154 from the previous candidate, so 16 bits was deemed ample for storing each difference. This is clearly more economical than storing the (potentially very high valued) numbers in their raw form.

Another optimisation might be Huffman coding, which would drastically improve the efficiency of use of the buffers, and hence improve the overall performance of the system.

4.3.4.3 Buffer full

When the candidate buffer fills up, the first stage must come to an end. Control will be passed to the second stage by means of switching configurations. At some later stage when all the available candidate primes have been processed by the later stages, control will return to the first stage so that more candidates can be generated. This highlights an important feature of the way DRAMA handles its multiple configurations - when a configuration is switched out its state is stored with it. This means that restoring the candidate generator to the exact place where it left off is simply a matter of switching its configuration back in.

Figure [4.5 - "Factor clearer"]

4.3.5 Clearing factors

The purpose of the second stage is to check the list of candidate primes and eliminate any with factors. Since the hardware is incapable of checking all the factors simultaneously, the algorithm divides the task into groups of factors to be checked and iteratively checks each group of factors against the list of candidates. On each iteration the candidates are read one by one from the source buffer, checked for factors, and candidates for which no factors have been found are written to the destination buffer.

After each pass a configuration with the next set of factors to check is established and the process is repeated. The only difference is that the source buffer and destination buffer swap their roles with each pass, performing a sort of ``sifting'' action on the candidates as they are ``shaken'' back and forwards between the two buffers.

4.3.5.1 Factor clearing

The process of checking factors in the second stage is similar in concept to that used in the first stage, though the procedure is slightly more complex. The checkers (called ``clearing units'' here to distinguish them from the factor checkers of the first stage) use the same technique of counting up the moduli of each of the factors being tested, however unlike stage one they can't count in predictable steps.

The stream of candidate primes coming in from the source buffer is encoded as differences, so instead of incrementing the modulus counters by a fixed step as in stage one, the difference is added to the values of each of the counters.

As with stage one the counters have a period equal to the value of the factor being cleared. In this case though the counters count from a value p part way through the output range up to a state with all bits set. This approach is designed to minimise the circuit complexity and hence space that each clearing unit consumes. When a counter overflows p is added to the counter to return it to the appropriate range (figure 4.5). In this way the period of the counters is kept correct. p is simply the two's complement of the period.

When a counter overflows the output is quickly compared to zero before the addition takes place. If it was non-zero on all the clearing units, no factors were found and the candidate is written to the destination buffer. If on the other hand factors were found, then this candidate can be discarded. The difference values of discarded candidates are accrued in an output difference register. When the next factorless candidate is written it will have the extra differences added to it to keep the differences correctly synchronised.

The time taken to check a candidate against the factors in a configuration is two cycles - the first cycle adds the difference to the counter and checks if the result is zero. The second cycle adds p to the counter if an overflow occurred. These two steps each trigger parallel computation on all the factor clearers on the array, eliminating many factors in one hit.

4.3.5.2 Configuring the clearer

The configuration of factor clearing circuits takes a double-buffered approach. A small area of the logic array is devoted to a reconfiguration controller which operates in parallel with factor clearing. As all the candidate primes are sifted across from one buffer to the other, testing against one set of factors, the reconfiguration controller is fetching the next set of factors and creating a configuration to perform the task of clearing these factors.

Two configuration stores are set aside for factor clearing circuits. At any given time during the second stage one of the configurations is being used to clear factors and the other is being configured with the next set of factors. After each iteration the roles of the two configurations are reversed, double-buffering the reconfiguration process.

This reconfiguration controller is a simple state machine designed to use the minimum amount of logic array space. Its job is greatly simplified by the fact that the configurations already exist in an almost complete state in the configuration store, and are designed so that only relatively few changes need be made to modify them to check for different factors.

Of course this same function could have been performed by statically-configured logic but reconfiguration has the advantage of allowing the size of the clearing units to be determined by necessity, allowing more to be squeezed into the available space.

A side effect of the approach taken here is that the same configurations are created again and again as each new set of candidates from stage one are checked against the same factors. This initially appeared to be an expensive choice, and storing the configurations off-chip was considered. The storage requirements of this would have been prohibitive however, and a simplified design of clearing unit helped reduce the complexity of the reconfiguration controller to a point where the cost of creating the configurations from scratch was acceptable.

4.3.5.3 Prepping the moduli

Before a configuration can be used, each counter must be initialised with the correct modulus value for the starting candidate value modulo that factor. After the reconfiguration controller has finished its task for each factor, the same logic array area is replaced with a configuration for a simplified long divider. This divides the candidate by the factor, and stores the modulus in the appropriate counter by means of the self-reconfiguration interface.

When a configuration is first used, its state is such that one cycle of the ``add p'' state occurs, bringing the modulus into the correct range.

4.3.5.4 Producing results

The second stage is complete when all the factors up to the square root of the largest candidate have been tested. After configuration, the square of the next higher factor is calculated. Verified candidates below this limit are true primes, and can be output as results. Otherwise they still have remaining factors to check, and are passed to the destination buffer for the next pass.

When the external memory runs out of space it is still possible to produce results, as long as there is some type of stream device to output them to. The highest number which can be checked for primality is the square of the largest number stored in the result area. The result numbers are necessary as they are used in generating the second stage factor clearers. This maximum value is likely a very large number, and reaching it would probably require considerable patience on the part of the operators and their descendants.

4.3.6 Simulation

The algorithm was programmed in a form suitable for running under the DRAMA simulator. It was found that the design could fit in a multiple-configuration device with eight alternate configurations available.

The effectiveness of the algorithm is highly dependent on the size of the available logic array. The more numbers eliminated in the first stage, and hence the fewer false candidates produced, the more effectively the external buffers will be used, and hence the faster the system will run. Also, the second stage has a considerable overhead in the size of the reconfiguration controller. The effect of this can only be minimised by use of a relatively large logic array.

Logic array size Factor checkers Average factor clearers False candidates
300313.035.4%
3844010.132.8%
5125216.829.8%
7687632.725.5%
10249947.011.2%
128012063.05.9%
153614177.03.1%

Table [4.1 - "False Candidates"]

Simulations were performed with various sizes of array. The results given in table 4.1 show the percentage of false candidates produced for various array sizes. These results were collected for a short run of primes up to one million. The proportion of false candidates is determined by the number of factor checkers. This in turn is determined by the size of the logic array. For comparison the number of factor clearers from the second stage is also shown. Note that since the size of the factors change from pass to pass the number of them which can fit in the logic array varies. For this reason the number of factor clearers is expressed as an average.

The proportion of false candidate primes reduces gradually as the number of factor checkers is increased. The reason for the more marked drop towards the end of the sample space is that the 141 prime factors being checked for reach the value 811, most of the way to the square root of the maximum value being tested.

Figure [4.6 - "Performance vs. array size"]

Figure 4.6 shows the rate of output of prime numbers graphed against the size of the logic array and the runtime elapsed. The first obvious feature of the graph is that the rate of production of primes is very irregular. In fact the graph approximates figure 4.4, with flat candidate generation phases alternating with steep factor clearing stages.

The larger logic arrays are able to produce considerably more primes in the initial spate - the number of primes they can generate without going to the second stage varies with the square of the number of factor checkers.

The flat stage indicates the candidate generation phase, and takes about the same time in each case. Of course the larger arrays produce a smaller proportion of false candidates.

The reconfiguration overhead in the second stage is large enough that a 300-element array is only just viable. Its performance is very bad compared to larger arrays, indicative of ``thrashing''. As the capacity of the arrays increases the number of primes which are produced in the second stage increases dramatically.

It is to be expected that for longer runs the rate of production of primes will decrease gradually over time. Not only do primes become more rare the higher their values become, it is also more expensive to check them for primality.

Given a pessimistic clock speed of 10MHz and an array size of a thousand elements, the expected output is on the order of a million primes per second for the first few seconds of operation. By comparison a 100MHz SGI Indy took 18.47s to calculate the first million primes using the highly-optimised ``primes'' program which comes standard with BSD UNIX. The machine in question was an IP22 model Indy with a MIPS R4000 RISC processor, 32Mb of RAM, 8k byte instruction and data primary caches and a 1Mb unified secondary cache.

One potential problem with the second stage occurs if the number of candidates in the buffer is small enough that they are processed more quickly than the next configuration can be loaded. Interlocks exist to prevent premature changeover of configurations, but if this situation was to occur often the efficiency of the process would be greatly reduced. Simulation showed that in practice this situation occurs rarely, if ever. In fact it might be considered worthwhile to reduce the length of the external buffers from the experimental size of 64k words if their expense is a significant consideration. This would increase the chance of premature completion, but this does not appear to be a limiting factor in operation.

4.3.7 Review of the primes study

As an exercise in converting a familiar algorithm to an unusual architecture this project showed very clearly that the types of algorithms which are likely to perform well on conventional computers may not necessarily be well adapted to other platforms. The approaches used here were specifically geared to reconfigurable logic array architectures, and the performance gains over an algorithm on a conventional architecture were considerable.

The experience gained in this project indicates that new programming approaches and architectures to facilitate them must be developed to fully exploit the benefits offered by this technology.

The architectural features of DRAMA which proved most useful were multiple configuration stores, partial reconfiguration and self-reconfiguration. The disadvantage of exploiting these features was a large increase in complexity, and a corresponding increase in the incidence of bugs. Automatic tools could potentially alleviate these problems in the future.

As a fast prime number generator, the device is quite successful. The main disadvantage is the algorithmic complexity of checking every potential factor of each candidate prime. An algorithmic improvement could be achieved by adding a third stage to the algorithm with a continued fractions approach to factorisation. This would of course add further complexity to the system, but would improve performance once the numbers being produced started to reach high values.

4.4 Conclusions

Programming for computing devices based on logic arrays is not the same as programming for conventional general purpose processors. Circuit synthesis from conventional algorithms is possible but fails to exploit the abilities of the logic array, resulting in inferior performance.

There are a number of general approaches to algorithm design which can result in significantly improved performance over conventional methods when the algorithms are targeted to logic arrays. In particular certain approaches can exploit specific architectural features which may be available on a reconfigurable logic array. Functionally equivalent algorithms can be expressed in many different ways, each of which may be better suited to a target architecture with different attributes or to a different data set.

Von Neumann style conventional processors are merely one data point in the spectrum of possible processors. Logic arrays are another. The advent of reconfigurable logic arrays gives rise to the possibility of gradations between these two extremes, ranging from fully sequential programs to bit-parallel circuits.

The split between programming techniques for conventional processors and programming techniques which are better suited to logic array processors is an indication of a larger problem - some more general way of expressing algorithms which is less dependent on the target architecture is needed. The future may hold systems which can automatically produce an appropriate algorithm from a combination of a stated function and a description of the operating environment.

In the meantime we still have a great deal to learn about how best to program logic arrays in order to extract the maximum performance out of them. The programming approaches outlined in this chapter when combined with architectural features to complement them result in some significant advances in performance over previous systems.