Virtualised logic array

6. Virtualised logic array

6.1 Introduction

The trend towards using programmable logic arrays as computing devices has led to the question of whether they could ever replace general purpose processors. One obstacle to this is their inability to multi-task in any convenient way. A general purpose computing device should be able to run multiple tasks and should have some reasonable way to deal with large numbers of tasks executing simultaneously.

One problem with using logic arrays as computing devices is that they are essentially single-process devices. Traditional logic arrays are very slow to reconfigure, so slow that switching between configurations to change tasks would not really be viable. Even so, quite a lot of research has been done on run-time reconfiguration of current-day devices - the logic array is a scarce enough resource that in some applications it is worthwhile reconfiguring on the fly [DaalenJeavons93] [Daumas94] [Dehon94] [Dehon96a] [Eldredge93] [EldredgeHutchings94] [EldredgeHutchings94b] [EldredgeHutchings96] [FrenchTaylor93] [Foulk93] [GokhaleMarks95] [GuntherMilne96] [Hadley95] [HadleyHutchings95] [HadleyHutchings95b] [HastieCliff90] [HutchingsWirthlin95] [LemoineMerceron95] [Lysaght91] [LysaghtDunlop93] [LysaghtStockwood94] [RossVellacott93] [Saleeba93] [SinghBellec94] [SinghHogg96] [SchonerJones95] [Shand95] [Tau95] [WirthlinHutchings95] [WirthlinHutchings95b] [WirthlinHutchings96] [WirthlinHutchings97a]. Still, while reconfiguration can be done at runtime this does not serve to make multi-tasking more viable. In fact it demonstrates a desire to make even more use of the existing resources even for just a single task. Run-time reconfiguration references are covered in more detail in chapter 2.

Dynamically reconfigurable architectures such as DRAMA [Saleeba95] are able to switch between configurations rapidly, but are still limited by the fact that the circuit synthesis processes which generate the configurations are very slow and inflexible, and must be done in advance. Even though with these architectures it is quick to change between configurations, those configurations must be generated with a specific fixed location in mind, giving no real opportunity to place a circuit at a different location at runtime. This is unfortunate, since to be able to parallel process it would be convenient to have parts of more than one program running on the array at once. What's more at any given time there may well be unused space in the array available to accommodate an extra sub-circuit but this area may not be at the specific place that the sub-circuit was originally intended to appear at. Even if the circuit could physically be placed at the new location, there may well be no routing resources available to connect it to wherever it is needed. The fundamental conclusion is this - traditional logic arrays are not really designed with this kind of use in mind.

This chapter describes a new scheme which makes it possible to simultaneously execute independently routed tasks on the one logic array - achieving goals of multitasking, task-parallel processing and virtualisation of the logic array. The implementation issues which accompany this approach are discussed and a description of the design decisions behind Switchblade is given.

6.2 Segmentation

An ideal way of allowing multiple tasks access to the common logic resource of a logic array would be akin to the way conventional processors provide multiple tasks with access to the common memory resource - they use a virtual memory system or in this case the equivalent would be a virtual logic system. Sections of circuits in heavy use are paged in and running while latent pages are in some form of secondary storage and do not need to consume any space on the logic array at all. As with virtual memory, virtual pages can potentially be physically located anywhere in the logic array.

Much as conventional program code can be divided into pages and paged in and out, circuits can be divided up into partial hardware configurations and switched in when necessary. That much is intuitively obvious but unfortunately the reality is a lot more complex.

Circuits normally communicate with each other on many logic lines simultaneously. ``Paging out'' a section of circuit may leave a number of lines in an undefined state, with no knowledge of when the circuit needed to be paged in again when it was next needed. MMUs have no equivalent problem as they deal with essentially serial access to memory. The parallel nature of logic circuits makes it very hard to remove a part of a circuit without damaging the circuit to the point where it won't work at all. Clearly a more sophisticated technique than simple paging is necessary.

Worse than this, it's not really viable to locate pages at random in the logic array. The most scarce resource in a logic array is interconnect, and random placement of logic would lead very quickly to these interconnect resources being exhausted. It is convenient to be able to place virtual pages at random physical locations as this makes the paging algorithm simpler and more flexible, but some system of keeping related circuitry together as much as possible is necessary to keep interconnect small, fast and efficient.

Another factor which makes a logic paging system more complex than a memory paging system is the two dimensional nature of the data being paged. It is desirable to page sub-circuits in functional blocks, yet circuit synthesis algorithms will often create such blocks in ragged shapes which are very difficult to page efficiently. After a period of paging ragged-shaped circuits into random locations on a two-dimensional logic array much of the logic area would become unusable due to the jigsaw-like difficulty in slotting the circuits together. This is analogous to memory fragmentation in conventional computers, except complicated greatly by a second dimension.

6.2.1 Contour table

The Switchblade system implements a virtual logic system based around a modified concept of paging augmented with a special inter-page communication system. The paging system assumes fixed-sized pages, each page corresponding to a set two-dimensional area of the logic array. Sub-circuits of an algorithm - called ``contours'' - are constrained to use at least one page, but may use any number of pages up to the size of the entire logic array.

Unlike paged virtual memory systems where individual pages can be paged in and out at will, this is not possible in a virtual logic system where intra-circuit connections must not be severed. A single page is not the quantum of switching - instead a contour consists of a number of pages which are all switched simultaneously, as a unit. Such ``contours'' are arranged in a fixed two-dimensional layout, which as the name implies may be any shape. Generally speaking rectangular shapes are preferred as they will cause less fragmentation but no restriction is imposed by the system. Interconnection and layout of pages within a contour is fixed, and no special care need be taken when passing data between pages. Effectively each contour operates as a continuous, uninterrupted two-dimensional area of logic.

For performance and placement reasons the number of pages in a contour should be kept to a minimum - the larger and more irregular in shape a contour is the more difficult it is to place in conjunction with other such contours. Large, irregular contours will result in bad fragmentation when the system performance degrades - further contributing to the drop in performance. On the other hand the cost of interconnection within contours is lower than that between contours, so there is a tradeoff between speed and array usage. The synthesis system must make reasonable decisions about where to place the boundaries between contours. Synthesis is beyond the scope of this thesis so algorithms for achieving this ``reasonable'' tradeoff will not be discussed here.

The contour table keeps track of each virtual page in the system and what status it has. The status of all pages in a contour is the same. This may be one of four states, as shown in figure 6.1:

Figure [6.1 - "The virtual logic hierarchy"]

A small algorithm should be able to operate entirely ``hard'' - all in logic at once and running at maximum speed. Larger algorithms which rely on Switchblade's multiple configuration stores will be have some contours ``hard'' and some ``stored'' at any given time. The system will cause a ``fault'' every time data is sent to an unavailable contour, perhaps causing it to be switched in.

A ``soft'' or ``swapped'' contour implies one of two things - either the contour in question is little-used or logic resources are so tight that it has been paged out prematurely. In some cases it will never be useful to have these sections of code running in hardware - they may run so infrequently and take such a disproportionate quantity of resources that they're better run in conventional sequential code. Note that this does not imply a conventional general purpose processor to execute the contour off-array. Instead the ``soft'' contour can be sequentially executed by the logic array itself as described in section 6.7.

It may also be advantageous to map a contour in multiple parts of the logic array simultaneously. This allows it to be conveniently located for different configurations of the circuits around it. These multiple places might not be used at the same time, but can be selected in turn as the environment of circuits around them changed.

Estimates based on simulation and observation of existing circuits led to a choice of page size on Switchblade of four by four logic blocks. This was judged to be coarse enough to keep the number of pages per contour down to a manageable level but fine enough to avoid excessive loss on the perimeter of contours. It also results in the number of physical pages in the system being of roughly the same order as in a conventional computer for the size of logic array chosen. This may not necessarily be meaningful given the differing natures of the systems but it is a comforting indication that the choice is probably in roughly the right ballpark.

Paging of hardware should not be confused with memory paging. A complete system such as Switchblade will have both forms of paging available to it. A conventional MMU provides virtual memory functions to the processor in much the same way that it does to a conventional processor. Sequential processes use the MMU to gain paged access to a virtual address space containing their code. Data in virtual memory can be accessed by either sequential or hardware processes.

More interestingly, configuration information can be stored with three levels of virtualisation - initially configured in hardware, then switched out to an alternate configuration, then paged out of the logic array into memory, then paged out of memory on to disk. This many levels of paging hierarchy allows paging to occur ``softly'' - paging a contour down a single level of the hierarchy can occur without incurring excessive costs in paging it back in. Since resources at the lowest level - the logic array level - are very scarce, a large amount of high-speed paging is to be expected at that level so it must be handled quickly.

6.3 Dataflow interlocks

Communication between contours is more complicated and slower than communication within contours. This is due to the fact that the data sources and destinations for a given contour may or may not be paged into the system at any given time, and even if paged in might be located anywhere on the array.

A dataflow approach is used to resolve this issue. Peripheral logic lines are regarded as data sources and sinks and contours are regarded as functional units. Data are passed between two functional units directly if they are both paged in, otherwise a ``fault'' is generated when an attempt is made to pass data to a contour which is not available.

When a fault occurs the data must be passed somehow to the destination contour. The most convenient case is if the missing contour is just in an inactive configuration store somewhere on the logic array. The paging algorithm will cause this configuration store to become active and will ensure that valid interconnect exists between the two contours. When the new contour is available the waiting data will be presented to it and computation will continue.

If the destination contour is not available in the configuration store one of two things may occur - either the new contour may be configured in or it may be executed sequentially instead. Which of these is done is entirely at the discretion of the paging algorithm.

Existing between contours are the mechanisms which provide this ability to do asynchronous inter-contour communication and to raise faults when necessary. This feature is designed to give the advantages of direct hardware connection for heavily-used communication lines along with the economy of a routed message passing system for less frequently used ones. This mechanism is described in detail in section 6.5.

In essence, contours communicate with each other by means of message ports. These message ports may be configured to communicate either in a connection-based fashion or in a datagram fashion. They transfer a configurable number of bits with each transaction, and have a handshaking mechanism to control the flow of data. They can also be configured to send fault messages to the operating system under certain conditions. This message passing mechanism provides all the necessary interlock features for communication between contours.

6.3.1 Mutual exclusion

In any parallel processing system some form of mutual exclusion is required to ensure that multiple processes don't modify data at the same time. Within a contour this is not a problem as the logic circuit can be designed to be internally consistent through the use of synchronous or timed logic. Communication between contours however can cause synchronisation problems as all inter-contour messages are asynchronous, making it impossible to be sure in advance exactly what order events will occur in.

The message passing mechanism provides the fundamentals of synchronisation between tasks. A resource which subject to competition will have an access control contour to prevent uncontrolled contention for the resource. This controlling contour will often be very simple indeed as the message port interface provides an inherent mutually exclusive arbitration mechanism.

6.4 Support for paging

The system is designed around the ability to switch code in and out of hardware, as well as in and out of configuration store and in and out of memory. This unusual four-level paging system requires quite a complicated paging algorithm, so it is important to provide it with as much information as possible about the operation of the system so it can make sensible decisions about how to operate.

It is quite difficult for the operating system to determine if a contour is performing a computing function or if it is just in a wait state. Contours in wait states are obviously prime candidates for paging out, but without accurate knowledge of how active contours are it is difficult to make these decisions correctly. The following section describes some methods which can be used to gather this information in support of the paging algorithm.

6.4.1 Identifying latent contours

The most direct hint that can be provided to the operating system is when a contour attempts to write on a port and is unable to due to the read port's queue being already full. This causes a fault which the operating system can use to switch the contour out if it wants to.

Note that this doesn't necessarily mean that the contour isn't doing anything - it might well be still computing even though its write is blocked. Still, in most cases a blocked write will mean that the contour is unable to proceed so an arbitrary rule is made that being blocked on any write port is implies a contour's willingness to be paged out. This affects the way that contours are designed - generally speaking buffering between contours is better left to operating system buffers rather than providing the buffers in contour logic. This way the operating system can understand better whether the contour is idle and will cause fewer cases where the contour is switched out when it shouldn't have been.

Correspondingly, when an overflowed queue is emptied to below its low-water mark a fault is sent to the operating system to request that the communicating contour be switched back in. The paging algorithm may or may not decide to page the contour in immediately, depending on load. It might for instance elect to continue executing the contour as a sequential process.

The low-water mark depends on the implementation of the read queue. The default read queue has only a single buffer entry available so its low-water mark is reached when this single entry is emptied. This will cause a fault to be sent which will in turn cause the source contour to be paged back in and hence prompt delivery of the following message (which originally overflowed the queue and caused the contour to be paged out). This means the buffer isn't likely to be empty for long, but a buffer overflow fault won't occur until yet another write is attempted by the source contour.

Each read port or write port can have a counter attached to it which keeps track of the number of transmissions on that port. This can be used to estimate how heavily a contour is being used. When the counter reaches its maximum it stops counting until cleared by the operating system. The size of this counter is configurable, but often a single bit will suffice - indicating that the port has been used rather than giving detail on how many times it has been used. The decision of how wide to make these counters depends on how often the paging algorithm is going to check and clear them. A wider counter is more useful if the system is checking infrequently.

6.4.1.1 Active notification

All of the methods mentioned to this point are imprecise - there are cases when contours will be misidentified as latent and will be erroneously switched out. This could affect the performance of the system adversely or in a pathological case cause an almost-deadlock situation where two communicating contours keep each other's read queues full and cause themselves to be switched out before they have a chance to clear their own read queues. This case would eventually clear itself due to the finite time that it takes the operating system to switch tasks, but the performance would be very poor.

A far more effective way of keeping the operating system informed of the status of a contour is to have the contour provide that information directly. If a contour is incapable of continuing it can send a ``wait fault'' to the operating system saying that it wishes to be paged out until further notice. Note that this fault will not necessarily cause the contour to be paged out immediately - it only provides advice to the operating system that it is a good candidate for paging. The paging algorithm may have other information which causes it to choose not to page the contour out - such as there being no great constraint on logic array space or statistics indicating that this contour is likely to be used again shortly.

Similarly, if a contour doesn't want to go to the effort of working out when it is stalled and explicitly requesting page-outs it can perform an equivalent function by doing the reverse - telling the operating system when it's busy. This can be done using the normal message counter mechanism in a slightly unusual way - a message port is configured to carry no data and to lead nowhere. Its only function is to increment a port usage counter or set a usage bit.

In fact ports without any data are not uncommon - they can be used to handshake with other contours. This can be used to pass control to a successor contour, for instance. A port which transmits to nowhere is more unusual - it requires a special address which is resolved by the operating system to only connect to a counter and not involve transmission at all.

This combination of features - the wait fault and the activity message provide enough information for the paging algorithm to do a good job of selecting contours to page out. They do however require either explicit code from the programmer to support them or part of the programming system to automatically generate the appropriate hardware. This is discussed at greater length in chapter 4.

In practice, while the basic mechanisms for all these techniques were implemented on the Switchblade simulation, operating system support was only included for the more convenient ``usage bit'' method. Algorithms to support more techniques could have been added but as part of the operating system these were judged to be a subject for later research beyond this thesis.

6.4.2 Bringing contours back in

There are normally only three conditions which will cause a contour to wish to resume execution after it has been paged out. If the contour was paged out because of a blockage on a message write it may be paged back in when the blockage is cleared. If the contour was sitting idle without sufficient data to keep it running it may be paged back in when a message is sent to it. Alternatively the contour may have been paged out simply because the logic array resources were becoming scarce and the system is waiting on processor resources becoming available before the contour can be paged back in.

All other wait conditions are variants on the three above circumstances. A contour waiting on disk I/O for instance is more immediately waiting on a response from the disk device driver, which will come to it in the form of a message on a read port.

6.4.2.1 Wired down contours

It can be argued that hardware interfaces have access to a different form of I/O and hence they might find it useful to be able to wait on that stimulus specifically. This brings us to an important variant on contours - contours that are ``wired down''. These contours are part of the operating system and are designed to operate in a specific physical location in the logic array - usually in association with an I/O port. They can not be shifted in location or paged out. Hardware interfaces are ``wired down'' in the interests of quick response and so that their access to hardware will never be broken by their being paged out. Often these wired down sections will only be small interfaces which send messages to more complex device driver contours through the normal message passing mechanisms.

6.4.2.2 Selective waiting

A contour which has emitted a wait fault to the operating system will have a specific reason for doing so - it may be waiting on a blocked write port clearing or it may be waiting on more input becoming available on a read port. The wait fault message which it sends can specify exactly which stimuli it is interested in - or rather, which read or write ports it's particularly interested in hearing from. Other stimuli will be ignored, or rather will not cause it to be paged back in immediately.

If a message on a read port is ignored it may still indirectly cause a contour to be loaded. If a number of successive messages on the port are ignored the read port's queue will fill and may cause another contour to become blocked. This may result in the paging algorithm choosing to page the contour back in, in an attempt to clear the blockage on that read port.

6.4.2.3 Avoiding ambiguous states

Contours which have been chosen by the paging algorithm to be paged out of hardware are not necessarily unable to continue execution even if it appears that way to the paging algorithm. It is quite possible that they have been poorly selected and are ready to continue operation at full speed. A contour may have been paged out due to a blockage on a write port, but in fact be capable of accepting more data on input and performing useful computation on that. It is very difficult to determine if contours are in these kinds of states.

One solution is to occasionally run paged out contours as software tasks rather than hardware as described in section 6.4.3 - the software version of the code can quickly determine if the code has reached a state where it cannot continue without a change in external stimuli. If this state has been reached the operating system can safely place the contour on a ``blocked'' queue and wait for appropriate input.

Often a contour will not be able to continue computation until a number of conditions are met - usually until messages have been received on a number of read ports. A possible optimisation is to prevent contours from being paged in until all their required stimuli had been received. This prevents cases where a contour was paged back in on receipt of a message, only to be paged back out because another message was required for the contour to be able to continue. This has been empirically tested for hand-coded cases but it isn't clear whether it's practical to use this technique with automatically synthesised contours.

6.4.3 Sequential code

Unlike paging systems on conventional processors, there are actually several levels of paging in this architecture. Code is not only paged between memory and disk, it can be paged from memory to the logic array, and paged from the logic array to backing configuration store.

Another option with this flexible system of paging is to take a contour which has been executing in hardware and continue its execution in software. This is made possible by the fact that both hardware and sequential software versions are available for all code. Code which is executed infrequently is a prime candidate to be paged back to sequential code. No special hardware is provided to aid this transition - the function is provided by the operating system.

More detail on sequential execution is given in section 6.7.

6.5 Routing

Routing between logic blocks takes three forms - local routing, medium-distance routing and contour routing.

Local routing and medium-distance routing provides connections between logic blocks within a contour. Local routing connections lead directly from each logic block to adjacent ones, and are the fastest option. Medium-distance interconnect can be used to carry signals efficiently over longer distances, still within a contour. Both these forms of interconnect are described in the previous chapter. They have no higher-level function - they act as wires connecting the logic blocks, and have no ability to raise a fault to the operating system.

Since configured routes share the same configuration store with everything else, it is possible to reconfigure routes directly from the logic array. One way that this is useful is in creating applications where routes are dynamically reconfigured at run-time. Alternatively configurations can be switched-between under the control of the hardware, allowing rapid changes between entire circuit layouts.

6.5.1 Contour routing

Contour routing offers a whole new form of communication within logic arrays. It operates at a higher level than the other types of routing, providing location independent routing of messages between contours. This effectively virtualises the logic array, making it a homogeneous substrate upon which contours can be placed and relocated transparently.

Both local routing and medium-distance routing are intended for use within contours only. Within contours we can use a circuit synthesis algorithm to create an efficient layout at compile time. Between contours we are reliant on real-time routing which is much slower and more expensive of resources.

The form of communication between logic blocks within a contour also takes a very different form from communication between contours. Within a contour an algorithm is represented directly as a logic circuit, with communication passing down logic lines from anywhere to anywhere else at the whim of the synthesis software. Communication between contours uses a dataflow model as described in the next section.

Circuits within a contour are location independent in as much as they can be placed anywhere on the array and still operate in an internally consistent manner. Their function is not dependent on their location in the array (with the exception of I/O, of course). What prevents them from being individually relocated is the ability to then connect them back into the larger scheme of things. Contour routing fills this gap in current architectures.

6.5.2 Dataflow communication

Contour routing is more flexible than local forms of routing, at the cost of speed. Transmission from one contour to another is independent of both their locations in the logic array, and independent of their location relative to each other. This allows the paging algorithm to place contours wherever it likes within the constraints of routing resources.

Between contours the form that communication takes is based on the transfer of data between contours rather than the logic-levels usually used within a contour. It relies on dataflow analysis from the compiler to define the form of communication to and from each contour. Transmission of a datum is a tightly-defined action involving a group of logic lines being set to a state at a particular time, rather than just allowing logic lines to change state when they will.

Since dataflow deals with quanta of data which are ``results'' - i.e. numbers, strings or whatever the algorithm deals with - these are the appropriate quanta to pass between the contours. Rather than allowing logic lines to change independently of one another, an entire result is sent as a package.

This is not to say that the system models a dataflow architecture - rather dataflow analysis is a suitable technique for dividing an algorithm into contours to make it convenient to pass results around in this form. Communications can take place at the higher level rather than using the free-form logic routing used at the lower level within contours.

Another advantage of passing data packaged in this way is that it permits a whole range of functions not normally possible with conventional logic routing. Normal logic lines need to be connected all the time - who is to say what value a logic line has when it isn't connected, or whether a clock signal can safely be left disconnected from a circuit which is not in use? To permit dynamic paging-in of circuits when they're needed it is convenient to be able to discriminate when actual solid results are being passed from one contour to another. The dataflow technique provides just this.

An incidental effect of the dataflow nature of communication in the contour routing system is it makes ``demand loading'' of contours possible. If initially only the starting contour is paged in it will execute in isolation until it sends a result to another contour. Computation will then be blocked until the new contour is available, at which point processing will continue until either of the contours sends a result to an unavailable contour. This process will continue until a working set of contours is established. This is just one facet of the way contour routing virtualises the Switchblade logic array.

The message ports used by contours to transmit and receive data are really connections to a nearby routing junction. When a contour attempts to send a result to another contour, the junction has the job of forwarding the result to the appropriate destination or raising the appropriate fault to cause the destination contour to be paged in.

6.5.3 Junction structure

At the lowest level routing junctions communicate with a small number of logic blocks in their vicinity. These junctions have access to a longer-distance routing grid which they use to communicate between themselves. Physically speaking, junctions are a thin section of specialised logic between the normal logic spaced at every eight logic blocks. They have their own hierarchical interconnect structure.

The junctions in Switchblade are divided into groups, each of which has its own internal communication structure. Communication between groups is performed by another larger of junctions above the lower level, using another layer of communication lines. This hierarchical structure can continue theoretically forever, with each successive layer providing longer-distance communication over slower links. The higher-level links are also a more rarefied resource, and should be used sparingly. The paging algorithm has the task of ensuring that communicating contours are placed as close together as possible, both to reduce propagation delay and to conserve the higher-level routing resources.

Figure 6.2 shows the physical layout of the routing circuitry on the die.

Figure [6.2 - "The die layout showing virtual routing hardware"]

Junction groups provide areas in which communication is inherently faster. Since large contours can easily bridge junction boundaries, it is possible for contours to communicate with other contours in multiple junction groups simultaneously without using higher levels of junction group.

Central to each group of junctions is a parent junction which forms part of the higher-level junction structure. Rather than providing connections to surrounding logic blocks as the low-level junctions do, these higher-level junctions communicate only with other junctions and hence have a rather different structure.

These ``branch junctions'' have no message port interfaces, so they are free of the interfacing complexities associated with these. In their place they have sixteen child junction interfaces. These interfaces have two streams, one to send messages down to the child and the other to receive messages from the child.

A data path between two contours could be constructed by creating a series of connections from the data source to the destination to provide an electrical connection from the source to the destination. This is equivalent to a circuit-switched telephone call, and is the conventional scheme used in logic arrays. It does however have the disadvantage of tying up scarce routing resources for the entire period of the communication.

Rather than treating communication between contours as a logic operation, Switchblade uses the more efficient scheme of packet switching. Packet switching is a slower process than propagation of a signal down a logic line, but it permits multiplexing of many communications on the one set of inter-junction logic lines. While heavily used data paths can use circuit switched routing for speed, most data paths are used relatively infrequently and won't suffer greatly from being packet switched. At the same time this will produce an enormous saving in routing resources.

The hierarchical routing system used was chosen for its relative simplicity and practicality. More sophisticated routing systems are possible and may be a fruitful area for future study.

6.5.4 Addressing

When a contour is initially created its references to other contours are based on their function rather than their physical location in the logic array. The other contours that a contour communicates with may be located anywhere on the array, or may not even be on the array at all if they're not paged in.

For this reason messages between contours are based on virtual address, where the virtual address specifies the type of contour a message wants to be delivered to and the port on that contour that it should go to. The routing system translates the virtual addresses into a path it can use to deliver the message using what is essentially a small routing table at the parent junction of each junction group.

The virtual addresses used are formatted in the style of password capabilities [AndersonPose86], containing a random component to ensure the ``unguessability'' of message port addresses. The capability access control system is described in section 6.6.

Routing tables in successively higher levels of junction are correspondingly larger since they must contain all the entries of all their children. This is a classic problem in network scalability, and no new method for solving this problem is presented here. A common technique is to allocate virtual address space based on general physical areas. This is not practical on a virtualised logic array as it is necessary to be able to place contours anywhere. A more applicable technique is to have a high-speed route cache in each junction which keeps the more commonly used routes handy, while still having the more obscure routes available if they're needed.

The solution used here is to have a software contour table which provides the definitive reference for routing. When an attempt to send a message fails due to a routing entry not being available a fault is raised which causes a handler to check the contour table and insert the appropriate routes into the appropriate junctions' route caches.

6.5.5 Message passing

Message passing involves sending a result from a message output port to a message input port. In the most trivial case these ports can be connected directly together by logic lines. This provides a high speed form of transmission but is as expensive of routing resources as conventional logic array designs. For low bandwidth or long distance communication paths the result will be taken by a junction, transferred through the routing system and will end up at an input port. The entire transaction looks to both sides as if they have been directly connected except for the time lag involved in the transfer.

Message ports are asynchronous. Given the uncertain distance to the destination and uncertain propagation delays which result, synchronous messages are not effective in communicating between contours. Within contours of course a great deal more is known about the structure of the circuit so synchronous communication can be used without problems.

A message output port consists of a number of ``result'' output lines, a ``send'' output line and an ``acknowledge'' input line. When a result is ready on the output lines the send line is strobed. When the result has been received the acknowledge line is strobed by the recipient and the sender may prepare to send again. This relatively conservative scheme ensures correct operation over a wide variety of response conditions, from the quick response of a directly-connected neighbour contour to the slow response of a system which is thrashing and may not even have both communicating contours in logic at the same time.

This system is similar in concept to conventional asynchronous logic. Instead of providing individual asynchronous logic lines however it operates on groups of lines, and only for inter-contour communication.

When a junction receives a message it firstly checks its routing table to see if it knows what to do with it. If it doesn't, it buffers the message, acknowledges it, and sends it to the parent junction. If it sees that it already has a direct connection from the source to the destination it takes no further part in the transaction as it will already have been acknowledged.

If the junction finds that the message is destined for a logic block connected to the same junction it presents the data to the appropriate inputs on the logic block and passes the acknowledgement back. If no acknowledgement is forthcoming in short order it will latch the data and do the acknowledgement itself. If another message arrives before the destination contour acknowledges the first message the junction will raise an exception to the pager indicating that the source contour can be paged out until the destination contour is ready to receive more data. This is one basic mechanism for selecting contours to page out.

Note that the junction system also solves the input retiming problems which are often encountered in multiple-configuration logic arrays. Rather than providing input registers on a per-logic-block basis such as TSFPGA does [Dehon96a] [Dehon96] the buffering between contours is provided by the junction system on a more economical per-contour basis.

Each routing entry has a small counter associated with it which tracks the number of messages to a particular port. These counters are periodically checked by the paging algorithm to determine which ports would most benefit from faster transmission. If routing resources are available, some routes will be replaced by direct hardware transmission.

If a message reaches the top of the junction hierarchy without finding a destination it can be assumed that the destination contour is not currently in hardware. This means that it needs to be paged in or alternatively that it should be run in sequential code. Either way it passes out of control of the hardware at that point and the source contour becomes a candidate for paging out if it attempts to send another message to the same port before the destination has been able to receive it.

Another possible cause of a message failing to be routed to its destination is if routing entries for that destination are not available at all of the junctions from the source to the destination. This happens if the routing cache in one of the junctions has expired the necessary entry. In this case the contour table must be consulted to determine which path the message should take.

Message ports can theoretically have any number of data bits associated with them, though in practice this means that a large message port may be split over multiple physical port blocks. The size of a port cannot change over its lifetime - an individual port always transmits or receives the same number of bits.

Messages need not contain any data. A message port with no data bits is merely a handshaking mechanism. This is useful for tasks which use a number of contours in sequence. Each contour passes control to the next contour by sending an empty ``finished'' message.

6.5.6 End-to-end flow control

The message passing mechanism provides reliable, in-order delivery of messages between contours. Order is guaranteed on a point-to-point message stream but not necessarily in a parallel system where messages are being sent from several places almost simultaneously to a single message port. It also provides end-to-end flow control in the case where a message is blocked from reception for some reason.

These blockages usually occur if the receiving contour is not accepting data as quickly as it is being sent. In the trivial case of two message ports wired directly together the data is passed directly from the source contour to the receiving one and the handshaking lines are connected together. The handshaking lines provide flow control directly from each contour to the other.

For longer-haul transmissions the receiving contour provides a single buffer. A message received by the destination junction is latched and presented to the receiving message port. The handshaking lines from the receiving message port are used to clear the buffer and prepare for reception of another message. If the buffer is already full when another message comes in flow control is required to prevent messages being lost.

Communication between junctions requires a slightly different protocol to that used between message ports. When a message is sent from one junction to another three handshaking lines are used - ``send'', ``acknowledge'' and ``blocked''. The ``send'' and ``acknowledge'' lines perform much the same function as described earlier. An alternative form of acknowledgement - ``blocked'' - allows a junction to refuse reception of a message if it is unable to deliver it.

Each level of junctions in the hierarchy provides this same form of flow control, so a message stream which is travelling between n junctions to reach its destination can potentially have up to n messages backed up waiting for delivery, one buffered at each junction along the path. This is shown in figure 6.3.

Figure [6.3 - "A long buffering stream"]

A message travelling from one part of the array to a distant one will travel up the hierarchy for a number of junctions and then travel down until it reaches its destination. Since messages can be backed up at each of these steps buffers are required both in the upward-travelling direction and the downward-travelling direction.

6.5.6.1 Downward buffering

Downward buffers are attached to routing table entries. Each routing table entry normally contains information on the virtual address it is routing and where to route it to. Adding buffering to this means that each routing entry must keep track of the entire address of a message, the contents of the message and some flags to aid in flow control.

If a junction attempts to send a message down to a child junction which responds with a ``blocked'' acknowledgement signal the junction must retain the locally buffered copy and flag it in the routing table as being blocked from transmission. Meanwhile the child will have flagged the routing entry as ``waiting above'' - ie. the parent junction is waiting on this entry becoming unblocked. If another message for the same destination is received by the parent junction it will mark its own ``waiting above'' flag and respond with a ``blocked'' message itself.

When a leaf junction is given the signal that the destination message port has accepted a message it unblocks its routing entry. If the ``waiting above'' flag is set it will send a special ``unblocked'' control message to its parent junction notifying it that the route is now unblocked. The parent junction does not pass this message on - it only uses it to clear its own blockage flag if it has one. If it does clear a blockage flag and the ``waiting above'' flag is set it will generate its own control message, and so the flow control signal will propagate back through the system.

``Unblocked'' control messages are similar to normal messages except that they use a special bit sequence which identifies them as a control message. They contain no data bits - their address is sufficient to flag that the route has become unblocked.

If a junction receives an ``unblocked'' control message for a virtual address which isn't in its route table the appropriate route entry has probably been removed by the operating system after the route table filled up. This is not catastrophic - it causes a fault to be raised to the operating system which can then deal with replacing the route entry from memory and clearing the blockage.

It is important that a queue of traffic for one port not block transmissions to other ports. The flow control scheme associates buffers with the addresses in routing entries. This means that if a message is routeable it is also bufferable and only messages to that particular destination capability will be blocked. Messages to other ports will go via other routing table entries and so will be unaffected.

A disadvantage of this scheme is that message ports which share a common capability (see section 6.6.5.5) will be blocked as if they were a single port. This means that some care must be taken with the design of such interfaces to ensure that the situation doesn't arise where one port is blocked with a queue of messages but can't be cleared until a message is received on another message port with the same capability. This will cause the port to become deadlocked.

For this reason it is advisable when dealing with message ports sharing a common capability that each port address being used be converted to a location dependent capability first. An operating system call is provided to perform this conversion, and after conversion transmission is performed using a different method which does not suffer from this limitation. Since location dependent addressing doesn't use the routing tables for buffering there is no problem with a single table entry being used for buffering of multiple message ports.

6.5.6.2 Upward buffering

Buffering of messages which are passing up the hierarchy behaves in a similar manner to those moving down the hierarchy. The difference is that messages being passed up do not, by definition, have a route so they can't be associated with routing table entries.

The solution to this problem is to create a ``blocking cache'' which retains the same information about blocked messages for upward transmission that the route cache keeps about blocked messages for downward transmission. The blocking cache is a small fully-associative lookup table keyed on the capability password.

The information kept in the blocking cache is different from that kept in the routing entries only in that the ``waiting below'' data has a bit for each of the subordinate junctions rather than the single bit ``waiting above'' flag in the routing table. Each of these bits keeps track of whether a particular child junction is waiting on a blockage being cleared for that capability.

When a blockage is cleared the junction may elect to unblock any one of the blocked children. A round-robin service order is used to ensure that preference is not given to a particular child junction.

When a message with a previously unknown capability arrives from a child, it is automatically entered in a free entry of the blocking cache. If no free entries are available a fault is raised to ask the operating system to free up some space in the cache. This is equivalent in frequency to the fault raised when a route is not found for a message to pass down the junction hierarchy to its correct destination.

The size of the blocking cache should be roughly equivalent to that of the routing table as there will be roughly equivalent traffic in each direction. While the routine tables and blocking caches may sound expensive they are uncommon compared with logic blocks so they do not consume as much of the logic array resource as might be supposed. The prototype Switchblade architecture devoted an estimated 10% of its total area to the virtualisation and routing system. Part of the reason why this space consumption could be kept down to a reasonable level was the frequent binding of virtual addresses to location-dependent addresses at run-time.

Note that location dependent addresses have a rather different form from normal capability addresses, and so buffering is controlled in a rather different manner.

6.5.7 Contour table

The contour table has a twofold purpose - to track the placement of contours on the logic area and to allow routes to each message port of each contour to be calculated.

When a junction's routing cache fills, the addition of a new route means that one of the routing entries must be discarded to make room for the new entry. When a message is sent which would normally have used the discarded routing entry it will be forwarded up the hierarchy of junctions until it is found to be unrouteable at the top junction. This will cause a fault to be raised.

The fault handler consults the contour table to construct a new route. The contour table is a sparse table - it maps from the random capability addresses to the physical locations on the logic array where each port resides. Since the table is sparse it is implemented as a hash table. The page fault handler uses the data from the table to ensure that each downward step on the route has an appropriate routing entry. Steps up the hierarchy don't need a route since messages are routed to the parent junction by default.

If adding a route entry causes a routing table to overflow the operating system must decide which entries to discard. The contour table also plays a role here - it keeps track of which message ports are in use and their approximate levels of activity. These statistics are collected using the port usage counters and by similar methods to those used in conventional memory paging systems.

The contour table and its importance to the operating system are discussed in more detail in section 6.2.1.

6.5.8 Mutual exclusion

An important feature of any parallel programming system is the ability to synchronise access to common resources. Hardware devices for example should not be randomly accessible by any program in the system since simultaneous access by multiple programs may cause invalid control signals to be sent or even damage the hardware. It is necessary to provide a method to allow programs to access resources without stepping on each others' toes.

Within a contour synchronisation mechanisms can be efficiently created directly in logic. Timing between contours on the other hand is not guaranteed, so some other mechanism is needed for synchronisation in this case.

The message passing system provides just such a mechanism for mutual exclusion. Since all inter-contour communication takes place using the junctions, these same junctions can provide the facility to arbitrate competition for access to a port.

By default the operating system automatically arbitrates access to message ports on a one-to-one basis - when a contour is started its read ports will all be marked as available and each can be claimed by one and only one write port. Any attempt to start a contour which needs a port which is already in use will result in that write port not receiving acknowledgement until after the port has been released, it has successfully connected and an acknowledgement has been received. In fact the paging algorithm may elect not to start a contour at all if it detects that not all of the required ports are available yet.

Not all resource requirements persist for the duration of a task's lifetime. A task may desire to request access to a port for a short period, and then return the resource for other tasks to access. This arbitration demands a slightly different kind of message port which is capable of requesting access before attempting transmission. This involves two more signals - ``port request'' and ``port acknowledge'' - in the write port interface.

When the port request line is raised a message is passed to the destination junction requesting exclusive access to the destination port. If the port is unclaimed the junction will claim it and reply with a port acknowledge message. Although the write port sees this arbitration through a two-line interface it is completely invisible to the read port as the system performs the arbitration itself.

If a requested port is already claimed the system will stall the port acknowledge signal until the request can be satisfied. This means that a large number of requests could potentially accrue for a port which is heavily in demand. The mechanism for dealing with a queue of port requests is similar to that for dealing with a surfeit of messages to a message port. One request is queued at the destination junction and any more attempts to request access are refused by the junction, causing them to be queued a step further up the line. If enough requests are queued eventually attempts at port requests will be unsatisfiable directly from a write port and a fault will be raised to tell the operating system that the source contour can be paged out.

Not all write ports have the port request feature - the logic is only dynamically created for ports which require the feature. The mechanism for this functionality-on-demand is described in section 6.5.11.

Read ports which are very heavily requested may want to provide a deeper queueing mechanism. A first-in first-out (FIFO) buffer at the read port can accept a larger number of pending requests and satisfy them more quickly than would otherwise be possible. As with the port request logic, this functionality is available as a hardware operating system extension.

6.5.8.1 Shared access

The mutual exclusion facility described does not provide for anything except for simple ownership. There is no inbuilt facility for shared access, or for the common variant of shared read but exclusive write access.

Generally such ``shared access'' systems are actually sequential access systems at the lowest level anyway. Sequential access can be easily achieved by doing accesses which request and release the port with every transmission. This may be inefficient due to arbitration overheads however, so a better alternative may be to provide a read port for each of the competing write ports and provide logic for whatever arbitration scheme is desired in the receiving contour. This is at the discretion of the application programmer.

It is quite possible that the programmer of a contour may not know how many contours will be competing for its resources and so will be unable to provide the appropriate number of ports. In that case there is one other possibility - ports which explicitly do not require arbitration. These ports do not accept normal messages, they only accept fault messages as described later.

6.5.8.2 Fault read ports

Fault ports are used by the operating system to receive problem messages generated by the system. These are analogous to interrupts in conventional general purpose processors. Faults are generated under a variety of circumstances relating to the virtual logic and message transmission system. This is not their only application however - any program can generate or receive fault messages for whatever purpose it likes. A fault message is really just a datagram message which contains both source and destination addresses. Faults are described in more detail in section 6.5.9.

Fault read ports are just like normal read ports except that they do not require arbitration before access and they normally send the port address of the transmitting port along with each received result.

By definition fault read ports will always accept messages asynchronously from anywhere in the system, so they cannot be said to arbitrate access. Some user-defined arbitration mechanism could be layered on top of fault ports to deal with complicated shared access requirements if the individual application required it.

6.5.8.3 Deadlock avoidance

Most read ports in the system are matched with a single write port at compile time so there is never any competition for access. Ports which use the port request feature can make their access requests at random however, leading to potential deadlock situations.

One situation is if a contour has two read ports which are used to accept separate parts of a single request. If two different contours were to simultaneously request connection to this service and through random bad luck in the timing of their requests one gained access to the first port while the other gained access to the second, they would both end up waiting indefinitely for the other to release its claim on the opposite port.

The simplest way to resolve this type of deadlock situation is to have only one read port which combines the function of the two smaller read ports. Since there is no size limit on ports this is an effective answer in cases where combining ports has no unfortunate side-effects.

Another way is to ensure that all port requests are made in a predefined order - such as by increasing order of contour address followed by port address. This might be effective, but is difficult for programmers to achieve since the numbering of contours is chosen by the system and is not explicitly available to the programmer. Even if this information was made available it would be inconvenient to incorporate and the system-chosen numbering could well imply an order of resource use which was not actually what was desired the program. This method was not used in the Switchblade prototype as it was deemed too inconvenient to the programmer.

The system provides a mechanism for reserving a set of resources as an indivisible operation. This is not a hardware function - it is provided by the operating system. It attempts to claim all the ports which have been requested and if any request fails it immediately releases any which it has succeeded in claiming. It then delays until the offending port is released and tries again.

This is by no means a foolproof scheme but it does reduce the danger of deadlock greatly in many situations.

6.5.9 Faults

Faults are used by the system as an equivalent to the hardware interrupts found in most general purpose processors. Unlike in general-purpose processors these faults can be handled either in hardware or software, or a combination of the two.

Faults take the form of messages, and have a similar form to that of normal inter-contour communication. The most important difference is that while most port-to-port communication is one-to-one, faults can be generated from anywhere in the system. To identify where the problem occurred the address of the source contour is sent along with the message. This information is used in much the same way that stack context is used by an interrupt handler on a conventional processor.

Like interrupts in normal processors, faults can be generated internally, externally or in code. Internally generated faults include message queue faults (generated when a message port has a backlog of messages) and unrouteable message faults (generated when a message can't be routed to the addressed destination). Internal faults are generated by the hardware itself in response to a condition which the system has detected.

External faults are generated by an input logic level. They perform much the same function as an external hardware interrupt on a conventional processor, causing control to be transferred to a selected piece of code on reception of an external logic transition. Unlike hardware interrupts they do not stop execution of other code - since the logic array processor is inherently parallel in operation an external fault just means that the appropriate fault code will be executed. If it has some system administrative function to perform such as task switching it may end up stopping a task, but this is not part of the standard behaviour.

Faults can also be generated in code. In conventional processors a software interrupt or ``trap'' can be used to transfer control to a system routine. A similar effect can be achieved by generating a fault directly from a message port on a contour. This differs from sending a message on a normal message port only in that the address of the source contour is sent along with the message.

Faults are really just an extension of the inter-contour communications mechanism. They provide a facility for sending source address information along with a message. This changes the format of the message from (destination, result) to (source, destination, result). Just as interrupts are an extension of the subroutine call mechanism in a conventional processor, faults are an extension of the message passing mechanism in this architecture.

6.5.9.1 Fault masks

Interrupt mechanisms traditionally have an associated set of interrupt masks. These masks have two main functions - to prevent a low priority interrupt from taking precedence over a higher priority interrupt and to provide a simple mechanism for mutual exclusion between interrupt handlers and other code.

Given the parallel nature of tasks running on logic array processors there is no need to prioritise faults. Since multiple fault handlers can usefully operate at the same time there is no good reason to prevent them from doing so.

This is not to say that mutual exclusion between fault handlers isn't an issue. However the system already provides general primitives for mutual exclusion which are quite adequate to the purpose. With these primitives already available it isn't necessary to provide a specialised mechanism such as interrupt masks for the same purpose.

6.5.9.2 Fault vectors

Conventional processors usually have a numbered set of interrupts which cause calls to the appropriate interrupt handlers found in an interrupt vector table. The table provides a mechanism for translating a numbered vector into the memory address of the handler. No such mechanism is necessary in a logic array processor as messages to contours are already addressed by a virtual address. If a set of contour numbers are reserved for faults then the existing routing mechanism will automatically translate the address and nothing more need be done. Message ports need only be configured to generate a fault message with the appropriate fault number as the destination address.

A efficiency problem with this scheme is that routing information must be provided for faults as well as conventional messages. Since faults are often relatively rare their routes won't appear in the route cache and another fault will be generated to add the route to the route caches of all the junctions on the fault's path. This implies a considerable overhead which may well be wasted if the fault doesn't occur again within a relatively short period of time.

The ``Location dependent addressing'' feature provides a much more efficient way of handling this kind of access, resolving the virtual number to a hardware route. This translation is automatically performed by the operating system at its discretion. Location dependent addressing obviates the need for an entry to be created in the route cache of each junction along a communication path. ``Short addressing'' further optimises the address for specific operating system faults to a short form which uses less array space and is even more rapidly routed.

6.5.9.3 Saving state

An important feature of interrupts in conventional processors is that they save the state of the currently-running task and can return to it later as if nothing had happened. Unfortunately this kind of approach is not appropriate in a logic array processor. Since logic arrays are essentially parallel it is not clear whether tasks should be halted at all. Moreover, saving state is not just a matter of pushing a few registers on to the stack - the entire set of variables of the program could potentially be in hardware so saving the state may end up being a lengthy process. Even if the entire program state was saved, there is little evidence to suggest that this would serve any useful purpose anyway.

It is clear that faults and interrupts have quite different properties. Still the need to be able to alert the system to emergency conditions remains. We also need a way to prevent a program which has experienced problems from running out of control before the operating system is able to deal with the problem.

For this reason an option is available on fault send ports which both sends the fault message and halts the contour outright. The contour will remain halted until the operating system fault handler explicitly restarts it.

6.5.9.4 Halting contours

All contours have a master signal which tells them to halt. After this signal has been raised they will have halted all internal activity within a set period of time - 16 ``fast clock'' cycles in the test system. Once the contour is halted its state can be examined by a fault handler, swapped to backing configuration store or saved to memory. The halting mechanism is provided by the logic array itself in collusion with the circuit synthesis system.

6.5.10 Selectable destinations

The inter-contour communications mechanism provides for one-to-one communication between contours and many-to-one communication via faults. The remaining possibility is one-to-many communication - in other words a message port which can send messages to any message port address at will.

Access to this type of port opens a possibility for security abuses. The ability to send random messages to random ports could easily confuse the operation of an algorithm. Spurious messages sent to a contour will most likely cause it to malfunction. The capability security system described in section 6.6 provides some protection against this form of attack.

6.5.11 Extensible operating system functionality

One fundamental advantage of reconfigurable logic is that it can be generated as needed. This feature is used by the system to create system services only as needed.

For instance the junctions contain all the necessary hardwired circuitry to do normal routing and message handling. They can also internally generate faults. They do not have the ability to accept external or code-generated faults though and they also can't do dynamic destination selection. It isn't worth adding this extra functionality to every junction since these features are relatively rarely used.

These functions are provided by extra reconfigurable logic which is created by the system as required. The logic generated for these purposes is unique in that only the operating system is allowed to interface in these special ways to the junctions - no contour has this ability. This extra logic remains in the domain of the operating system - it looks to the contour just like a different message port interface.

The ``port request'' feature available optionally on write ports and the ``fault port'' types of read and write port are examples of extra functionality available through operating system extension. Another is FIFOs.

6.5.11.1 FIFOs

Another extended message port function is FIFOs. These provide buffering between contours.

In a situation where the source and destination contours of a result stream have very different timing characteristics a FIFO may increase overall performance greatly by storing extra results until they can be consumed.

For example take a situation where one contour produces results in brief spurts and then pauses for a while until it has calculated the next set. Meanwhile the destination contour processes results at a relatively constant rate. When the second result reaches the destination contour it will still be processing the first, so the source contour will be made to wait until it is ready to receive another result. This will continue with each of the results slowly being accepted while the destination contour processes as quickly as it can. In the meantime the source contour is almost completely idle.

Once the last result of the burst has been sent the source contour will go back into heavy processing mode for a while. In the meantime the destination contour will be idle.

Placing an appropriate-sized FIFO between these contours allows them both to run at full speed for most of the time. The source contour produces its first set of results and these are buffered by the FIFO. The source contour then continues with its calculations while the destination contour churns gradually through the previous set of results.

6.6 Security and access control

Modern processors provide the ability for the operating system to restrict the access of user programs. This prevents user programs from using resources which they do not specifically have permission to use and from interfering in the operation of other programs. This architecture is no exception. In fact the need for access control is even more pressing in the case of a reconfigurable logic array processor as a malevolent process could potentially damage the processor itself by configuring it with an invalid circuit.

The security model for a logic array processor must be quite flexible as the very nature of the processor allows some rather unusual operations. It should be possible for instance to modify a configuration on the fly, under the control of a user program. Yet it should not be possible to create a damaging configuration. It should be possible to transfer access rights to other subsystems such as libraries. Yet there should be some mechanism for preventing rogue processes from obtaining privileges they shouldn't have, and from interfering with the operation of the rest of the system.

In this section I will use the terms ``security'' and ``access control'' with slightly different meanings. ``Access control'' I will use to mean ``provision of resources to those tasks which need them'' and ``security'' I will use to mean ``preventing tasks from gaining unauthorised access''.

6.6.1 Capabilities

The basic method of access control is by capability. Each resource in the system has a capability address. Access to a resource is through its capability address. A capability address might correspond to a message read port, for instance. ``Writing'' to that address will cause data to be transferred to the read port of the appropriate contour.

Unprivileged tasks will not be able to send to that port by virtue of the simple fact that they don't know its address. Tasks can give away access rights to other tasks by passing them the capability addresses of whatever resources they want to give away access to.

Capability addresses have an equivalent in the folklore concept of ``true names'' - to use someone's true name was to have power over them. In the case of capability systems the ``true address'' is the only method to have control of a resource.

Note that simply giving away a capability isn't enough to ensure that access rights will be transferred - many access rights are considered non-transferable by the system. An example of this is the default style of message ports which are paired when a task is started and only become unassociated again when the task ends. This gives no other port a chance to muscle in on their communication.

6.6.2 Capability addresses

An essential feature of capability addresses is that they are hard to guess. To this end the operating system deliberately generates capability addresses as long pseudo-random numbers [WallacePose90] [AndersonPose86]. It is important that a good pseudo-random number generator be used as it should be impossible for even a malicious task to guess the sequence being used.

Capability addresses should be quite wide. The exact width used is a tradeoff between the system overhead of handling large addresses and the security risk of using shorter, more guess-able addresses. The random ``password'' number is usually just a subsection of the capability too - the remaining bits of the address may contain additional information regarding a subsection of the particular capability being referred to.

Switchblade uses capability addresses with a number of fields. A standard Switchblade capability has 32 bits of password capability address, two bits of capability type information, and 30 bits of sub-address. Often the sub-address is ignored or only a limited number of bits are passed through. Message ports for instance only use four bits of the sub-address, so only four bits are passed through the system to them.

6.6.3 Intrusion prevention

Even with 32 bits of capability password it is conceivable that a malicious task could try all four billion combinations in search of the key which will unlock access to a resource. Once it had access it could use the capability to gain unauthorised access to data, disrupt operation or gain access to further resources.

The defence against this is that each attempt to access an invalid capability causes a fault to the operating system. The transmission of an invalid capability in the hardware routing system will cause it to be sent step by step to the root junction as no junction will have a routing entry for it. The root junction will then cause a fault to the operating system saying that no route has been found. The operating system route handler will look up the capability in its contour table, find that no such capability has been allocated and raise a security fault.

Security fault handlers are responsible for detecting an attempted intrusion and preventing the malicious task from continuing its attempt. This is not as trivial as it may sound since it is difficult to identify which task was the malicious one. The task which passed the invalid capability to the system may not be the malicious task, it may merely be a dupe task which was handed the invalid capability by the malicious task. Terminating the task which passed the invalid capability to the operating system would leave the system open to a denial of service attack. Malicious tasks could pass invalid capabilities to any program which accepted parameters and cause them to be terminated when they tried to access the parameter capabilities. The scheme used in the prototype system was relatively simple and inadequate to prevent denial of service attacks.

Even though the prototype Switchblade system was overly simplistic in its operating system support for capabilities it achieved its purpose of providing enough architectural support to make the development of a true capability-based security system possible. This type of security system is a fairly well explored area so implementing operating system support for a complete one should be a relatively straightforward implementation task. Implementing a complete operating system is outside the scope of this thesis.

There are a number of special capabilities in the system which control access to fundamental system resources. Device handler capabilities are some of these - another is the capability which gives access to the junction configurations. Any task which has the junction configuration capability effectively has control of the entire system - the junction configurations contain most of the useful capabilities in the system in their routing tables. It goes without saying that such capabilities as that for junction configuration are closely guarded secrets.

6.6.4 Message port access

Generally speaking, each resource in the system has an individual capability. Message ports are no exception - they are by default addressed individually, with a different capability for each port.

A large system with many message ports will naturally have a large number of capabilities to keep track of, and each capability which is in hardware has a significant cost in routing table entries and message port hardware. Added to this is the problem of tasks which want to pass capabilities to other tasks. It may be desirable for them to pass access to a set of related ports with a single capability rather than having to pass a number of individual capabilities to a task.

For these reasons it is possible for a number of message ports in the same junction to share a capability password. They can then use sub-addressing to determine which of the message ports within the capability is being referred to.

6.6.5 Optimisation

Capabilities provide an effective and flexible method of access control. The disadvantage is that they are also quite expensive in hardware. Since capability addresses are very wide - 64 bits - the hardware necessary to handle such wide addresses for each message port could well be greater than the hardware handling the rest of the message port. To reduce this expense most message addressing bypasses the capability system altogether. While all user programs see the system as operating entirely on capability addresses, in fact many addresses are converted to hardware locations by the paging system when a contour is loaded into hardware.

6.6.5.1 Location-dependent addressing

This optimisation uses a different style of address which is more efficiently routed than the full capability address. Unlike capability addresses these addresses make no attempt to be location-independent - they only work when both ends of a message link are in hardware and the address is entirely dependent on the locations of the communicating contours in the logic array. These addresses only exist for as long as the two contours remain in hardware, and in those specific locations.

Most write message ports on a contour communicate directly with a corresponding read message port on another contour and are connected to this port for the entire lifetime of the task. The paging system is aware of the capability address of each message port and of the links between all the contours in a program. When a write on a ``hard'' message port causes another contour to be brought into hardware the paging system automatically converts the send address of the writing port to the location-specific address. These short-form addresses are transparent to user tasks - they are hidden in the configuration of the junctions and cannot be accessed by the user task at all.

Location-specific addresses are prefaced by a bit string which identifies them as being different from full capability addresses. The remainder of the address is in effect a routing path for the message to take through the system. Translated into English, this type of address might equate to for example ``up to parent junction, down to fifth child junction, send to ninth read message port''. This style of addressing is significantly shorter than a full capability address and doesn't require the same complexity of junction hardware as the full capability addresses. Since the paging system attempts to place contours in such a way that they are located physically close to the contours which they communicate with the addresses are kept as short as possible.

In fact since full capability addressing is the exception rather than the rule, many leaf junctions may not have hardware for routing capability addresses at all. The basic leaf junction hardware in Switchblade doesn't include the expensive translation tables necessary to convert capability addresses into port addresses. If such hardware is required it can be added to the junction using reconfigurable logic, and only for those ports which need it.

When algorithms are synthesised into hardware the intention is that all addresses which are static for the duration of a contour are automatically synthesised as location-dependent addresses. The actual location-dependent address for a message is configured into the message port by the operating system at load time or when the destination contour is moved into logic. The prototype system used an on-demand method for this, configuring port addresses lazily when an attempt was made to access them. The default unconfigured state of a message port caused a fault to be raised and the runtime loader handled the back-patching of the contour from there.

It is important to minimise the distance which messages must travel to reach their destinations - reducing the number of hops not only reduces the length of the address and hence the space used, it also reduces the transmission time.

6.6.5.2 Flow control with location dependent capabilities

Location dependent addresses have a rather different form from normal capability addresses. Since the format of the addresses specifies the path taken messages are freed of the need to have corresponding routing table entries. This lack of routing table entries makes buffering and end-to-end flow control for location dependent addresses rather problematic as this functionality is otherwise provided by the routing tables. Adding an equivalent per-junction buffering mechanism is overly expensive of logic resources and would defeat the purpose of using lightweight location dependent addresses.

Instead, an entirely different method of flow control is used. Each message is sent and acknowledged immediately at the sending message port as usual. Buffering and queueing happens only at the receive port, and it is assumed that ``enough'' buffering will be provided to deal with most cases. Optionally-configurable FIFOs are used for this purpose.

When a message arrives at a blocked receive port the only way of dealing with it is to prevent the sender from sending any more until the port is clear again. There is no direct way of doing this, so a bit is set in the receive port to indicate that it has overflowed and a fault is raised to the operating system. The fault handler is normally in hardware since it must act quickly to prevent too many more messages from being sent to the blocked port.

The fault handler's first action is to set a bit in the send port, blocking it from further transmission. The actual behaviour of this mode is to prevent sending of any messages and to not provide the usual acknowledgement on the send port. The handler then queues in memory any messages from that source port to that destination port. A number of messages may have been transmitted before the source port is disabled so the operating system must provide for reception and queueing of a number of messages.

When the receiving message port is cleared (or the attached FIFO drops below its low-water limit) another fault is sent to the operating system informing it that the send port can resume transmission. The fault handler transmits the queued messages to the receive port and if it does not overflow again during this process the send port will be re-enabled for transmission.

Fault ports may receive messages from a number of source ports, so disabling a single source port may not prevent the reception of further messages on the destination port. In this case each message from a new port will cause a fault to be generated and the operating system will disable each of the sending ports in turn while queueing the messages for later retransmission.

6.6.5.3 Short addresses

Some ports such as those which provide system services or library functions to user programs are far more frequently used than others. Often though these services will be sufficiently large and complex that they always run ``soft''. In that case the address can take a special short form which raises a fault to the operating system directly, without requiring the full capability address of the server port to be specified. This requires each of these special ports to be registered specifically with the operating system. User tasks cannot normally do this - only privileged operating system services are permitted to use this limited short-address space.

6.6.5.4 Wired routes

Another common routing optimisation is to provide a direct wire link between two ports. This is the normal mode of communication in standard FPGAs but in Switchblade it is less common since routed message passing takes the place of wires in most cases. There is no substitute for directly configured wires if speed is of the essence however, so the system may decide to establish a wired communications link between two ports. This kind of link can be considered to be a special case of location-dependent addressing. In this case the addressing takes the form of establishing electrical connections from the source to the destination using the configuration hardware. No ``address'' as such is stored in hardware as the operating system creates the routes directly in configuration store.

Careful synthesis of communicating contours can mean that high-bandwidth ports are usually located physically close to each other. This makes wired routes short, inexpensive and fast. While there was no specific provision for controlling contour layout in the prototype Switchblade some knowledge of the loader's behaviour meant that manually generated programs could be made to run very fast in this way. A more sophisticated system might provide ``hints'' as to what ports needed to communicate at high speed and provide synthesis tools and operating system support geared towards this.

6.6.5.5 Sub-addressing

Each capability address has an extra sub-address which is not used in the inter-junction routing process. It is designed to be used for sub-addressing within a capability. For example a memory address capability might give access to a page of memory and the remaining sub-address bits give access to individual memory locations within that page.

Sub-addressing gives the ability to confer access to a whole group of resources with only one capability. It also vastly reduces the number of capability routes which need to be stored in hardware, reducing the hardware cost of supporting capabilities significantly.

Often the port addresses of a contour will all have the same capability and differ only in sub-address. This is not a security risk as most contours only communicate with other contours of the same task. Since it must be assumed that you can trust yourself there is no danger in providing access to all the ports through the same capability. On the other hand contours which communicate with parts of the system outside a task cannot be trusted and the ports which do this dangerous communication are given safe addresses which protect the contour from interference on its other ports.

In most cases this will not even be necessary at all. If a contour can guarantee that all its read ports are automatically connected to other ports when the task is started it can already be assured that no interference with its operation is possible, and may blithely quote its own private capability address to external entities, safe in the knowledge that they cannot harm it.

6.6.6 Non-transferable capabilities

Under some circumstances a task may want to give another task a capability, but not want to take the risk that it will pass it along to a third party.

It is not immediately obvious that this is necessary in the ordinary case of two ports being set up to communicate with each other - the read port is claimed for the duration of the task and cannot be interfered with. Still, it's possible that if the second task passed the capability to a third task then this task could use the capability to interfere with a later instance of the first task.

For these reasons a mechanism is provided to create capabilities which are non-transferable. When the system creates a virtual connection between two message ports at task startup it creates the connections to capabilities which corresponds to instances of the message ports, not a persistent capability which is present between executions of the task.

This method is used for connections which persist for the duration of the task, but is not effective for fault ports as they are not connection-oriented. An operating system extension can provide extra functionality in the junction which allows fault read ports to select which write ports they will communicate with. Messages from disallowed ports will be transparently discarded.

The expense of this message filtering is not inconsiderable, so it should be used only when security is a major consideration. In systems without malicious agents this kind of situation is unlikely enough to occur by accident that it ceases to be an issue.

6.6.7 Restricted access capabilities

Sometimes a capability will confer more than a single access right on the holder. A memory capability for instance may give access to a range of memory locations, and may permit read, write or software execute privileges. A task may wish to pass some but not all of these abilities to another task. A common example is where another task is provided with read-only memory access to a shared memory area along with some mutual exclusion capabilities to mediate communication.

An operating system service is provided to create a restricted capability of this kind from an existing capability. Note that access can only ever be restricted in this way, existing restrictions can never be relaxed.

The ways in which restrictions can be applied depends on the type of resource in question. With virtual memory it may be possible to restrict the scope of memory access, but this may only be possible on a per-page basis rather than on a byte-by-byte basis.

A task possessing a capability to a set of ports on a contour might want to give away access to a reduced set of these ports. This is more effectively achieved by providing the reduced set of ports with a different capability in the first place, and so is not supported by the system.

6.7 Sequential execution

The ability of a program to be entirely executed on a logic array is limited by the size of the program - if it's larger than can be fitted on the logic array then parts must either be swapped or executed as sequential code. As discussed in section 4.2, algorithms can be expressed in a variety of forms, some of which will be more expensive of array space than others. In cases where it isn't possible to fit an entire program in the logic array at one time it may be possible to execute the less heavily used parts of the program as sequential code. Which parts of the program are to be executed sequentially and which parts are to be executed in hardware can be determined at run-time by the paging system.

In chapter 4 it was pointed out that the distinction between sequential and parallel code is not absolute - gradations in between are possible which trade off logic array space against faster (and more parallel) execution. This aspect is not part of the Switchblade simulation but is definitely a promising area for further research. Switchblade provides only the coarsest level of granularity - hardware or sequential.

All object code to be executed on the system comes in two forms - contours to be executed on the logic array and sequential code to be executed by a sequential processor if no space is available in the logic array. The paging algorithm determines which method is used to execute an algorithm at any given time.

Any algorithm which is executed in software is assumed to be little enough used that no great performance hit will be taken by executing it in software. This implies that the sequential processor need not be particularly powerful, either.

Rather than provide a powerful external processor, the system permits areas of logic to be configured to a special ``cooperative processor'' mode. A very small amount of specialised logic is provided in each logic block to optimise the array for a specific configuration which is a sequential processor. Each eight by eight logic block area of the array has a repeating pattern of this special circuit, meaning that a sequential processor can be aligned only on eight-block boundaries. At the cost of two percent extra space used for the special logic it permits the sequential processors to be squeezed into only 60% of the space they would otherwise require.

Any number of sequential processors can be present on the array at once. Due to their bit-slice design they can also be ``stacked'' to provide processors of greater word widths.

Processors have access to the configuration store, and will commonly be used to create complex configurations on the fly as a symbiotic hardware algorithm is operating. Since reconfiguration can be a relatively complex and infrequent operation it makes sense to conserve logic array space by executing it in software.

Instruction code for the processors can be accessed either directly from configuration store or from an external RAM. Running code from the configuration store saves on I/O and allows quicker execution, but is less frugal of precious configuration space.

6.7.1 Sequential processor architecture

The sequential processors have a very simple four-bit RISC design. They are bit-slice processors so they may be conveniently cascaded to provide wider data paths.

In the standard configuration a processor has a simple single-register architecture. By adding more logic blocks it can be expanded to have more registers.

Many of the components required by the processors are common with the functions provided by the logic blocks. The bit-slice processors each have ALU, register and counter functions. These are also available to the logic array as part of the conventional function of each logic block when it is not configured to be part of a processor. Similarly the support provided in each logic block for state machines is shared by blocks which are configured as instruction execution units when used in processors.

The instruction stream is interpreted by an instruction execution unit which is simply programmed into a logic block - while some special support is provided by the array for this function a level of flexibility is retained to allow for flexibility in the instruction set. More complex execution units can be created using multiple logic blocks. A great deal of the data path routing functions in the processor are taken care of by the use of a combination of the multiple configuration stores and routing circuitry. When the processor enters a state which requires a particular data path to be set up it selects a corresponding configuration store which has the appropriate routes programmed into it.

For larger processors a configurable number of logic blocks can be configured as the program counter, depending on the address width required.

It is quite viable to configure the whole logic array as a huge number of parallel sequential processors. In some applications this will provide greater processing power than using the array for a smaller number of higher-speed functions.

One very desirable feature which was not implemented for Switchblade is an ability to translate between variables in a hardware implementation and those in a software version. This would permit contours to run in either state depending on the amount of logic available and the context to be transparently switched from a hardware one to a software one or vice-versa. The algorithm to do this would have to operate in collusion with data obtained from the synthesis system. This is a worthwhile area for further research.

6.8 Performance estimate of the virtual logic system

A simulation of the Switchblade virtual logic system was used to assess its performance potential. Creating this simulation was complicated by the fact that no circuit synthesis tools for Switchblade currently exist. These tools are outside the scope of the thesis and how to create them is a research area in itself.

As an alternative to a direct simulation a higher level simulation was created. A large, complicated program of a type which is normally poorly suited to execution on logic arrays was analysed and an execution profile formed. This profile was then applied to the Switchblade model and estimates of the system performance were generated.

6.8.1 Paging algorithm

In any paging system the paging algorithm has an extremely important influence on performance, and the virtual logic paging algorithm used in Switchblade's is no exception. The algorithm was created from scratch for the simulation using a relatively simple random page replacement strategy. The algorithm for contour checking and page replacement is shown in figure 6.4.

procedure Contour.pageIn ( PreferredConfigNo )
begin
    if Contour.inConfigRam
    then
        if Contour.allPagesPresent
        then
            if Contour.allPagesSwitchedIn
            then
                return             // all is well already
            else
                Contour.switchConfigIn()    // switch this config in
            endif
        else
            // not all of the contour is present
            Contour.unload()
            Contour.load(PreferredConfigNo) // load to a new location
        endif
        
    else
        // not in the configuration store
        Contour.load(PreferredConfigNo)
    endif
end
Figure [6.4 - "The page replacement algorithm"]

The preferred configuration number is the configuration store of the calling contour. Contours are preferentially loaded into the same configuration store as the contours which are most likely to precede and follow them. This increases the likelihood that a working set of contours will be placed in the same configuration store and hence less configuration switching will be required. Note that since configuration stores are switched on a per-page granularity it is still possible to have two communicating contours simultaneously switched in even if they are in different configuration stores. Keeping them in the same store merely decreases the likelihood that they will physically overlap, which would require fast configuration switching for them to communicate and prevent parallel operation.

Loading of contours uses a set of free lists arranged in a per-configuration-store basis. The free lists are indexed to facilitate quick searching for contiguous areas of logic sufficiently large for a contour to fit in. The prototype algorithm was not particularly clever about fitting strangely shaped contours together efficiently. Fortunately the vast majority of contours were very small - only one page in size. See figure 6.5 for the algorithm.

procedure Contour.load ( PreferredConfigNo )
begin
    // try to find space in the preferred configuration
    Found <- FreeList[PreferredConfigNo].search(Contour.noOfPages)
    if not Found
    then
        // try to find in some other random configuration
        try 8 times until Found
        begin
            ConfigurationStore <- random(AllConfigurationStores)
            Found <- FreeList[RandomConfig].search(Contour.pageSize)
        end
        
        if not Found
        then
            // second attempt - a more brutal try at replacement
            do
                // try all configurations at a random location
                Location <- random(ARRAYWIDTH, ARRAYHEIGHT)
                Found <- false
                foreach ConfigurationStore until Found
                    if ConfigurationStore.cell[Location].empty
                    then
                        Found <- true
                    endif
                endfor

                if not Found
                then
                    // pick a random configuration and wipe what's there
                    ConfigurationStore <- random(AllConfigurationStores)
                    BadChoice <- assessImportance(Contour.pageSize)
                    
                    if not BadChoice
                    then
                        ConfigurationStore.unload(Location, Contour.pageSize)
                    endif
                endif
            
            while BadChoice
            
        endif
        
    endif
    
    // physically load it in
    ConfigurationStore.physicallyLoad(Contour, Location)
    ConfigurationStore.switch(Contour, Location)
    
end
Figure [6.5 - "The contour loading algorithm"]

6.8.2 Sample application

For the simulation an application was chosen which had characteristics uniquely unsuited to logic arrays. The application was chosen for qualities of poor parallelism and a large code size. The GNU C compiler ``gcc'' was chosen as an application with these qualities. No attempt was made to exploit the parallelism offered by the logic array processor - the run profile used in the simulation was in fact obtained from execution on a conventional sequential processor.

Each function in gcc was translated into an equivalent area of logic, to be regarded as a ``contour'' for the purposes of the experiment. An estimate of the number of pages taken by each contour was used to determine the space used by each contour on the array. Pages comprise an 8 by 8 area of logic blocks. The largest contour in the example was 497 pages, necessitating a minimum array size of 512 pages. The vast majority of contours were much smaller, mostly one or two pages. For the following simulation results the application began in a completely unloaded state and was demand loaded during execution.

Simulating the run by compiling gcc into a logic form and simulating the logic for an entire simulated ``gcc'' compilation run would have taken years of simulation time on the available equipment. It was decided to vastly increase the speed of the simulation by working from profiles of gcc running in software on a conventional processor. These profiles were then used to guide the flow of control between contours in the simulation. Since this simulation was of the virtual logic system and not of the logic array itself the lack of low-level simulation was not crucial.

The run profiles were generated by creating a specially instrumented version of gcc 2.7.2.2 which included checkpoint logging at the beginning and end of each function. These checkpoints included timing information which was used to monitor how much processor time was being consumed in each function. The times were not adjusted to count a possible speedup due to the direct implementation of these functions in logic. Keeping in mind that this is not a simulation of the overall performance - only that of the virtual logic system - accurate simulation of the speed of the logic system was not a priority. Comparing implementation in logic with software implementation on a conventional processor, generally a small speedup might be achieved since many instructions in a normal program store data or transfer data around and can be replaced by much faster local wiring in a logic array. Rather than attempt to make predictions of this sort for all types of code the assumption was made that execution would take roughly the same time on each architecture - which is probably true to within an order of magnitude in the worst case and assuming that no advantage is taken of the parallelism of the logic array. Since no genuine implementation exists it is difficult to determine Switchblade's low-level speed characteristics accurately, but they can be assumed to be roughly equivalent to those of comparable FPGA systems if none of Switchblade's special features are exploited. For non-parallel applications FPGA systems generally run faster than conventional processors in applications where a great deal of data manipulation is involved and slower in applications where a great deal of floating-point computation is involved (which is normally executed on a heavily optimised floating-point unit). The compromise here was to assume an average of approximate equity between Switchblade and a conventional processor, and given that this high-level simulation was intended to model the behaviour of the virtual logic system and not the low-level performance of the logic array this was considered an adequate approximation.

Each simulation run was guided by these basic estimates of how long each function would take to run and at what points it would transfer control to other functions. Each transfer of control was simulated as a transfer of data between Switchblade contours. The time taken to send data between contours was modelled, as was the time taken to switch between multiple configuration stores and the time to load in contours from external memory if they were not paged in. At no point in this simulation was more than one concurrent operation being performed - simulating the slowest possible circumstance as far as a highly-parallel logic array system is concerned. Since the simulation assumes that Switchblade functions run at the same speed as the MIPS 4000 processor on which the profile was executed, the best case speed would be equal to (or marginally slower than) the original profile.

In a case where a very large array is available little switching will occur and communication will pass between contours quickly. If transmission of messages requires switching between configuration stores some overhead is incurred and message passing takes somewhat longer. In the worst case where the logic array is small and destination contours are likely not to be configured anywhere on the logic array at all OS intervention is required and the contours must be loaded into the configuration store before messages can be transmitted - slowing communication dramatically.

The simulation program used the paging algorithms described earlier to reproduce as accurately as possible the effects of gcc being run on Switchblade. As stated earlier this kind of high level simulation cannot claim to be very accurate, however the purpose of the exercise was to determine if the virtual logic system was viable in general terms so estimates of the lower-level function were judged to be adequate. The purpose of the simulation is to discover the variety of conditions under which it behaves well. A number of the following simulation results address the performance of the system when various design parameters are modified and examine the circumstances under which it performs poorly. Some attention is also paid to how gracefully performance degrades in these situations.

Figure [6.6 - "Performance of various configurations"]

Figure 6.6 shows a graph of performance against logic array size and the number of configuration stores. A performance of 1.0 implies that the application was able to run at full speed without being hindered by paging activity. As expected, performance gains are realised both by increasing array size and increasing the number of configuration stores.

For the characteristics of the example ``gcc'' application, the Switchblade prototype's parameters of 1024 pages and eight configurations are close to the optimal choice, providing high performance and economical hardware cost. Of course the level of performance will vary with the size and number of applications running, but the ability to run gcc efficiently indicates that the system is in the same order of performance as current-generation general purpose processors.

Irregularities in the graph demonstrate a slight affinity of the application for particular scalings of the architecture. Note for instance that a seven configuration, 512 page system performs better than one with eight configurations. This is due to a combination of attributes of the application, the array space and the paging algorithm.

Figure [6.7 - "Page load overhead in various configurations"]

The two sources of overhead in the system are when pages are loaded into configuration store and when configuration stores are switched between. Of the two, loading configurations is vastly more expensive. Loading a single configuration store for a single page in the simulated system takes approximately 20 microseconds. Switching between configuration stores takes a few nanoseconds, and consequently has negligible impact on the overall performance as long as it is kept to a sensible level. In this simulation the levels of configuration switching were low enough that they were invisible against the cost of configuration page loads.

Figure 6.7 shows the total number of configuration page loads for each type of logic array. In the case of the 512 page, single configuration store array the system was thrashing, spending 96% of its time loading configurations and only 4% of its time computing.

Figure [6.8 - "Contour switching in various configurations"]

Figure 6.8 shows the number of configuration switches occurring during execution. Each switching operation affects an entire contour at once and occurs within nanoseconds, so the effects are relatively insignificant. The graph does however demonstrate the effectiveness of the paging algorithm. More switching naturally occurs in situations with more configurations or smaller logic arrays. The relative randomness of the graph shows the complexities of the application interacting with the paging algorithm at various array sizes.

Contour
(each pixel
corresponds
to two
contours)
  time -»
Figure [6.9 - "Contour usage in gcc (refer to text for explanation)"]

The characteristics of the gcc execution profile can be seen clearly in figure 6.9. The x axis is execution time, from program start at the left to termination on the right. Each location on the y axis corresponds to a single contour. The graph demonstrates the execution of each contour over time. The first quarter of execution is parsing. Next follows a brief phase of code generation and then optimisation. Further phases of code generation are interspersed with more optimisation for the remainder of the run. This diagram gives some idea of the changing code requirements of the application over time.

Figure [6.10 - "Working set of gcc"]

Note that in many of the following graphs simulation time is given in seconds rather than using ``ticks'' or some other simulation time unit. This is quite deliberate as ``ticks'' are not particularly applicable to this kind of system so the simulation was performed approximating the performance of an implementation with current-day hardware performance. The reason why ``ticks'' are not applicable is that inter-contour communication is variant in the amount of time it takes dependent on many factors including the distance on the array between two communicating ports, the amount of data which needs to be transferred and what options are implemented on the ports. There is no central time tick which regulates the transfer of data around the system at the lowest level - much of the port communications logic is asynchronous so it isn't really possible to apply the unit of ``ticks'' to this simulation. Instead some estimates of the likely transfer times were made given the design described in this chapter and assuming a logic speed consistent with current-level technology. The time taken by processing within a contour is also not necessarily synchronised to any clock, but will take whatever amount of time the logic circuit within the contour happens to take. For these reasons seconds are used as the time unit for the purposes of this simulation, even if the actual time used in a real device will differ depending on the technology available. The timings are in any case approximate and are not crucial to the experimental results, which are more related to the performance degradation characteristics of the virtual logic system than the outright performance of the Switchblade system as a whole. The markings on the horizontal axis of these graphs can therefore be considered an example for one implementation rather than definitive. Other implementations would have similar characteristics but a different scale on this axis.

Figure 6.10 shows the working set of the program over time. A relatively small, constant working set during parsing later varies wildly between optimisation and code generation phases. The number of pages in the working set never exceeds 2800. This sample data was applied to an array of 1024 pages and 4 configuration stores. This size was chosen as it demonstrates a point on the performance curve where degradation is beginning to be serious. Note that the total number of pages available is 4096, significantly higher than the theoretical minimum of 2800 which are required to prevent thrashing assuming a 100% effective paging algorithm and no fragmentation.

Figure [6.11 - "Performance of gcc"]

The performance of the array over time is shown in figure 6.11. There is some initial slowdown as contours are demand-loaded. As soon as the working set of pages related to parsing is loaded performance quickly reaches the maximum. When the operating characteristics of the program change paging begins and the performance is variably degraded for the remainder of the run.

In reality the execution time of the performance degraded run is longer than the 1.9 seconds of actual execution time. Page configuration loads become a significant factor, extending the run time to over 4 seconds.

Figure [6.12 - "Working set graph adjusted for slowdown"]

Adjusting the working set graph for the effects of paging results in figure 6.12. The earlier graph (figure 6.11) showed the working set as if executed on an infinite logic array. Peaks where the working set is large pass quickly as large working sets don't degrade performance on an infinite logic array. This graph shows the effect of limiting the size of the logic array to a level where paging occurs - execution takes significantly longer and the points where execution is slowed the most are also those where the working set is largest. Execution is almost as fast where the working set is small however, for instance at around the 2.0s mark which corresponds to the 1.0s area in figure 6.11. This confirms that large working sets cause performance degradation, as expected.

Figure [6.13 - "Performance graph adjusted for slowdown"]

The performance graph is similarly extended. In this case the points of lowest performance tend to be stretched as the paging system struggles to load in a new working set (figure 6.13).

Figure [6.14 - "Array space used over time"]

Figure 6.14 shows the amount of array space used against time. Starting at zero when execution begins, it raises and then plateaus during parsing. When code generation starts however the remainder of the array soon fills and some contours begin to be paged out to make room for new contours. Notice how the theoretical maximum of 4096 is never reached as fragmentation prevents the array from being fully utilised.

Figure [6.15 - "Page loads over time"]

The rate at which pages are loaded during execution is shown in figure 6.15. The maximum rate of about 45,000 page reconfigurations per second is frequently approached, sometimes for extended periods as large amounts of reconfiguration occur.

Figure [6.16 - "Page unloads over time"]

Page unloads on the other hand are less directly limited by the hardware since they need not be physically copied out of the array, only displaced by a new configuration or partially displaced. Figure 6.16 shows where pages are removed from the array, peaks mostly corresponding to peaks in page loading except at the start where the array is still filling up.

Figure [6.17 - "Contour switching over time"]

Contour switching is plotted against time in figure 6.17. Where the array is still underutilised during parsing contour switching is very effective in preventing page loads. Contour switching is high but page loads are very low. Later, when the paging system is struggling to keep the working set in configuration store switching is used sporadically but often configurations must be obtained from off the array so switching rates drop correspondingly.

Figure [6.18 - "Contour loads over time"]

Figures 6.18 and 6.19 show the rate of contour loading and unloading over time. These mostly agree with the page load graphs, except that in places where the page loads peak out the rate of contour loading tends to drop. This implies that the places where performance is worst affected by page loading are where the contours being loaded tend to be large and have many pages. This is reasonable - very large contours tend to require a large number of small contours to be paged out to make space, which in turn causes these contours to need to be paged back in again, causing further performance degradation down the line.

Figure [6.19 - "Contour unloads over time"]

6.8.3 Experimental conclusions

Simulation indicates that Switchblade's virtual logic system is viable. A limited set of tests shows that the system performs as expected from such a system and even when presented with tasks which are more traditionally associated with general-purpose processors it is capable of performing well.

A relatively simple paging algorithm performed acceptably, degrading gracefully when resource limits were approached. More research into paging algorithms for this two dimensional logic resource would almost certainly result in improved performance, as would synthesis tools capable of keeping maximum contour sizes to a lower level.

The choice of array size for Switchblade turns out to have been a fortuitous one. The array size of 64k logic blocks, or 1024 pages with 8 configuration stores appears to be an excellent compromise between array size and performance.

6.9 Conclusions

Switchblade's virtual logic system permits logic arrays to be regarded in a whole new light. No longer are they restricted to executing a single task at a time under the support provided by an external general purpose processor. No longer must a program be hand-crafted to work with a logic array of a specific size. Switchblade programs run just like programs in a modern general purpose computer - they are shielded from issues like the size of the logic array and the other programming running in the system.

The virtual logic system achieved several goals, none of which has been achieved on a logic array processor before:

  • Virtualisation of the logic resource

    • The ``contour'' system provides independence from array space limitations and logic layout in the same way that a virtual memory system provides independence from physical memory

    • Realistic multitasking is possible due to this independence from the physical layout of logic circuits

    • Hardware support gives the logic paging system reasonable information on page activity to permit good paging algorithms to be implemented

  • Support for massive parallelism

    • A packet-switched message passing mechanism is provided at a low level

    • The message passing system solves retiming problems between contours in an efficient way while having minimal impact on the logic array

    • Mutual exclusion primitives are available on a message port access basis

    • Built in flow control and buffering of messages makes synchronisation easy while maintaining throughput

    • ``Faults'' provide a mechanism to raise exception conditions which is effective within a massively parallel framework

  • Operating system support

    • Special support for sequential processing permits efficient execution of sections of code which are otherwise poorly suited to logic arrays

    • Run time configured ``Extensible operating system features'' permit less commonly used functionality to be added to message ports using configurable logic only when needed

    • Hardware-level security is provided through a password capability addressing system

  • Speed

    • The ``contour'' system inherently takes advantage of the run-time reconfigurable, multiple configuration store logic array system

    • Optimised location-dependent addressing uses run-time reconfiguration to keep routing logic fast and small

    • ``Short addresses'' and ``Wired routes'' provide optimisations for common special cases in routing