The trend towards using programmable logic arrays as computing devices has led to the question of whether they could ever replace general purpose processors. One obstacle to this is their inability to multi-task in any convenient way. A general purpose computing device should be able to run multiple tasks and should have some reasonable way to deal with large numbers of tasks executing simultaneously.
One problem with using logic arrays as computing devices is that they are essentially single-process devices. Traditional logic arrays are very slow to reconfigure, so slow that switching between configurations to change tasks would not really be viable. Even so, quite a lot of research has been done on run-time reconfiguration of current-day devices - the logic array is a scarce enough resource that in some applications it is worthwhile reconfiguring on the fly [DaalenJeavons93] [Daumas94] [Dehon94] [Dehon96a] [Eldredge93] [EldredgeHutchings94] [EldredgeHutchings94b] [EldredgeHutchings96] [FrenchTaylor93] [Foulk93] [GokhaleMarks95] [GuntherMilne96] [Hadley95] [HadleyHutchings95] [HadleyHutchings95b] [HastieCliff90] [HutchingsWirthlin95] [LemoineMerceron95] [Lysaght91] [LysaghtDunlop93] [LysaghtStockwood94] [RossVellacott93] [Saleeba93] [SinghBellec94] [SinghHogg96] [SchonerJones95] [Shand95] [Tau95] [WirthlinHutchings95] [WirthlinHutchings95b] [WirthlinHutchings96] [WirthlinHutchings97a]. Still, while reconfiguration can be done at runtime this does not serve to make multi-tasking more viable. In fact it demonstrates a desire to make even more use of the existing resources even for just a single task. Run-time reconfiguration references are covered in more detail in chapter 2.
Dynamically reconfigurable architectures such as DRAMA [Saleeba95] are able to switch between configurations rapidly, but are still limited by the fact that the circuit synthesis processes which generate the configurations are very slow and inflexible, and must be done in advance. Even though with these architectures it is quick to change between configurations, those configurations must be generated with a specific fixed location in mind, giving no real opportunity to place a circuit at a different location at runtime. This is unfortunate, since to be able to parallel process it would be convenient to have parts of more than one program running on the array at once. What's more at any given time there may well be unused space in the array available to accommodate an extra sub-circuit but this area may not be at the specific place that the sub-circuit was originally intended to appear at. Even if the circuit could physically be placed at the new location, there may well be no routing resources available to connect it to wherever it is needed. The fundamental conclusion is this - traditional logic arrays are not really designed with this kind of use in mind.
This chapter describes a new scheme which makes it possible to simultaneously execute independently routed tasks on the one logic array - achieving goals of multitasking, task-parallel processing and virtualisation of the logic array. The implementation issues which accompany this approach are discussed and a description of the design decisions behind Switchblade is given.
An ideal way of allowing multiple tasks access to the common logic resource of a logic array would be akin to the way conventional processors provide multiple tasks with access to the common memory resource - they use a virtual memory system or in this case the equivalent would be a virtual logic system. Sections of circuits in heavy use are paged in and running while latent pages are in some form of secondary storage and do not need to consume any space on the logic array at all. As with virtual memory, virtual pages can potentially be physically located anywhere in the logic array.
Much as conventional program code can be divided into pages and paged in and out, circuits can be divided up into partial hardware configurations and switched in when necessary. That much is intuitively obvious but unfortunately the reality is a lot more complex.
Circuits normally communicate with each other on many logic lines simultaneously. ``Paging out'' a section of circuit may leave a number of lines in an undefined state, with no knowledge of when the circuit needed to be paged in again when it was next needed. MMUs have no equivalent problem as they deal with essentially serial access to memory. The parallel nature of logic circuits makes it very hard to remove a part of a circuit without damaging the circuit to the point where it won't work at all. Clearly a more sophisticated technique than simple paging is necessary.
Worse than this, it's not really viable to locate pages at random in the logic array. The most scarce resource in a logic array is interconnect, and random placement of logic would lead very quickly to these interconnect resources being exhausted. It is convenient to be able to place virtual pages at random physical locations as this makes the paging algorithm simpler and more flexible, but some system of keeping related circuitry together as much as possible is necessary to keep interconnect small, fast and efficient.
Another factor which makes a logic paging system more complex than a memory paging system is the two dimensional nature of the data being paged. It is desirable to page sub-circuits in functional blocks, yet circuit synthesis algorithms will often create such blocks in ragged shapes which are very difficult to page efficiently. After a period of paging ragged-shaped circuits into random locations on a two-dimensional logic array much of the logic area would become unusable due to the jigsaw-like difficulty in slotting the circuits together. This is analogous to memory fragmentation in conventional computers, except complicated greatly by a second dimension.
The Switchblade system implements a virtual logic system based around a modified concept of paging augmented with a special inter-page communication system. The paging system assumes fixed-sized pages, each page corresponding to a set two-dimensional area of the logic array. Sub-circuits of an algorithm - called ``contours'' - are constrained to use at least one page, but may use any number of pages up to the size of the entire logic array.
Unlike paged virtual memory systems where individual pages can be paged in and out at will, this is not possible in a virtual logic system where intra-circuit connections must not be severed. A single page is not the quantum of switching - instead a contour consists of a number of pages which are all switched simultaneously, as a unit. Such ``contours'' are arranged in a fixed two-dimensional layout, which as the name implies may be any shape. Generally speaking rectangular shapes are preferred as they will cause less fragmentation but no restriction is imposed by the system. Interconnection and layout of pages within a contour is fixed, and no special care need be taken when passing data between pages. Effectively each contour operates as a continuous, uninterrupted two-dimensional area of logic.
For performance and placement reasons the number of pages in a contour should be kept to a minimum - the larger and more irregular in shape a contour is the more difficult it is to place in conjunction with other such contours. Large, irregular contours will result in bad fragmentation when the system performance degrades - further contributing to the drop in performance. On the other hand the cost of interconnection within contours is lower than that between contours, so there is a tradeoff between speed and array usage. The synthesis system must make reasonable decisions about where to place the boundaries between contours. Synthesis is beyond the scope of this thesis so algorithms for achieving this ``reasonable'' tradeoff will not be discussed here.
The contour table keeps track of each virtual page in the system and what status it has. The status of all pages in a contour is the same. This may be one of four states, as shown in figure 6.1:
|
A small algorithm should be able to operate entirely ``hard'' - all in logic at once and running at maximum speed. Larger algorithms which rely on Switchblade's multiple configuration stores will be have some contours ``hard'' and some ``stored'' at any given time. The system will cause a ``fault'' every time data is sent to an unavailable contour, perhaps causing it to be switched in.
A ``soft'' or ``swapped'' contour implies one of two things - either the contour in question is little-used or logic resources are so tight that it has been paged out prematurely. In some cases it will never be useful to have these sections of code running in hardware - they may run so infrequently and take such a disproportionate quantity of resources that they're better run in conventional sequential code. Note that this does not imply a conventional general purpose processor to execute the contour off-array. Instead the ``soft'' contour can be sequentially executed by the logic array itself as described in section 6.7.
It may also be advantageous to map a contour in multiple parts of the logic array simultaneously. This allows it to be conveniently located for different configurations of the circuits around it. These multiple places might not be used at the same time, but can be selected in turn as the environment of circuits around them changed.
Estimates based on simulation and observation of existing circuits led to a choice of page size on Switchblade of four by four logic blocks. This was judged to be coarse enough to keep the number of pages per contour down to a manageable level but fine enough to avoid excessive loss on the perimeter of contours. It also results in the number of physical pages in the system being of roughly the same order as in a conventional computer for the size of logic array chosen. This may not necessarily be meaningful given the differing natures of the systems but it is a comforting indication that the choice is probably in roughly the right ballpark.
Paging of hardware should not be confused with memory paging. A complete system such as Switchblade will have both forms of paging available to it. A conventional MMU provides virtual memory functions to the processor in much the same way that it does to a conventional processor. Sequential processes use the MMU to gain paged access to a virtual address space containing their code. Data in virtual memory can be accessed by either sequential or hardware processes.
More interestingly, configuration information can be stored with three levels of virtualisation - initially configured in hardware, then switched out to an alternate configuration, then paged out of the logic array into memory, then paged out of memory on to disk. This many levels of paging hierarchy allows paging to occur ``softly'' - paging a contour down a single level of the hierarchy can occur without incurring excessive costs in paging it back in. Since resources at the lowest level - the logic array level - are very scarce, a large amount of high-speed paging is to be expected at that level so it must be handled quickly.
Communication between contours is more complicated and slower than communication within contours. This is due to the fact that the data sources and destinations for a given contour may or may not be paged into the system at any given time, and even if paged in might be located anywhere on the array.
A dataflow approach is used to resolve this issue. Peripheral logic lines are regarded as data sources and sinks and contours are regarded as functional units. Data are passed between two functional units directly if they are both paged in, otherwise a ``fault'' is generated when an attempt is made to pass data to a contour which is not available.
When a fault occurs the data must be passed somehow to the destination contour. The most convenient case is if the missing contour is just in an inactive configuration store somewhere on the logic array. The paging algorithm will cause this configuration store to become active and will ensure that valid interconnect exists between the two contours. When the new contour is available the waiting data will be presented to it and computation will continue.
If the destination contour is not available in the configuration store one of two things may occur - either the new contour may be configured in or it may be executed sequentially instead. Which of these is done is entirely at the discretion of the paging algorithm.
Existing between contours are the mechanisms which provide this ability to do asynchronous inter-contour communication and to raise faults when necessary. This feature is designed to give the advantages of direct hardware connection for heavily-used communication lines along with the economy of a routed message passing system for less frequently used ones. This mechanism is described in detail in section 6.5.
In essence, contours communicate with each other by means of message ports. These message ports may be configured to communicate either in a connection-based fashion or in a datagram fashion. They transfer a configurable number of bits with each transaction, and have a handshaking mechanism to control the flow of data. They can also be configured to send fault messages to the operating system under certain conditions. This message passing mechanism provides all the necessary interlock features for communication between contours.
In any parallel processing system some form of mutual exclusion is required to ensure that multiple processes don't modify data at the same time. Within a contour this is not a problem as the logic circuit can be designed to be internally consistent through the use of synchronous or timed logic. Communication between contours however can cause synchronisation problems as all inter-contour messages are asynchronous, making it impossible to be sure in advance exactly what order events will occur in.
The message passing mechanism provides the fundamentals of synchronisation between tasks. A resource which subject to competition will have an access control contour to prevent uncontrolled contention for the resource. This controlling contour will often be very simple indeed as the message port interface provides an inherent mutually exclusive arbitration mechanism.
The system is designed around the ability to switch code in and out of hardware, as well as in and out of configuration store and in and out of memory. This unusual four-level paging system requires quite a complicated paging algorithm, so it is important to provide it with as much information as possible about the operation of the system so it can make sensible decisions about how to operate.
It is quite difficult for the operating system to determine if a contour is performing a computing function or if it is just in a wait state. Contours in wait states are obviously prime candidates for paging out, but without accurate knowledge of how active contours are it is difficult to make these decisions correctly. The following section describes some methods which can be used to gather this information in support of the paging algorithm.
The most direct hint that can be provided to the operating system is when a contour attempts to write on a port and is unable to due to the read port's queue being already full. This causes a fault which the operating system can use to switch the contour out if it wants to.
Note that this doesn't necessarily mean that the contour isn't doing anything - it might well be still computing even though its write is blocked. Still, in most cases a blocked write will mean that the contour is unable to proceed so an arbitrary rule is made that being blocked on any write port is implies a contour's willingness to be paged out. This affects the way that contours are designed - generally speaking buffering between contours is better left to operating system buffers rather than providing the buffers in contour logic. This way the operating system can understand better whether the contour is idle and will cause fewer cases where the contour is switched out when it shouldn't have been.
Correspondingly, when an overflowed queue is emptied to below its low-water mark a fault is sent to the operating system to request that the communicating contour be switched back in. The paging algorithm may or may not decide to page the contour in immediately, depending on load. It might for instance elect to continue executing the contour as a sequential process.
The low-water mark depends on the implementation of the read queue. The default read queue has only a single buffer entry available so its low-water mark is reached when this single entry is emptied. This will cause a fault to be sent which will in turn cause the source contour to be paged back in and hence prompt delivery of the following message (which originally overflowed the queue and caused the contour to be paged out). This means the buffer isn't likely to be empty for long, but a buffer overflow fault won't occur until yet another write is attempted by the source contour.
Each read port or write port can have a counter attached to it which keeps track of the number of transmissions on that port. This can be used to estimate how heavily a contour is being used. When the counter reaches its maximum it stops counting until cleared by the operating system. The size of this counter is configurable, but often a single bit will suffice - indicating that the port has been used rather than giving detail on how many times it has been used. The decision of how wide to make these counters depends on how often the paging algorithm is going to check and clear them. A wider counter is more useful if the system is checking infrequently.
All of the methods mentioned to this point are imprecise - there are cases when contours will be misidentified as latent and will be erroneously switched out. This could affect the performance of the system adversely or in a pathological case cause an almost-deadlock situation where two communicating contours keep each other's read queues full and cause themselves to be switched out before they have a chance to clear their own read queues. This case would eventually clear itself due to the finite time that it takes the operating system to switch tasks, but the performance would be very poor.
A far more effective way of keeping the operating system informed of the status of a contour is to have the contour provide that information directly. If a contour is incapable of continuing it can send a ``wait fault'' to the operating system saying that it wishes to be paged out until further notice. Note that this fault will not necessarily cause the contour to be paged out immediately - it only provides advice to the operating system that it is a good candidate for paging. The paging algorithm may have other information which causes it to choose not to page the contour out - such as there being no great constraint on logic array space or statistics indicating that this contour is likely to be used again shortly.
Similarly, if a contour doesn't want to go to the effort of working out when it is stalled and explicitly requesting page-outs it can perform an equivalent function by doing the reverse - telling the operating system when it's busy. This can be done using the normal message counter mechanism in a slightly unusual way - a message port is configured to carry no data and to lead nowhere. Its only function is to increment a port usage counter or set a usage bit.
In fact ports without any data are not uncommon - they can be used to handshake with other contours. This can be used to pass control to a successor contour, for instance. A port which transmits to nowhere is more unusual - it requires a special address which is resolved by the operating system to only connect to a counter and not involve transmission at all.
This combination of features - the wait fault and the activity message provide enough information for the paging algorithm to do a good job of selecting contours to page out. They do however require either explicit code from the programmer to support them or part of the programming system to automatically generate the appropriate hardware. This is discussed at greater length in chapter 4.
In practice, while the basic mechanisms for all these techniques were implemented on the Switchblade simulation, operating system support was only included for the more convenient ``usage bit'' method. Algorithms to support more techniques could have been added but as part of the operating system these were judged to be a subject for later research beyond this thesis.
There are normally only three conditions which will cause a contour to wish to resume execution after it has been paged out. If the contour was paged out because of a blockage on a message write it may be paged back in when the blockage is cleared. If the contour was sitting idle without sufficient data to keep it running it may be paged back in when a message is sent to it. Alternatively the contour may have been paged out simply because the logic array resources were becoming scarce and the system is waiting on processor resources becoming available before the contour can be paged back in.
All other wait conditions are variants on the three above circumstances. A contour waiting on disk I/O for instance is more immediately waiting on a response from the disk device driver, which will come to it in the form of a message on a read port.
It can be argued that hardware interfaces have access to a different form of I/O and hence they might find it useful to be able to wait on that stimulus specifically. This brings us to an important variant on contours - contours that are ``wired down''. These contours are part of the operating system and are designed to operate in a specific physical location in the logic array - usually in association with an I/O port. They can not be shifted in location or paged out. Hardware interfaces are ``wired down'' in the interests of quick response and so that their access to hardware will never be broken by their being paged out. Often these wired down sections will only be small interfaces which send messages to more complex device driver contours through the normal message passing mechanisms.
A contour which has emitted a wait fault to the operating system will have a specific reason for doing so - it may be waiting on a blocked write port clearing or it may be waiting on more input becoming available on a read port. The wait fault message which it sends can specify exactly which stimuli it is interested in - or rather, which read or write ports it's particularly interested in hearing from. Other stimuli will be ignored, or rather will not cause it to be paged back in immediately.
If a message on a read port is ignored it may still indirectly cause a contour to be loaded. If a number of successive messages on the port are ignored the read port's queue will fill and may cause another contour to become blocked. This may result in the paging algorithm choosing to page the contour back in, in an attempt to clear the blockage on that read port.
Contours which have been chosen by the paging algorithm to be paged out of hardware are not necessarily unable to continue execution even if it appears that way to the paging algorithm. It is quite possible that they have been poorly selected and are ready to continue operation at full speed. A contour may have been paged out due to a blockage on a write port, but in fact be capable of accepting more data on input and performing useful computation on that. It is very difficult to determine if contours are in these kinds of states.
One solution is to occasionally run paged out contours as software tasks rather than hardware as described in section 6.4.3 - the software version of the code can quickly determine if the code has reached a state where it cannot continue without a change in external stimuli. If this state has been reached the operating system can safely place the contour on a ``blocked'' queue and wait for appropriate input.
Often a contour will not be able to continue computation until a
number of conditions are met - usually until messages have been
received on a number of read ports. A possible optimisation is to
prevent contours from being paged in until all their required stimuli
had been received. This prevents cases where a contour was paged back
in on receipt of a message, only to be paged back out because another
message was required for the contour to be able to continue. This
has been empirically tested for hand-coded cases but it isn't clear
whether it's practical to use this technique with automatically
synthesised contours.
6.4.3 Sequential code
Unlike paging systems on conventional processors, there are actually several levels of paging in this architecture. Code is not only paged between memory and disk, it can be paged from memory to the logic array, and paged from the logic array to backing configuration store.
Another option with this flexible system of paging is to take a contour which has been executing in hardware and continue its execution in software. This is made possible by the fact that both hardware and sequential software versions are available for all code. Code which is executed infrequently is a prime candidate to be paged back to sequential code. No special hardware is provided to aid this transition - the function is provided by the operating system.
More detail on sequential execution is given in
section 6.7.
6.5 Routing
Routing between logic blocks takes three forms - local routing, medium-distance routing and contour routing.
Local routing and medium-distance routing provides connections between logic blocks within a contour. Local routing connections lead directly from each logic block to adjacent ones, and are the fastest option. Medium-distance interconnect can be used to carry signals efficiently over longer distances, still within a contour. Both these forms of interconnect are described in the previous chapter. They have no higher-level function - they act as wires connecting the logic blocks, and have no ability to raise a fault to the operating system.
Since configured routes share the same configuration store with everything else, it is possible to reconfigure routes directly from the logic array. One way that this is useful is in creating applications where routes are dynamically reconfigured at run-time. Alternatively configurations can be switched-between under the control of the hardware, allowing rapid changes between entire circuit layouts.
Contour routing offers a whole new form of communication within logic arrays. It operates at a higher level than the other types of routing, providing location independent routing of messages between contours. This effectively virtualises the logic array, making it a homogeneous substrate upon which contours can be placed and relocated transparently.
Both local routing and medium-distance routing are intended for use within contours only. Within contours we can use a circuit synthesis algorithm to create an efficient layout at compile time. Between contours we are reliant on real-time routing which is much slower and more expensive of resources.
The form of communication between logic blocks within a contour also takes a very different form from communication between contours. Within a contour an algorithm is represented directly as a logic circuit, with communication passing down logic lines from anywhere to anywhere else at the whim of the synthesis software. Communication between contours uses a dataflow model as described in the next section.
Circuits within a contour are location independent in as much as they can be placed anywhere on the array and still operate in an internally consistent manner. Their function is not dependent on their location in the array (with the exception of I/O, of course). What prevents them from being individually relocated is the ability to then connect them back into the larger scheme of things. Contour routing fills this gap in current architectures.
Contour routing is more flexible than local forms of routing, at the cost of speed. Transmission from one contour to another is independent of both their locations in the logic array, and independent of their location relative to each other. This allows the paging algorithm to place contours wherever it likes within the constraints of routing resources.
Between contours the form that communication takes is based on the transfer of data between contours rather than the logic-levels usually used within a contour. It relies on dataflow analysis from the compiler to define the form of communication to and from each contour. Transmission of a datum is a tightly-defined action involving a group of logic lines being set to a state at a particular time, rather than just allowing logic lines to change state when they will.
Since dataflow deals with quanta of data which are ``results'' - i.e. numbers, strings or whatever the algorithm deals with - these are the appropriate quanta to pass between the contours. Rather than allowing logic lines to change independently of one another, an entire result is sent as a package.
This is not to say that the system models a dataflow architecture - rather dataflow analysis is a suitable technique for dividing an algorithm into contours to make it convenient to pass results around in this form. Communications can take place at the higher level rather than using the free-form logic routing used at the lower level within contours.
Another advantage of passing data packaged in this way is that it permits a whole range of functions not normally possible with conventional logic routing. Normal logic lines need to be connected all the time - who is to say what value a logic line has when it isn't connected, or whether a clock signal can safely be left disconnected from a circuit which is not in use? To permit dynamic paging-in of circuits when they're needed it is convenient to be able to discriminate when actual solid results are being passed from one contour to another. The dataflow technique provides just this.
An incidental effect of the dataflow nature of communication in the contour routing system is it makes ``demand loading'' of contours possible. If initially only the starting contour is paged in it will execute in isolation until it sends a result to another contour. Computation will then be blocked until the new contour is available, at which point processing will continue until either of the contours sends a result to an unavailable contour. This process will continue until a working set of contours is established. This is just one facet of the way contour routing virtualises the Switchblade logic array.
The message ports used by contours to transmit and receive data are really connections to a nearby routing junction. When a contour attempts to send a result to another contour, the junction has the job of forwarding the result to the appropriate destination or raising the appropriate fault to cause the destination contour to be paged in.
At the lowest level routing junctions communicate with a small number of logic blocks in their vicinity. These junctions have access to a longer-distance routing grid which they use to communicate between themselves. Physically speaking, junctions are a thin section of specialised logic between the normal logic spaced at every eight logic blocks. They have their own hierarchical interconnect structure.
The junctions in Switchblade are divided into groups, each of which has its own internal communication structure. Communication between groups is performed by another larger of junctions above the lower level, using another layer of communication lines. This hierarchical structure can continue theoretically forever, with each successive layer providing longer-distance communication over slower links. The higher-level links are also a more rarefied resource, and should be used sparingly. The paging algorithm has the task of ensuring that communicating contours are placed as close together as possible, both to reduce propagation delay and to conserve the higher-level routing resources.
Figure 6.2 shows the physical layout
of the routing circuitry on the die.
Junction groups provide areas in which communication is inherently
faster. Since large contours can easily bridge junction boundaries, it
is possible for contours to communicate with other contours in
multiple junction groups simultaneously without using higher levels of
junction group.
Central to each group of junctions is a parent junction which
forms part of the higher-level junction structure. Rather than
providing connections to surrounding logic blocks as the low-level
junctions do, these higher-level junctions communicate only with
other junctions and hence have a rather different structure.
These ``branch junctions'' have no message port interfaces, so they
are free of the interfacing complexities associated with these. In
their place they have sixteen child junction interfaces. These
interfaces have two streams, one to send messages down to the child
and the other to receive messages from the child.
A data path between two contours could be constructed by creating
a series of connections from the data source to the destination to
provide an electrical connection from the source to the destination.
This is equivalent to a circuit-switched telephone call, and is the
conventional scheme used in logic arrays. It does however have the
disadvantage of tying up scarce routing resources for the entire
period of the communication.
Rather than treating communication between contours as a logic
operation, Switchblade uses the more efficient scheme of
packet switching. Packet switching is a slower process than
propagation of a signal down a logic line, but it permits
multiplexing of many communications on the one set of inter-junction
logic lines. While heavily used data paths can use circuit switched
routing for speed, most data paths are used relatively infrequently
and won't suffer greatly from being packet switched. At the same
time this will produce an enormous saving in routing resources.
The hierarchical routing system used was chosen for its relative
simplicity and practicality. More sophisticated routing systems are
possible and may be a fruitful area for future study.
When a contour is initially created its references to other
contours are based on their function rather than their physical
location in the logic array. The other contours that a contour
communicates with may be located anywhere on the array, or may not
even be on the array at all if they're not paged in.
For this reason messages between contours are based on virtual
address, where the virtual address specifies the type of contour a
message wants to be delivered to and the port on that contour that it
should go to. The routing system translates the virtual addresses into
a path it can use to deliver the message using what is essentially a
small routing table at the parent junction of each junction group.
The virtual addresses used are formatted in the style of password
capabilities
[AndersonPose86],
containing a random component to ensure the
``unguessability'' of message port addresses. The capability access
control system is described in section 6.6.
Routing tables in successively higher levels of junction are
correspondingly larger since they must contain all the entries of all
their children. This is a classic problem in network scalability, and
no new method for solving this problem is presented here. A common
technique is to allocate virtual address space based on general
physical areas. This is not practical on a virtualised logic array as
it is necessary to be able to place contours anywhere. A more
applicable technique is to have a high-speed route cache in each
junction which keeps the more commonly used routes handy, while still
having the more obscure routes available if they're needed.
The solution used here is to have a software contour table which
provides the definitive reference for routing. When an attempt to
send a message fails due to a routing entry not being available a
fault is raised which causes a handler to check the contour table and
insert the appropriate routes into the appropriate junctions' route
caches.
Message passing involves sending a result from a message output
port to a message input port. In the most trivial case these ports
can be connected directly together by logic lines. This provides a
high speed form of transmission but is as expensive of routing
resources as conventional logic array designs. For low bandwidth or
long distance communication paths the result will be taken by a
junction, transferred through the routing system and will end up at
an input port. The entire transaction looks to both sides as if they
have been directly connected except for the time lag involved in the
transfer.
Message ports are asynchronous. Given the uncertain distance to
the destination and uncertain propagation delays which result,
synchronous messages are not effective in communicating between
contours. Within contours of course a great deal more is known about
the structure of the circuit so synchronous communication can be
used without problems.
A message output port consists of a number of ``result'' output
lines, a ``send'' output line and an ``acknowledge'' input line. When a
result is ready on the output lines the send line is strobed. When
the result has been received the acknowledge line is strobed by the
recipient and the sender may prepare to send again. This relatively
conservative scheme ensures correct operation over a wide variety of
response conditions, from the quick response of a directly-connected
neighbour contour to the slow response of a system which is thrashing
and may not even have both communicating contours in logic at the same
time.
This system is similar in concept to conventional asynchronous
logic. Instead of providing individual asynchronous logic lines
however it operates on groups of lines, and only for inter-contour
communication.
When a junction receives a message it firstly checks its routing
table to see if it knows what to do with it. If it doesn't, it
buffers the message, acknowledges it, and sends it to the parent
junction. If it sees that it already has a direct connection from the
source to the destination it takes no further part in the transaction
as it will already have been acknowledged.
If the junction finds that the message is destined for a logic
block connected to the same junction it presents the data to the
appropriate inputs on the logic block and passes the acknowledgement
back. If no acknowledgement is forthcoming in short order it will
latch the data and do the acknowledgement itself. If another message
arrives before the destination contour acknowledges the first message
the junction will raise an exception to the pager indicating that the
source contour can be paged out until the destination contour is
ready to receive more data. This is one basic mechanism for selecting
contours to page out.
Note that the junction system also solves the input retiming
problems which are often encountered in multiple-configuration logic
arrays. Rather than providing input registers on a per-logic-block
basis such as TSFPGA does
[Dehon96a]
[Dehon96]
the
buffering between contours is provided by the junction system on a
more economical per-contour basis.
Each routing entry has a small counter associated with it which
tracks the number of messages to a particular port. These counters
are periodically checked by the paging algorithm to determine which
ports would most benefit from faster transmission. If routing
resources are available, some routes will be replaced by direct
hardware transmission.
If a message reaches the top of the junction hierarchy without
finding a destination it can be assumed that the destination contour
is not currently in hardware. This means that it needs to be paged in
or alternatively that it should be run in sequential code. Either way
it passes out of control of the hardware at that point and the source
contour becomes a candidate for paging out if it attempts to send
another message to the same port before the destination has been able
to receive it.
Another possible cause of a message failing to be routed to its
destination is if routing entries for that destination are not
available at all of the junctions from the source to the destination.
This happens if the routing cache in one of the junctions has
expired the necessary entry. In this case the contour table must be
consulted to determine which path the message should take.
Message ports can theoretically have any number of data bits
associated with them, though in practice this means that a large
message port may be split over multiple physical port blocks. The
size of a port cannot change over its lifetime - an individual port
always transmits or receives the same number of bits.
Messages need not contain any data. A message port with no data
bits is merely a handshaking mechanism. This is useful for tasks
which use a number of contours in sequence. Each contour passes
control to the next contour by sending an empty ``finished'' message.
The message passing mechanism provides reliable, in-order
delivery of messages between contours. Order is guaranteed on a
point-to-point message stream but not necessarily in a parallel
system where messages are being sent from several places almost
simultaneously to a single message port. It also provides end-to-end
flow control in the case where a message is blocked from reception
for some reason.
These blockages usually occur if the receiving contour is not
accepting data as quickly as it is being sent. In the trivial case of
two message ports wired directly together the data is passed directly
from the source contour to the receiving one and the handshaking
lines are connected together. The handshaking lines provide flow
control directly from each contour to the other.
For longer-haul transmissions the receiving contour provides a single
buffer. A message received by the destination junction is latched and
presented to the receiving message port. The handshaking lines from
the receiving message port are used to clear the buffer and prepare
for reception of another message. If the buffer is already full when
another message comes in flow control is required to prevent messages
being lost.
Communication between junctions requires a slightly different
protocol to that used between message ports. When a message is sent
from one junction to another three handshaking lines are used -
``send'', ``acknowledge'' and ``blocked''. The ``send'' and ``acknowledge''
lines perform much the same function as described earlier. An
alternative form of acknowledgement - ``blocked'' - allows a junction
to refuse reception of a message if it is unable to deliver it.
Each level of junctions in the hierarchy provides this same form
of flow control, so a message stream which is travelling between
n junctions to reach its destination can potentially have up
to n messages backed up waiting for delivery, one buffered at
each junction along the path. This is shown in
figure 6.3.
A message travelling from one part of the array to a distant one
will travel up the hierarchy for a number of junctions and then
travel down until it reaches its destination. Since messages can be
backed up at each of these steps buffers are required both in the
upward-travelling direction and the downward-travelling direction.
Downward buffers are attached to routing table entries. Each
routing table entry normally contains information on the virtual
address it is routing and where to route it to. Adding buffering to
this means that each routing entry must keep track of the entire
address of a message, the contents of the message and some flags to
aid in flow control.
If a junction attempts to send a message down to a child junction
which responds with a ``blocked'' acknowledgement signal the junction
must retain the locally buffered copy and flag it in the routing
table as being blocked from transmission. Meanwhile the child will
have flagged the routing entry as ``waiting above'' - ie. the parent
junction is waiting on this entry becoming unblocked. If another
message for the same destination is received by the parent junction
it will mark its own ``waiting above'' flag and respond with a
``blocked'' message itself.
When a leaf junction is given the signal that the destination
message port has accepted a message it unblocks its routing entry. If
the ``waiting above'' flag is set it will send a special ``unblocked''
control message to its parent junction notifying it that the route is
now unblocked. The parent junction does not pass this message on - it
only uses it to clear its own blockage flag if it has one. If it does
clear a blockage flag and the ``waiting above'' flag is set it will
generate its own control message, and so the flow control signal will
propagate back through the system.
``Unblocked'' control messages are similar to normal messages except
that they use a special bit sequence which identifies them as a control
message. They contain no data bits - their address is sufficient to
flag that the route has become unblocked.
If a junction receives an ``unblocked'' control message for a
virtual address which isn't in its route table the appropriate route
entry has probably been removed by the operating system after the route table
filled up. This is not catastrophic - it causes a fault to be raised
to the operating system which can then deal with replacing the route entry from
memory and clearing the blockage.
It is important that a queue of traffic for one port not block
transmissions to other ports. The flow control scheme associates
buffers with the addresses in routing entries. This means that if a
message is routeable it is also bufferable and only messages to that
particular destination capability will be blocked. Messages to other
ports will go via other routing table entries and so will be
unaffected.
A disadvantage of this scheme is that message ports which share a
common capability (see section 6.6.5.5) will be
blocked as if they were a single port. This means that some care must
be taken with the design of such interfaces to ensure that the
situation doesn't arise where one port is blocked with a queue of
messages but can't be cleared until a message is received on another
message port with the same capability. This will cause the port to
become deadlocked.
For this reason it is advisable when dealing with message ports
sharing a common capability that each port address being used be
converted to a location dependent capability
first. An operating system call is provided to perform this conversion, and after
conversion transmission is performed using a
different method which does not suffer from this
limitation. Since location dependent addressing doesn't use the
routing tables for buffering there is no problem with a single table
entry being used for buffering of multiple message ports.
Buffering of messages which are passing up the hierarchy behaves
in a similar manner to those moving down the hierarchy. The difference
is that messages being passed up do not, by definition, have a route
so they can't be associated with routing table entries.
The solution to this problem is to create a ``blocking cache''
which retains the same information about blocked messages for upward
transmission that the route cache keeps about blocked messages for
downward transmission. The blocking cache is a small
fully-associative lookup table keyed on the capability password.
The information kept in the blocking cache is different from that
kept in the routing entries only in that the ``waiting below'' data has
a bit for each of the subordinate junctions rather than the single bit
``waiting above'' flag in the routing table. Each of these bits keeps
track of whether a particular child junction is waiting on a blockage
being cleared for that capability.
When a blockage is cleared the junction may elect to unblock any
one of the blocked children. A round-robin service order is used to
ensure that preference is not given to a particular child junction.
When a message with a previously unknown capability arrives from a
child, it is automatically entered in a free entry of the blocking
cache. If no free entries are available a fault is raised to ask the
operating system to free up some space in the cache. This is equivalent in
frequency to the fault raised when a route is not found for a message
to pass down the junction hierarchy to its correct destination.
The size of the blocking cache should be roughly equivalent to
that of the routing table as there will be roughly equivalent traffic
in each direction. While the routine tables and blocking caches may
sound expensive they are uncommon compared with logic blocks so they
do not consume as much of the logic array resource as might be
supposed. The prototype Switchblade architecture devoted an
estimated 10% of its total area to the virtualisation and routing
system. Part of the reason why this space consumption could be kept
down to a reasonable level was the frequent binding of virtual
addresses to location-dependent addresses at run-time.
Note that location dependent addresses have a rather
different form from normal capability addresses, and so buffering is
controlled in a rather different manner.
The contour table has a twofold purpose - to track the placement of
contours on the logic area and to allow routes to each message port
of each contour to be calculated.
When a junction's routing cache fills, the addition of a new
route means that one of the routing entries must be discarded to make
room for the new entry. When a message is sent which would normally
have used the discarded routing entry it will be forwarded up the
hierarchy of junctions until it is found to be unrouteable at the top
junction. This will cause a fault to be raised.
The fault handler consults the contour table to construct a new
route. The contour table is a sparse table - it maps from the random
capability addresses to the physical locations on the logic array
where each port resides. Since the table is sparse it is implemented
as a hash table. The page fault handler uses the data from the table
to ensure that each downward step on the route has an appropriate
routing entry. Steps up the hierarchy don't need a route since
messages are routed to the parent junction by default.
If adding a route entry causes a routing table to overflow the operating system
must decide which entries to discard. The contour table also plays a role
here - it keeps track of which message ports are in use and their
approximate levels of activity. These statistics are collected using
the port usage counters and by similar methods to
those used in conventional memory paging systems.
The contour table and its importance to the operating system are discussed in more
detail in section 6.2.1.
An important feature of any parallel programming system is the
ability to synchronise access to common resources. Hardware devices
for example should not be randomly accessible by any program in the
system since simultaneous access by multiple programs may cause
invalid control signals to be sent or even damage the hardware. It is
necessary to provide a method to allow programs to access resources
without stepping on each others' toes.
Within a contour synchronisation mechanisms can be efficiently
created directly in logic. Timing between contours on the other hand
is not guaranteed, so some other mechanism is needed for
synchronisation in this case.
The message passing system provides just such a mechanism for
mutual exclusion. Since all inter-contour communication takes place
using the junctions, these same junctions can provide the facility to
arbitrate competition for access to a port.
By default the operating system automatically arbitrates access to message
ports on a one-to-one basis - when a contour is started its read
ports will all be marked as available and each can be claimed by one
and only one write port. Any attempt to start a contour which needs a
port which is already in use will result in that write port not
receiving acknowledgement until after the port has been released, it
has successfully connected and an acknowledgement has been received.
In fact the paging algorithm may elect not to start a contour at all
if it detects that not all of the required ports are available yet.
Not all resource requirements persist for the duration of a
task's lifetime. A task may desire to request access to a port for a
short period, and then return the resource for other tasks to access.
This arbitration demands a slightly different kind of message port
which is capable of requesting access before attempting transmission.
This involves two more signals - ``port request'' and ``port
acknowledge'' - in the write port interface.
When the port request line is raised a message is passed
to the destination junction requesting exclusive access to the
destination port. If the port is unclaimed the junction will claim it
and reply with a port acknowledge message. Although the write port
sees this arbitration through a two-line interface it is completely
invisible to the read port as the system performs the arbitration
itself.
If a requested port is already claimed the system will stall the
port acknowledge signal until the request can be satisfied. This
means that a large number of requests could potentially accrue for a
port which is heavily in demand. The mechanism for dealing with a
queue of port requests is similar to that for dealing with a surfeit
of messages to a message port. One request is queued at the
destination junction and any more attempts to request access are
refused by the junction, causing them to be queued a step further up
the line. If enough requests are queued eventually attempts at port
requests will be unsatisfiable directly from a write port and a fault
will be raised to tell the operating system that the source contour can be paged
out.
Not all write ports have the port request feature - the logic is
only dynamically created for ports which require the feature. The
mechanism for this functionality-on-demand is described in
section 6.5.11.
Read ports which are very heavily requested may want to provide a
deeper queueing mechanism. A first-in first-out (FIFO) buffer at the
read port can accept a larger number of pending requests and satisfy
them more quickly than would otherwise be possible. As with the port
request logic, this functionality is available as a hardware operating system
extension.
The mutual exclusion facility described does not provide for
anything except for simple ownership. There is no inbuilt facility
for shared access, or for the common variant of shared read but
exclusive write access.
Generally such ``shared access'' systems are actually sequential
access systems at the lowest level anyway. Sequential access can be
easily achieved by doing accesses which request and release the port
with every transmission. This may be inefficient due to arbitration
overheads however, so a better alternative may be to provide a read
port for each of the competing write ports and provide logic for
whatever arbitration scheme is desired in the receiving contour. This
is at the discretion of the application programmer.
It is quite possible that the programmer of a contour may not
know how many contours will be competing for its resources and so
will be unable to provide the appropriate number of ports. In that
case there is one other possibility - ports which explicitly do not
require arbitration. These ports do not accept normal messages, they
only accept fault messages as described later.
Fault ports are used by the operating system to receive problem messages
generated by the system. These are analogous to interrupts in
conventional general purpose processors. Faults are generated under a
variety of circumstances relating to the virtual logic and message
transmission system. This is not their only application however - any
program can generate or receive fault messages for whatever purpose it
likes. A fault message is really just a datagram message which
contains both source and destination addresses. Faults are described
in more detail in section 6.5.9.
Fault read ports are just like normal read ports except that they
do not require arbitration before access and they normally send
the port address of the transmitting port along with each received
result.
By definition fault read ports will always accept messages
asynchronously from anywhere in the system, so they cannot be said to
arbitrate access. Some user-defined arbitration mechanism could be
layered on top of fault ports to deal with complicated shared access
requirements if the individual application required it.
Most read ports in the system are matched with a single write
port at compile time so there is never any competition for access.
Ports which use the port request feature can make their access
requests at random however, leading to potential deadlock situations.
One situation is if a contour has two read ports which are used
to accept separate parts of a single request. If two different
contours were to simultaneously request connection to this service
and through random bad luck in the timing of their requests one gained
access to the first port while the other gained access to the second,
they would both end up waiting indefinitely for the other to release
its claim on the opposite port.
The simplest way to resolve this type of deadlock situation is to
have only one read port which combines the function of the two
smaller read ports. Since there is no size limit on ports this is an
effective answer in cases where combining ports has no unfortunate
side-effects.
Another way is to ensure that all port requests are made in a
predefined order - such as by increasing order of contour address
followed by port address. This might be effective, but is difficult
for programmers to achieve since the numbering of contours is chosen
by the system and is not explicitly available to the programmer. Even
if this information was made available it would be inconvenient to
incorporate and the system-chosen numbering could well imply an order
of resource use which was not actually what was desired the program.
This method was not used in the Switchblade prototype as it was
deemed too inconvenient to the programmer.
The system provides a mechanism for reserving a set of resources
as an indivisible operation. This is not a hardware function - it is
provided by the operating system. It attempts to claim all the ports which have
been requested and if any request fails it immediately releases any
which it has succeeded in claiming. It then delays until the
offending port is released and tries again.
This is by no means a foolproof scheme but it does reduce the
danger of deadlock greatly in many situations.
Faults are used by the system as an equivalent to the hardware
interrupts found in most general purpose processors. Unlike in
general-purpose processors these faults can be handled either in
hardware or software, or a combination of the two.
Faults take the form of messages, and have a similar form to that
of normal inter-contour communication. The most important difference
is that while most port-to-port communication is one-to-one, faults
can be generated from anywhere in the system. To identify where the
problem occurred the address of the source contour is sent along with
the message. This information is used in much the same way that stack
context is used by an interrupt handler on a conventional processor.
Like interrupts in normal processors, faults can be generated
internally, externally or in code. Internally generated faults
include message queue faults (generated when a message port has a
backlog of messages) and unrouteable message faults (generated when a
message can't be routed to the addressed destination). Internal
faults are generated by the hardware itself in response to a
condition which the system has detected.
External faults are generated by an input logic level. They
perform much the same function as an external hardware interrupt on a
conventional processor, causing control to be transferred to a
selected piece of code on reception of an external logic transition.
Unlike hardware interrupts they do not stop execution of other code -
since the logic array processor is inherently parallel in operation
an external fault just means that the appropriate fault code will be
executed. If it has some system administrative function to perform
such as task switching it may end up stopping a task, but this is not
part of the standard behaviour.
Faults can also be generated in code. In conventional processors a
software interrupt or ``trap'' can be used to transfer control to a
system routine. A similar effect can be achieved by generating a fault
directly from a message port on a contour. This differs from sending a
message on a normal message port only in that the address of the
source contour is sent along with the message.
Faults are really just an extension of the inter-contour
communications mechanism. They provide a facility for sending source
address information along with a message. This changes the format of
the message from (destination, result) to (source, destination,
result). Just as interrupts are an extension of the subroutine call
mechanism in a conventional processor, faults are an extension of the
message passing mechanism in this architecture.
Interrupt mechanisms traditionally have an associated set of
interrupt masks. These masks have two main functions - to prevent a
low priority interrupt from taking precedence over a higher priority
interrupt and to provide a simple mechanism for mutual exclusion
between interrupt handlers and other code.
Given the parallel nature of tasks running on logic array
processors there is no need to prioritise faults. Since multiple
fault handlers can usefully operate at the same time there is no good
reason to prevent them from doing so.
This is not to say that mutual exclusion between fault handlers
isn't an issue. However the system already provides general
primitives for mutual exclusion which are quite adequate to the
purpose. With these primitives already available it isn't necessary
to provide a specialised mechanism such as interrupt masks for the
same purpose.
Conventional processors usually have a numbered set of interrupts
which cause calls to the appropriate interrupt handlers found in an
interrupt vector table. The table provides a mechanism for
translating a numbered vector into the memory address of the handler.
No such mechanism is necessary in a logic array processor as messages
to contours are already addressed by a virtual address. If a set of
contour numbers are reserved for faults then the existing routing
mechanism will automatically translate the address and nothing more
need be done. Message ports need only be configured to generate a
fault message with the appropriate fault number as the destination
address.
A efficiency problem with this scheme is that routing information
must be provided for faults as well as conventional messages. Since
faults are often relatively rare their routes won't appear in the
route cache and another fault will be generated to add the route to
the route caches of all the junctions on the fault's path. This
implies a considerable overhead which may well be wasted if the fault
doesn't occur again within a relatively short period of time.
The ``Location dependent addressing'' feature
provides a much more efficient way of handling this kind of access,
resolving the virtual number to a hardware route. This translation is
automatically performed by the operating system at its discretion. Location
dependent addressing obviates the need for an entry to be created in
the route cache of each junction along a communication path.
``Short addressing'' further optimises the address
for specific operating system faults to a short form which uses less array space
and is even more rapidly routed.
An important feature of interrupts in conventional processors is
that they save the state of the currently-running task and can return
to it later as if nothing had happened. Unfortunately this kind of
approach is not appropriate in a logic array processor. Since logic
arrays are essentially parallel it is not clear whether tasks should
be halted at all. Moreover, saving state is not just a matter of
pushing a few registers on to the stack - the entire set of variables
of the program could potentially be in hardware so saving the state
may end up being a lengthy process. Even if the entire program state
was saved, there is little evidence to suggest that this would serve
any useful purpose anyway.
It is clear that faults and interrupts have quite different
properties. Still the need to be able to alert the system to
emergency conditions remains. We also need a way to prevent a program
which has experienced problems from running out of control before the
operating system is able to deal with the problem.
For this reason an option is available on fault send ports which
both sends the fault message and halts the contour outright. The
contour will remain halted until the operating system fault handler explicitly
restarts it.
All contours have a master signal which tells them to halt. After
this signal has been raised they will have halted all internal
activity within a set period of time - 16 ``fast clock'' cycles in the
test system. Once the contour is halted its state can be examined by
a fault handler, swapped to backing configuration store or saved to
memory. The halting mechanism is provided by the logic array itself in
collusion with the circuit synthesis system.
The inter-contour communications mechanism provides for
one-to-one communication between contours and many-to-one
communication via faults. The remaining possibility is one-to-many
communication - in other words a message port which can send messages
to any message port address at will.
Access to this type of port opens a possibility for security
abuses. The ability to send random messages to random ports could
easily confuse the operation of an algorithm. Spurious messages sent
to a contour will most likely cause it to malfunction. The capability
security system described in section 6.6 provides some protection
against this form of attack.
One fundamental advantage of reconfigurable logic is that it can
be generated as needed. This feature is used by the system to create
system services only as needed.
For instance the junctions contain all the necessary hardwired
circuitry to do normal routing and message handling. They can
also internally generate faults. They do not have the ability to
accept external or code-generated faults though and they also can't do
dynamic destination selection. It isn't worth adding this extra
functionality to every junction since these features are relatively
rarely used.
These functions are provided by extra reconfigurable logic which
is created by the system as required. The logic generated for these
purposes is unique in that only the operating system is allowed to interface in
these special ways to the junctions - no contour has this ability.
This extra logic remains in the domain of the operating system - it looks to the
contour just like a different message port interface.
The ``port request'' feature available optionally on write ports and
the ``fault port'' types of read and write port are examples of extra
functionality available through operating system extension. Another is FIFOs.
Another extended message port function is FIFOs. These provide
buffering between contours.
In a situation where the source and destination contours of a
result stream have very different timing characteristics a FIFO may
increase overall performance greatly by storing extra results until
they can be consumed.
For example take a situation where one contour produces results
in brief spurts and then pauses for a while until it has calculated
the next set. Meanwhile the destination contour processes results at a
relatively constant rate. When the second result reaches the
destination contour it will still be processing the first, so the
source contour will be made to wait until it is ready to receive
another result. This will continue with each of the results slowly
being accepted while the destination contour processes as quickly as
it can. In the meantime the source contour is almost completely idle.
Once the last result of the burst has been sent the source contour
will go back into heavy processing mode for a while. In the meantime
the destination contour will be idle.
Placing an appropriate-sized FIFO between these contours allows
them both to run at full speed for most of the time. The source
contour produces its first set of results and these are buffered by
the FIFO. The source contour then continues with its calculations
while the destination contour churns gradually through the previous
set of results.
Modern processors provide the ability for the operating system to
restrict the access of user programs. This prevents user programs
from using resources which they do not specifically have permission
to use and from interfering in the operation of other programs. This
architecture is no exception. In fact the need for access control is
even more pressing in the case of a reconfigurable logic array
processor as a malevolent process could potentially damage the
processor itself by configuring it with an invalid circuit.
The security model for a logic array processor must be quite
flexible as the very nature of the processor allows some rather
unusual operations. It should be possible for instance to modify a
configuration on the fly, under the control of a user program. Yet it
should not be possible to create a damaging configuration. It should
be possible to transfer access rights to other subsystems such as
libraries. Yet there should be some mechanism for preventing rogue
processes from obtaining privileges they shouldn't have, and from
interfering with the operation of the rest of the system.
In this section I will use the terms ``security'' and ``access
control'' with slightly different meanings. ``Access control'' I will
use to mean ``provision of resources to those tasks which need them''
and ``security'' I will use to mean ``preventing tasks from gaining
unauthorised access''.
The basic method of access control is by capability. Each
resource in the system has a capability address. Access to a resource
is through its capability address. A capability address might
correspond to a message read port, for instance. ``Writing'' to that
address will cause data to be transferred to the read port of the
appropriate contour.
Unprivileged tasks will not be able to send to that port by virtue
of the simple fact that they don't know its address. Tasks can give
away access rights to other tasks by passing them the capability
addresses of whatever resources they want to give away access to.
Capability addresses have an equivalent in the folklore concept of
``true names'' - to use someone's true name was to have power over them.
In the case of capability systems the ``true address'' is the only method
to have control of a resource.
Note that simply giving away a capability isn't enough to ensure
that access rights will be transferred - many access rights are
considered non-transferable by the system. An example of this is the
default style of message ports which are paired when a task is
started and only become unassociated again when the task ends. This
gives no other port a chance to muscle in on their communication.
An essential feature of capability addresses is that they are hard
to guess. To this end the operating system deliberately generates
capability addresses as long pseudo-random numbers
[WallacePose90]
[AndersonPose86].
It is important
that a good pseudo-random number generator be used as it should be
impossible for even a malicious task to guess the sequence being used.
Capability addresses should be quite wide. The exact width used
is a tradeoff between the system overhead of handling large addresses
and the security risk of using shorter, more guess-able addresses. The
random ``password'' number is usually just a subsection of the
capability too - the remaining bits of the address may contain
additional information regarding a subsection of the particular
capability being referred to.
Switchblade uses capability addresses with a number of
fields. A standard Switchblade capability has 32 bits of
password capability address, two bits of capability type information,
and 30 bits of sub-address. Often the sub-address is ignored or only
a limited number of bits are passed through. Message ports for
instance only use four bits of the sub-address, so only four bits are
passed through the system to them.
Even with 32 bits of capability password it is conceivable that a
malicious task could try all four billion combinations in search of
the key which will unlock access to a resource. Once it had access it
could use the capability to gain unauthorised access to data, disrupt
operation or gain access to further resources.
The defence against this is that each attempt to access an
invalid capability causes a fault to the operating system. The transmission of an
invalid capability in the hardware routing system will cause it to be
sent step by step to the root junction as no junction will have a
routing entry for it. The root junction will then cause a fault to the
operating system saying that no route has been found. The operating system route handler will look
up the capability in its contour table, find that no such capability has
been allocated and raise a security fault.
Security fault handlers are responsible for detecting an
attempted intrusion and preventing the malicious task from continuing
its attempt. This is not as trivial as it may sound since it is
difficult to identify which task was the malicious one. The task
which passed the invalid capability to the system may not be the
malicious task, it may merely be a dupe task which was handed the
invalid capability by the malicious task. Terminating the task which
passed the invalid capability to the operating system would leave the system open
to a denial of service attack. Malicious tasks could pass invalid
capabilities to any program which accepted parameters and cause them
to be terminated when they tried to access the parameter
capabilities. The scheme used in the prototype system was relatively
simple and inadequate to prevent denial of service attacks.
Even though the prototype Switchblade system was overly
simplistic in its operating system support for capabilities it achieved its purpose
of providing enough architectural support to make the development of
a true capability-based security system possible. This type of
security system is a fairly well explored area so implementing operating system
support for a complete one should be a relatively straightforward
implementation task. Implementing a complete operating system is outside the scope
of this thesis.
There are a number of special capabilities in the system which
control access to fundamental system resources. Device handler
capabilities are some of these - another is the capability which
gives access to the junction configurations. Any task which has the
junction configuration capability effectively has control of the
entire system - the junction configurations contain most of the
useful capabilities in the system in their routing tables. It goes
without saying that such capabilities as that for junction
configuration are closely guarded secrets.
Generally speaking, each resource in the system has an individual
capability. Message ports are no exception - they are by default
addressed individually, with a different capability for each port.
A large system with many message ports will naturally have a
large number of capabilities to keep track of, and each capability
which is in hardware has a significant cost in routing table entries
and message port hardware. Added to this is the problem of tasks
which want to pass capabilities to other tasks. It may be desirable
for them to pass access to a set of related ports with a single
capability rather than having to pass a number of individual
capabilities to a task.
For these reasons it is possible for a number of message ports in
the same junction to share a capability password. They can then use
sub-addressing to determine which of the message ports within the
capability is being referred to.
Capabilities provide an effective and flexible method of access
control. The disadvantage is that they are also quite expensive in
hardware. Since capability addresses are very wide - 64 bits - the
hardware necessary to handle such wide addresses for each message
port could well be greater than the hardware handling the rest of the
message port. To reduce this expense most message addressing bypasses
the capability system altogether. While all user programs see the
system as operating entirely on capability addresses, in fact many
addresses are converted to hardware locations by the paging system
when a contour is loaded into hardware.
This optimisation uses a different style of address which is more
efficiently routed than the full capability address. Unlike
capability addresses these addresses make no attempt to be
location-independent - they only work when both ends of a message
link are in hardware and the address is entirely dependent on the
locations of the communicating contours in the logic array. These
addresses only exist for as long as the two contours remain in
hardware, and in those specific locations.
Most write message ports on a contour communicate directly with a
corresponding read message port on another contour and are connected
to this port for the entire lifetime of the task. The paging system
is aware of the capability address of each message port and of the
links between all the contours in a program. When a write on a ``hard''
message port causes another contour to be brought into hardware the
paging system automatically converts the send address of the writing
port to the location-specific address. These short-form addresses are
transparent to user tasks - they are hidden in the configuration of
the junctions and cannot be accessed by the user task at all.
Location-specific addresses are prefaced by a bit string which
identifies them as being different from full capability addresses.
The remainder of the address is in effect a routing path for the
message to take through the system. Translated into English, this
type of address might equate to for example ``up to parent junction,
down to fifth child junction, send to ninth read message port''. This
style of addressing is significantly shorter than a full capability
address and doesn't require the same complexity of junction hardware
as the full capability addresses. Since the paging system attempts to
place contours in such a way that they are located physically close
to the contours which they communicate with the addresses are kept as
short as possible.
In fact since full capability addressing is the exception rather
than the rule, many leaf junctions may not have hardware for routing
capability addresses at all. The basic leaf junction hardware in
Switchblade doesn't include the expensive translation tables
necessary to convert capability addresses into port addresses. If
such hardware is required it can be added to the junction using
reconfigurable logic, and only for those ports which need it.
When algorithms are synthesised into hardware the intention is
that all addresses which are static for the duration of a contour are
automatically synthesised as location-dependent addresses. The actual
location-dependent address for a message is configured into the
message port by the operating system at load time or when the destination contour
is moved into logic. The prototype system used an on-demand method
for this, configuring port addresses lazily when an attempt was made
to access them. The default unconfigured state of a message port
caused a fault to be raised and the runtime loader handled the
back-patching of the contour from there.
It is important to minimise the distance which messages must
travel to reach their destinations - reducing the number of hops not
only reduces the length of the address and hence the space used, it
also reduces the transmission time.
Location dependent addresses have a rather different form from
normal capability addresses. Since the format of the addresses
specifies the path taken messages are freed of the need to have
corresponding routing table entries. This lack of routing table
entries makes buffering and end-to-end flow control for location
dependent addresses rather problematic as this functionality is
otherwise provided by the routing tables. Adding an equivalent
per-junction buffering mechanism is overly expensive of logic
resources and would defeat the purpose of using lightweight location
dependent addresses.
Instead, an entirely different method of flow control is used.
Each message is sent and acknowledged immediately at the sending
message port as usual. Buffering and queueing happens only at the
receive port, and it is assumed that ``enough'' buffering will be
provided to deal with most cases. Optionally-configurable
FIFOs are used for this purpose.
When a message arrives at a blocked receive port the only way of
dealing with it is to prevent the sender from sending any more until
the port is clear again. There is no direct way of doing this, so a
bit is set in the receive port to indicate that it has overflowed and
a fault is raised to the operating system. The fault handler is normally in
hardware since it must act quickly to prevent too many more messages
from being sent to the blocked port.
The fault handler's first action is to set a bit in the send port,
blocking it from further transmission. The actual behaviour of this
mode is to prevent sending of any messages and to not provide the
usual acknowledgement on the send port. The handler then queues in
memory any messages from that source port to that destination port. A
number of messages may have been transmitted before the source port
is disabled so the operating system must provide for reception and queueing of a
number of messages.
When the receiving message port is cleared (or the attached FIFO
drops below its low-water limit) another fault is sent to the operating system
informing it that the send port can resume transmission. The fault
handler transmits the queued messages to the receive port and if it
does not overflow again during this process the send port will be
re-enabled for transmission.
Fault ports may receive messages from a number of source ports, so
disabling a single source port may not prevent the reception of
further messages on the destination port. In this case each message
from a new port will cause a fault to be generated and the operating system will
disable each of the sending ports in turn while queueing the messages
for later retransmission.
Some ports such as those which provide system services or library
functions to user programs are far more frequently used than others.
Often though these services will be sufficiently large and complex
that they always run ``soft''. In that case the address can take a
special short form which raises a fault to the operating system directly, without
requiring the full capability address of the server port to be
specified. This requires each of these special ports to be registered
specifically with the operating system. User tasks cannot normally do this - only
privileged operating system services are permitted to use this limited
short-address space.
Another common routing optimisation is to provide a direct wire
link between two ports. This is the normal mode of communication in
standard FPGAs but in Switchblade it is less common since
routed message passing takes the place of wires in most cases. There
is no substitute for directly configured wires if speed is of the
essence however, so the system may decide to establish a wired
communications link between two ports. This kind of link can be
considered to be a special case of location-dependent addressing. In
this case the addressing takes the form of establishing electrical
connections from the source to the destination using the
configuration hardware. No ``address'' as such is stored in hardware as
the operating system creates the routes directly in configuration store.
Careful synthesis of communicating contours can mean that
high-bandwidth ports are usually located physically close to each
other. This makes wired routes short, inexpensive and fast. While
there was no specific provision for controlling contour layout in the
prototype Switchblade some knowledge of the loader's behaviour
meant that manually generated programs could be made to run very fast
in this way. A more sophisticated system might provide ``hints'' as to
what ports needed to communicate at high speed and provide synthesis
tools and operating system support geared towards this.
Each capability address has an extra sub-address which is not
used in the inter-junction routing process. It is designed to be
used for sub-addressing within a capability. For example a memory
address capability might give access to a page of memory and the
remaining sub-address bits give access to individual memory locations
within that page.
Sub-addressing gives the ability to confer access to a whole group
of resources with only one capability. It also vastly reduces the
number of capability routes which need to be stored in hardware,
reducing the hardware cost of supporting capabilities significantly.
Often the port addresses of a contour will all have the same
capability and differ only in sub-address. This is not a security
risk as most contours only communicate with other contours of the
same task. Since it must be assumed that you can trust yourself there
is no danger in providing access to all the ports through the same
capability. On the other hand contours which communicate with parts
of the system outside a task cannot be trusted and the ports which do
this dangerous communication are given safe addresses which protect
the contour from interference on its other ports.
In most cases this will not even be necessary at all. If a
contour can guarantee that all its read ports are automatically
connected to other ports when the task is started it can already be
assured that no interference with its operation is possible, and may
blithely quote its own private capability address to external
entities, safe in the knowledge that they cannot harm it.
Under some circumstances a task may want to give another task a
capability, but not want to take the risk that it will pass it along
to a third party.
It is not immediately obvious that this is necessary in the
ordinary case of two ports being set up to communicate with each
other - the read port is claimed for the duration of the task and
cannot be interfered with. Still, it's possible that if the second
task passed the capability to a third task then this task could use
the capability to interfere with a later instance of the first task.
For these reasons a mechanism is provided to create capabilities
which are non-transferable. When the system creates a virtual
connection between two message ports at task startup it creates the
connections to capabilities which corresponds to instances of
the message ports, not a persistent capability which is present
between executions of the task.
This method is used for connections which persist for the
duration of the task, but is not effective for fault ports as they
are not connection-oriented. An operating system extension can provide extra
functionality in the junction which allows fault read ports to select
which write ports they will communicate with. Messages from disallowed
ports will be transparently discarded.
The expense of this message filtering is not inconsiderable, so it
should be used only when security is a major consideration. In systems
without malicious agents this kind of situation is unlikely enough to
occur by accident that it ceases to be an issue.
Sometimes a capability will confer more than a single access
right on the holder. A memory capability for instance may give access
to a range of memory locations, and may permit read, write or
software execute privileges. A task may wish to pass some but not all
of these abilities to another task. A common example is where another
task is provided with read-only memory access to a shared memory area
along with some mutual exclusion capabilities to mediate
communication.
An operating system service is provided to create a restricted capability of
this kind from an existing capability. Note that access can only ever
be restricted in this way, existing restrictions can never be relaxed.
The ways in which restrictions can be applied depends on the type
of resource in question. With virtual memory it may be possible to
restrict the scope of memory access, but this may only be possible on
a per-page basis rather than on a byte-by-byte basis.
A task possessing a capability to a set of ports on a contour
might want to give away access to a reduced set of these ports. This
is more effectively achieved by providing the reduced set of ports
with a different capability in the first place, and so is not
supported by the system.
The ability of a program to be entirely executed on a logic array
is limited by the size of the program - if it's larger than can be
fitted on the logic array then parts must either be swapped or
executed as sequential code. As discussed in section 4.2, algorithms
can be expressed in a variety of forms, some of which will be more
expensive of array space than others. In cases where it isn't
possible to fit an entire program in the logic array at one time it
may be possible to execute the less heavily used parts of the program
as sequential code. Which parts of the program are to be executed
sequentially and which parts are to be executed in hardware can be
determined at run-time by the paging system.
In chapter 4 it was pointed out that the distinction between
sequential and parallel code is not absolute - gradations in between
are possible which trade off logic array space against faster (and
more parallel) execution. This aspect is not part of the
Switchblade simulation but is definitely a promising area for
further research. Switchblade provides only the coarsest level
of granularity - hardware or sequential.
All object code to be executed on the system comes in two
forms - contours to be executed on the logic array and sequential
code to be executed by a sequential processor if no space is
available in the logic array. The paging algorithm determines which
method is used to execute an algorithm at any given time.
Any algorithm which is executed in software is assumed to be
little enough used that no great performance hit will be taken by
executing it in software. This implies that the sequential processor
need not be particularly powerful, either.
Rather than provide a powerful external processor, the system
permits areas of logic to be configured to a special ``cooperative
processor'' mode. A very small amount of specialised logic is
provided in each logic block to optimise the array for a specific
configuration which is a sequential processor. Each eight by eight
logic block area of the array has a repeating pattern of this special
circuit, meaning that a sequential processor can be aligned only on
eight-block boundaries. At the cost of two percent extra space used
for the special logic it permits the sequential processors to be
squeezed into only 60% of the space they would otherwise require.
Any number of sequential processors can be present on the array
at once. Due to their bit-slice design they can also be ``stacked'' to
provide processors of greater word widths.
Processors have access to the configuration store, and will
commonly be used to create complex configurations on the fly as a
symbiotic hardware algorithm is operating. Since reconfiguration can
be a relatively complex and infrequent operation it makes sense to
conserve logic array space by executing it in software.
Instruction code for the processors can be accessed either
directly from configuration store or from an external RAM. Running
code from the configuration store saves on I/O and allows quicker
execution, but is less frugal of precious configuration space.
The sequential processors have a very simple four-bit RISC
design. They are bit-slice processors so they may be conveniently
cascaded to provide wider data paths.
In the standard configuration a processor has a simple
single-register architecture. By adding more logic blocks it can be
expanded to have more registers.
Many of the components required by the processors are common with
the functions provided by the logic blocks. The bit-slice processors
each have ALU, register and counter functions. These are also
available to the logic array as part of the conventional function of
each logic block when it is not configured to be part of a processor.
Similarly the support provided in each logic block for state machines
is shared by blocks which are configured as instruction execution
units when used in processors.
The instruction stream is interpreted by an instruction execution
unit which is simply programmed into a logic block - while some
special support is provided by the array for this function a level of
flexibility is retained to allow for flexibility in the instruction
set. More complex execution units can be created using multiple logic
blocks. A great deal of the data path routing functions in the
processor are taken care of by the use of a combination of the
multiple configuration stores and routing circuitry. When the
processor enters a state which requires a particular data path to be
set up it selects a corresponding configuration store which has the
appropriate routes programmed into it.
For larger processors a configurable number of logic blocks can be
configured as the program counter, depending on the address width
required.
It is quite viable to configure the whole logic array as a huge
number of parallel sequential processors. In some applications this
will provide greater processing power than using the array for a
smaller number of higher-speed functions.
One very desirable feature which was not implemented for
Switchblade is an ability to translate between variables in a
hardware implementation and those in a software version. This would
permit contours to run in either state depending on the amount of
logic available and the context to be transparently switched from a
hardware one to a software one or vice-versa. The algorithm to do this
would have to operate in collusion with data obtained from the
synthesis system. This is a worthwhile area for further research.
A simulation of the Switchblade virtual logic system was
used to assess its performance potential. Creating this simulation
was complicated by the fact that no circuit synthesis tools for
Switchblade currently exist. These tools are outside the scope
of the thesis and how to create them is a research area in itself.
As an alternative to a direct simulation a higher level simulation
was created. A large, complicated program of a type which is normally
poorly suited to execution on logic arrays was analysed and an
execution profile formed. This profile was then applied to the
Switchblade model and estimates of the system performance were
generated.
In any paging system the paging algorithm has an extremely
important influence on performance, and the virtual logic paging
algorithm used in Switchblade's is no exception. The algorithm
was created from scratch for the simulation using a relatively simple
random page replacement strategy. The algorithm for contour checking
and page replacement is shown in figure
6.4.
The preferred configuration number is the configuration store of
the calling contour. Contours are preferentially loaded into the same
configuration store as the contours which are most likely to precede
and follow them. This increases the likelihood that a working set of
contours will be placed in the same configuration store and hence less
configuration switching will be required. Note that since
configuration stores are switched on a per-page granularity it is
still possible to have two communicating contours simultaneously
switched in even if they are in different configuration stores.
Keeping them in the same store merely decreases the likelihood that
they will physically overlap, which would require fast configuration
switching for them to communicate and prevent parallel operation.
Loading of contours uses a set of free lists arranged in a
per-configuration-store basis. The free lists are indexed to
facilitate quick searching for contiguous areas of logic sufficiently
large for a contour to fit in. The prototype algorithm was not
particularly clever about fitting strangely shaped contours together
efficiently. Fortunately the vast majority of contours were very small
- only one page in size. See figure
6.5 for the algorithm.
For the simulation an application was chosen which had
characteristics uniquely unsuited to logic arrays. The application was
chosen for qualities of poor parallelism and a large code size. The
GNU C compiler ``gcc'' was chosen as an application with these qualities.
No attempt was made to exploit the parallelism offered by the logic
array processor - the run profile used in the simulation was in fact
obtained from execution on a conventional sequential processor.
Each function in gcc was translated into an equivalent area of
logic, to be regarded as a ``contour'' for the purposes of the experiment.
An estimate of the number of pages taken by each contour was used to
determine the space used by each contour on the array. Pages comprise an
8 by 8 area of logic blocks. The largest contour in the example
was 497 pages, necessitating a minimum array size of 512 pages.
The vast majority of contours were much smaller, mostly one or two
pages. For the following simulation results the application began in
a completely unloaded state and was demand loaded during execution.
Simulating the run by compiling gcc into a logic form and
simulating the logic for an entire simulated ``gcc'' compilation run
would have taken years of simulation time on the available equipment.
It was decided to vastly increase the speed of the simulation by
working from profiles of gcc running in software on a conventional
processor. These profiles were then used to guide the flow of
control between contours in the simulation. Since this simulation was
of the virtual logic system and not of the logic array itself the
lack of low-level simulation was not crucial.
The run profiles were generated by creating a specially
instrumented version of gcc 2.7.2.2 which included checkpoint logging
at the beginning and end of each function. These checkpoints included
timing information which was used to monitor how much processor time
was being consumed in each function. The times were not adjusted to
count a possible speedup due to the direct implementation of these
functions in logic. Keeping in mind that this is not a
simulation of the overall performance - only that of the virtual
logic system - accurate simulation of the speed of the logic system
was not a priority. Comparing implementation in logic with software
implementation on a conventional processor, generally a small speedup
might be achieved since many instructions in a normal program store
data or transfer data around and can be replaced by much faster local
wiring in a logic array. Rather than attempt to make predictions of
this sort for all types of code the assumption was made that
execution would take roughly the same time on each architecture -
which is probably true to within an order of magnitude in the worst
case and assuming that no advantage is taken of the parallelism of
the logic array. Since no genuine implementation exists it is
difficult to determine Switchblade's low-level speed
characteristics accurately, but they can be assumed to be roughly
equivalent to those of comparable FPGA systems if none of
Switchblade's special features are exploited. For non-parallel
applications FPGA systems generally run faster than conventional
processors in applications where a great deal of data manipulation is
involved and slower in applications where a great deal of
floating-point computation is involved (which is normally executed on
a heavily optimised floating-point unit). The compromise here was to
assume an average of approximate equity between Switchblade
and a conventional processor, and given that this high-level
simulation was intended to model the behaviour of the virtual logic
system and not the low-level performance of the logic array this was
considered an adequate approximation.
Each simulation run was guided by these basic estimates of how
long each function would take to run and at what points it would
transfer control to other functions. Each transfer of control was
simulated as a transfer of data between Switchblade contours.
The time taken to send data between contours was modelled, as was the
time taken to switch between multiple configuration stores and the
time to load in contours from external memory if they were not paged
in. At no point in this simulation was more than one concurrent
operation being performed - simulating the slowest possible
circumstance as far as a highly-parallel logic array system is
concerned. Since the simulation assumes that Switchblade
functions run at the same speed as the MIPS 4000 processor on which
the profile was executed, the best case speed would be equal to (or
marginally slower than) the original profile.
In a case where a very large array is available little switching
will occur and communication will pass between contours quickly. If
transmission of messages requires switching between configuration
stores some overhead is incurred and message passing takes somewhat
longer. In the worst case where the logic array is small and
destination contours are likely not to be configured anywhere on the
logic array at all OS intervention is required and the contours must
be loaded into the configuration store before messages can be
transmitted - slowing communication dramatically.
The simulation program used the paging algorithms described
earlier to reproduce as accurately as possible the effects of gcc
being run on Switchblade. As stated earlier this kind of high
level simulation cannot claim to be very accurate, however the
purpose of the exercise was to determine if the virtual logic system
was viable in general terms so estimates of the lower-level function
were judged to be adequate. The purpose of the simulation is to
discover the variety of conditions under which it behaves well. A
number of the following simulation results address the performance
of the system when various design parameters are modified and
examine the circumstances under which it performs poorly. Some
attention is also paid to how gracefully performance degrades in
these situations.
Figure 6.6
shows a graph of performance against logic array size
and the number of configuration stores. A performance of 1.0 implies
that the application was able to run at full speed without being
hindered by paging activity. As expected, performance gains are realised
both by increasing array size and increasing the number of configuration
stores.
For the characteristics of the example ``gcc'' application,
the Switchblade prototype's parameters of 1024 pages and
eight configurations are close to the optimal choice, providing high
performance and economical hardware cost. Of course the level of
performance will vary with the size and number of applications
running, but the ability to run gcc efficiently indicates that the system is
in the same order of performance as current-generation general purpose
processors.
Irregularities in the graph demonstrate a slight affinity of the
application for particular scalings of the architecture. Note for
instance that a seven configuration, 512 page system performs better
than one with eight configurations. This is due to a combination of
attributes of the application, the array space and the paging algorithm.
The two sources of overhead in the system are when pages are
loaded into configuration store and when configuration stores are
switched between. Of the two, loading configurations is vastly more
expensive. Loading a single configuration store for a single page in
the simulated system takes approximately 20 microseconds. Switching
between configuration stores takes a few nanoseconds, and
consequently has negligible impact on the overall performance as long
as it is kept to a sensible level. In this simulation the levels of
configuration switching were low enough that they were invisible
against the cost of configuration page loads.
Figure 6.7
shows the total number of configuration page loads
for each type of logic array. In the case of the 512 page, single
configuration store array the system was thrashing, spending 96%
of its time loading configurations and only 4% of its time computing.
Figure 6.8
shows the number of configuration switches occurring
during execution. Each switching operation affects an entire contour
at once and occurs within nanoseconds, so the effects are relatively
insignificant. The graph does however demonstrate the effectiveness
of the paging algorithm. More switching naturally occurs in
situations with more configurations or smaller logic arrays. The
relative randomness of the graph shows the complexities
of the application interacting with the paging algorithm at various
array sizes.
The characteristics of the gcc execution profile can be seen clearly
in figure 6.9.
The x axis is execution time, from program start at the
left to termination on the right. Each location on the y axis corresponds
to a single contour. The graph demonstrates the execution of each contour
over time. The first quarter of execution is parsing. Next follows a
brief phase of code generation and then optimisation. Further phases of
code generation are interspersed with more optimisation for the remainder
of the run. This diagram gives some idea of the changing code requirements
of the application over time.
Note that in many of the following graphs simulation time is
given in seconds rather than using ``ticks'' or some other simulation
time unit. This is quite deliberate as ``ticks'' are not particularly
applicable to this kind of system so the simulation was performed
approximating the performance of an implementation with current-day
hardware performance. The reason why ``ticks'' are not applicable is
that inter-contour communication is variant in the amount of time it
takes dependent on many factors including the distance on the array
between two communicating ports, the amount of data which needs to be
transferred and what options are implemented on the ports. There is
no central time tick which regulates the transfer of data around the
system at the lowest level - much of the port communications logic is
asynchronous so it isn't really possible to apply the unit of
``ticks'' to this simulation. Instead some estimates of the likely
transfer times were made given the design described in this chapter
and assuming a logic speed consistent with current-level technology.
The time taken by processing within a contour is also not necessarily
synchronised to any clock, but will take whatever amount of time the
logic circuit within the contour happens to take. For these reasons
seconds are used as the time unit for the purposes of this
simulation, even if the actual time used in a real device will differ
depending on the technology available. The timings are in any case
approximate and are not crucial to the experimental results, which
are more related to the performance degradation characteristics of
the virtual logic system than the outright performance of the
Switchblade system as a whole. The markings on the horizontal
axis of these graphs can therefore be considered an example for one
implementation rather than definitive. Other implementations would
have similar characteristics but a different scale on this axis.
Figure 6.10
shows the working set of the program over time. A relatively
small, constant working set during parsing later varies wildly between
optimisation and code generation phases. The number of pages in the working
set never exceeds 2800. This sample data was applied to an array of
1024 pages and 4 configuration
stores. This size was chosen as it demonstrates a point on the performance
curve where degradation is beginning to be serious. Note that the total
number of pages available is 4096, significantly higher than the theoretical
minimum of 2800 which are required to prevent thrashing assuming a
100% effective paging algorithm and no fragmentation.
The performance of the array over time is shown in
figure 6.11.
There is some initial slowdown as contours are demand-loaded. As soon
as the working set of pages related to parsing is loaded performance
quickly reaches the maximum. When the operating characteristics of
the program change paging begins and the performance is variably
degraded for the remainder of the run.
In reality the execution time of the performance degraded run
is longer than the 1.9 seconds of actual execution time. Page
configuration loads become a significant factor, extending
the run time to over 4 seconds.
Adjusting the working set graph for the effects of paging results
in figure 6.12. The earlier graph
(figure 6.11) showed the working set as
if executed on an infinite logic array. Peaks where the working set
is large pass quickly as large working sets don't degrade performance
on an infinite logic array. This graph shows the effect of limiting
the size of the logic array to a level where paging occurs -
execution takes significantly longer and the points where execution
is slowed the most are also those where the working set is largest.
Execution is almost as fast where the working set is small however,
for instance at around the 2.0s mark which corresponds to the 1.0s
area in figure 6.11. This confirms that
large working sets cause performance degradation, as expected.
The performance graph is similarly extended. In this case the
points of lowest performance tend to be stretched as the paging system
struggles to load in a new working set
(figure 6.13).
Figure 6.14
shows the amount of array space used against time.
Starting at zero when execution begins, it raises and then plateaus
during parsing. When code generation starts however the remainder
of the array soon fills and some contours begin to be paged out to make
room for new contours. Notice how the theoretical maximum of 4096
is never reached as fragmentation prevents the array from being
fully utilised.
The rate at which pages are loaded during execution is shown in
figure 6.15.
The maximum rate of about 45,000 page reconfigurations
per second is frequently approached, sometimes for extended periods
as large amounts of reconfiguration occur.
Page unloads on the other hand are less directly limited by the
hardware since they need not be physically copied out of the array,
only displaced by a new configuration or partially displaced.
Figure 6.16
shows where pages are removed from the array, peaks mostly
corresponding to peaks in page loading except at the start where the
array is still filling up.
Contour switching is plotted against time in
figure 6.17. Where the
array is still underutilised during parsing contour switching is very
effective in preventing page loads. Contour switching is high but page
loads are very low. Later, when the paging system is struggling to
keep the working set in configuration store switching is used sporadically
but often configurations must be obtained from off the array so switching
rates drop correspondingly.
Figures 6.18
and 6.19
show the rate of contour loading and unloading
over time. These mostly agree with the page load graphs, except that
in places where the page loads peak out the rate of contour loading
tends to drop. This implies that the places where performance is worst
affected by page loading are where the contours being loaded tend to
be large and have many pages. This is reasonable - very large contours
tend to require a large number of small contours to be paged out to make
space, which in turn causes these contours to need to be paged back in
again, causing further performance degradation down the line.
Simulation indicates that Switchblade's virtual logic system
is viable. A limited set of tests shows that the system performs as
expected from such a system and even when presented with tasks which
are more traditionally associated with general-purpose processors it
is capable of performing well.
A relatively simple paging algorithm performed acceptably, degrading
gracefully when resource limits were approached. More research into
paging algorithms for this two dimensional logic resource would almost
certainly result in improved performance, as would synthesis tools
capable of keeping maximum contour sizes to a lower level.
The choice of array size for Switchblade turns out to have
been a fortuitous one. The array size of 64k logic blocks, or 1024 pages with
8 configuration stores appears to be an excellent compromise between
array size and performance.
Switchblade's virtual logic system permits logic arrays to
be regarded in a whole new light. No longer are they restricted to
executing a single task at a time under the support provided by an
external general purpose processor. No longer must a program be
hand-crafted to work with a logic array of a specific size.
Switchblade programs run just like programs in a modern
general purpose computer - they are shielded from issues like the size
of the logic array and the other programming running in the system.
The virtual logic system achieved several goals, none of which has
been achieved on a logic array processor before:
Figure [6.2 - "The die layout showing virtual routing hardware"]
6.5.4 Addressing
6.5.5 Message passing
6.5.6 End-to-end flow control
Figure [6.3 - "A long buffering stream"]
6.5.6.1 Downward buffering
6.5.6.2 Upward buffering
6.5.7 Contour table
6.5.8 Mutual exclusion
6.5.8.1 Shared access
6.5.8.2 Fault read ports
6.5.8.3 Deadlock avoidance
6.5.9 Faults
6.5.9.1 Fault masks
6.5.9.2 Fault vectors
6.5.9.3 Saving state
6.5.9.4 Halting contours
6.5.10 Selectable destinations
6.5.11 Extensible operating system functionality
6.5.11.1 FIFOs
6.6 Security and access control
6.6.1 Capabilities
6.6.2 Capability addresses
6.6.3 Intrusion prevention
6.6.4 Message port access
6.6.5 Optimisation
6.6.5.1 Location-dependent addressing
6.6.5.2 Flow control with location dependent capabilities
6.6.5.3 Short addresses
6.6.5.4 Wired routes
6.6.5.5 Sub-addressing
6.6.6 Non-transferable capabilities
6.6.7 Restricted access capabilities
6.7 Sequential execution
6.7.1 Sequential processor architecture
6.8 Performance estimate of the virtual logic system
6.8.1 Paging algorithm
6.8.2 Sample application
Figure [6.6 - "Performance of various configurations"]
Figure [6.7 - "Page load overhead in various configurations"]
Figure [6.8 - "Contour switching in various configurations"]
Figure [6.9 - "Contour usage in gcc (refer to text for explanation)"]
Contour
(each pixel
corresponds
to two
contours)
time -»
Figure [6.10 - "Working set of gcc"]
Figure [6.11 - "Performance of gcc"]
Figure [6.12 - "Working set graph adjusted for slowdown"]
Figure [6.13 - "Performance graph adjusted for slowdown"]
Figure [6.14 - "Array space used over time"]
Figure [6.15 - "Page loads over time"]
Figure [6.16 - "Page unloads over time"]
Figure [6.17 - "Contour switching over time"]
Figure [6.18 - "Contour loads over time"]
Figure [6.19 - "Contour unloads over time"]
6.8.3 Experimental conclusions
6.9 Conclusions