As transistor counts and clock frequencies increase, distributing a low-skew global clock becomes increasingly difficult. Iyer and Marculescu studied GALS-based microprocessors and concluded that they could gain power advantages by allowing fine tuning of the supply voltages and clock speeds for different functional blocks and by eliminating the need for a global, low-skew clock.
Crossing clock domains is the central problem in GALS designs. If the data for a flip-flop or latch comes from another timing domain, it can violate the setup and hold requirements. Such a timing violation can cause a metastable output, in which the voltage level may remain indeterminate for an unbounded length of time before settling to a valid level. However, the probability of metastability failures can be minimized by using synchronizer circuits, which can be as simple as one or more flip-flops connected in series. Figure 13 shows a common two-flip-flop synchronizer. Failure probability drops exponentially with settling time or, equivalently, with the number of flip-flops in the chain. Thus, properly designed synchronizers can provide mean times between failures (MTBFs) of millions of years or more.
Figure 13. A two-flop synchronizer, showing metastability: circuit (a) and timing diagram (b).
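The exponential dependence on settling time can be made concrete with the commonly used synchronizer model MTBF = exp(t_settle / tau) / (T_w * f_clk * f_data). The sketch below is only illustrative: the device constants (TAU, T_WINDOW, T_SETUP), the clock and data rates, and the assumption that each extra flip-flop adds roughly one clock period minus setup time of settling are all hypothetical choices, not figures from this essay.

```python
import math

def synchronizer_mtbf(t_settle, tau, t_window, f_clk, f_data):
    """Mean time between failures (seconds) for a given settling time."""
    return math.exp(t_settle / tau) / (t_window * f_clk * f_data)

def chain_settling_time(n_flops, f_clk, t_setup):
    """Assumed model: every flip-flop after the first contributes
    one clock period minus the setup time of settling."""
    return (n_flops - 1) * (1.0 / f_clk - t_setup)

# Assumed, illustrative parameters (not taken from the text).
TAU = 50e-12       # metastability resolution time constant, s
T_WINDOW = 20e-12  # metastability capture window, s
T_SETUP = 100e-12  # flip-flop setup time, s
F_CLK = 500e6      # receiving clock frequency, Hz
F_DATA = 50e6      # rate of asynchronous data transitions, Hz

mtbf_2 = synchronizer_mtbf(chain_settling_time(2, F_CLK, T_SETUP),
                           TAU, T_WINDOW, F_CLK, F_DATA)
mtbf_3 = synchronizer_mtbf(chain_settling_time(3, F_CLK, T_SETUP),
                           TAU, T_WINDOW, F_CLK, F_DATA)

SECONDS_PER_YEAR = 365.25 * 24 * 3600
print(f"2 flops: {mtbf_2 / SECONDS_PER_YEAR:.3g} years")
print(f"3 flops: {mtbf_3 / SECONDS_PER_YEAR:.3g} years")
```

Under this model, adding one flip-flop multiplies the MTBF by exp(t_stage / tau), which is why a short chain is usually sufficient.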
Advances in semiconductor technology, novel circuit techniques, and innovation in computer architecture have resulted in rapid improvements in microprocessor performance. Today, hundreds of millions of transistors are successfully harnessed to build these increasingly complex devices. However, during the next two or three generations, high-end microprocessor designers will face several major challenges. One of the biggest will be to keep power dissipation at reasonable levels. Higher clock frequencies and transistor counts have made power dissipation a major microprocessor design constraint, so much so that it threatens to limit both the amount of hardware that can be included on future microprocessors and how fast they can be clocked. Another impending limit will be the design of the global clock distribution, because of larger die sizes and higher clock speeds.
Distribution of a high-frequency global clock signal with low clock skew can be prohibitively expensive in terms of design effort, area, and power consumption under such circumstances. In addition, significant across-chip and across-wafer parameter variations are added sources of concern for future microprocessors. In such an environment, globally asynchronous, locally synchronous (GALS) designs provide several benefits through their use of separate, autonomous units:
The capability to independently configure each domain to execute at frequency/voltage settings at or below the maximum values. This allows domains that are not executing operations critical to performance to be configured at a lower frequency, and consequently, a GALS microarchitecture has the advantage that power can be saved.
Elimination of the need for careful design and fine tuning of a global clock distribution network. Through local clock generation units, the problem of dealing with clock distribution can be confined into several smaller domains. For example, the impact of parameter variations on clock skew will be confined within a domain, and thus will require less design effort and cost for dealing with clock skew.
The ability for each domain frequency to track parameter variations. Each domain can statically run at a different frequency (increasing the effective average maximum frequency) by tracking the variations from Vdd noise, L, and temperature. For example, if one of the domains is one sigma slow, its frequency can be lowered while the other domains run at a relatively higher frequency.
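The power benefit of per-domain configuration can be illustrated with the standard first-order model for dynamic power, P = a * C * Vdd^2 * f. The following is a minimal sketch under assumed numbers: the domain names, capacitances, voltages and frequencies are hypothetical and only show how slowing a non-critical domain reduces total power.

```python
def dynamic_power(c_eff, vdd, freq, activity=1.0):
    """First-order dynamic power: activity * C * Vdd^2 * f (watts)."""
    return activity * c_eff * vdd ** 2 * freq

def total_power(domains):
    """Sum the dynamic power of all clock domains."""
    return sum(dynamic_power(d["c_eff"], d["vdd"], d["freq"])
               for d in domains.values())

# Hypothetical domains, all at the maximum voltage/frequency setting.
baseline = {
    "fetch":   {"c_eff": 1.0e-9, "vdd": 1.0, "freq": 1.0e9},
    "integer": {"c_eff": 1.5e-9, "vdd": 1.0, "freq": 1.0e9},
    "fp":      {"c_eff": 2.0e-9, "vdd": 1.0, "freq": 1.0e9},
}

# Same design, with the (assumed non-critical) FP domain slowed down.
scaled = {name: dict(cfg) for name, cfg in baseline.items()}
scaled["fp"]["vdd"] = 0.8
scaled["fp"]["freq"] = 0.5e9

saving = 1.0 - total_power(scaled) / total_power(baseline)
print(f"total power saving: {saving:.1%}")
```

Because the voltage enters quadratically, lowering both voltage and frequency in one domain saves more than lowering the frequency alone.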
Globally asynchronous, locally synchronous (GALS) systems combine the benefits of synchronous and asynchronous systems. Modules can be designed like modules in a globally synchronous design, using the same tools and methodologies. Each block is independently clocked, which helps to alleviate clock skew. Connections between the synchronous blocks are asynchronous.
Early work on GALS systems introduced clock stretching or pausing. When data enters a synchronous system from an asynchronous environment, registers at the input are prone to metastability. To avoid this, the arrival of data is indicated by an asynchronous handshaking protocol. When data arrives, the locally generated clock is paused: in practice, the rising edge of the clock is delayed. Once data has safely arrived, the clock is released, so data is latched with zero probability of metastability on the datapath. Later work used mutual exclusion (ME) elements to arbitrate between the clock and incoming requests, which helped to eliminate metastability, and introduced asynchronous wrappers: standard components that can be placed around synchronous modules to provide the handshake signals and make them GALS modules.
The local clock generator is constructed from an inverter and a delay line, similar to an inverter ring oscillator. The problem with using inverters alone as a delay line is that it is difficult to tune the clock period accurately, because process variations and temperature affect the delay. Hence accurate delay lines have been developed that are capable of maintaining a stable clock frequency. These use a global reference clock for calibration. One of these designs can use either standard cells or full custom blocks for the tunable delay and was shown to maintain a frequency within 1% of the chosen value. To make the clock pausable, an ME element is added to the ring as shown in Figure 14(a). This arbitrates between the rising edge of the clock and an incoming request. Hence the clock is prevented from rising while the input registers are being enabled by the request, and metastability is prevented. For each bundle of data, a port controller, a request and an ME element are required. Only when all of the ME elements have been locked out by the clock is the rising clock edge permitted to occur.
Fig. 14. Pausable clock
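The arbitration described above can be sketched as a small behavioural model. This is a simplification under stated assumptions, not the circuit of Figure 14: each request is assumed to have its own ME element, a request that arrives before the attempted rising edge wins arbitration and holds its ME element for a fixed `hold` time, and the ring restarts its period from the edge that actually occurred. The function name and parameters are hypothetical.

```python
def pausable_clock_edges(period, requests, hold, n_edges):
    """Return the first n_edges rising-edge times of a pausable clock.

    period   : nominal clock period of the ring oscillator
    requests : arrival times of asynchronous requests
    hold     : time a granted request keeps its ME element locked
    """
    edges = []
    pending = sorted(requests)
    next_edge = period  # time at which the clock next tries to rise
    while len(edges) < n_edges:
        # Every request that arrives before the attempted edge wins its
        # arbitration; the edge must wait until all of them are released.
        while pending and pending[0] < next_edge:
            arrival = pending.pop(0)
            next_edge = max(next_edge, arrival + hold)
        edges.append(next_edge)
        next_edge += period  # the ring restarts from the actual edge
    return edges

# A request at t=8 with hold=5 stretches the edge that was due at t=10.
print(pausable_clock_edges(period=10, requests=[8], hold=5, n_edges=3))
```

In this model no rising edge can fall inside the interval in which a granted request is still enabling the input registers, which is the property the ME element is there to guarantee.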
As die size and clock frequency increase, the synchronous methodology is challenged by the clock skew problem. Asynchronous systems remove the clock signal and thereby avoid the problems it causes. However, their design is complex. Intermediate solutions may be found between the fully synchronous and fully asynchronous methodologies, namely the globally asynchronous locally synchronous (GALS) methodology. A GALS system consists of synchronous blocks communicating with each other asynchronously. To simplify the design of such systems, synchronous blocks are enclosed in asynchronous wrappers.
Surrounding a synchronous module with an asynchronous wrapper makes its external interface completely asynchronous. Each data vector entering or leaving a module is accompanied by a request-acknowledge pair of handshake signals (bundled data). To signal validity of data we use a four-phase protocol with the so-called broad interpretation. This means that a handshake cycle comprises four sequential events (Req+, Ack+, Req-, Ack-) and data are guaranteed to be valid between Req+ and Ack-. Keeping the internal organization of the wrapper modular allows asynchronous wrappers to be assembled from a library of predesigned modules. This makes the construction of GALS systems fast and safe. Our library contains pausable clock generators and port controllers, where the latter can be subdivided into two families: demand-ports (D-ports) and poll-ports (P-ports).
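The four-phase protocol with broad interpretation can be expressed as a small trace checker. The sketch below is an illustration only: the trace format, the "Data" pseudo-event and the function name are assumptions made for the example and are not part of the wrapper library described above.

```python
FOUR_PHASE = ("Req+", "Ack+", "Req-", "Ack-")

def run_handshakes(trace):
    """Check a trace of (event, value) pairs against the four-phase
    protocol and return the data values that were transferred.

    A "Data" event changes the bundled data; under the broad
    interpretation this is only allowed outside Req+ .. Ack-.
    """
    phase = 0  # index of the next expected handshake event
    data = None
    transferred = []
    for event, value in trace:
        if event == "Data":
            if phase != 0:
                raise ValueError("data changed while it must stay valid")
            data = value
        elif event == FOUR_PHASE[phase]:
            if event == "Req+":
                transferred.append(data)  # data is valid from here on
            phase = (phase + 1) % 4
        else:
            raise ValueError(
                f"unexpected {event}, expected {FOUR_PHASE[phase]}")
    if phase != 0:
        raise ValueError("handshake cycle not completed")
    return transferred

trace = [("Data", 0xA), ("Req+", None), ("Ack+", None),
         ("Req-", None), ("Ack-", None),
         ("Data", 0xB), ("Req+", None), ("Ack+", None),
         ("Req-", None), ("Ack-", None)]
print(run_handshakes(trace))
```

A trace that changes the data between Req+ and Ack-, or that reorders the four events, is rejected with a ValueError.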
The asynchronous wrapper can be realized in many different ways. Liljeberg et al. use a synchronous/asynchronous interface unit (SAInt unit) containing dual-port RAM as a FIFO queue. This structure is targeted at ULSI but can carry significant overhead for smaller applications. Muttersbach et al. use a hazard-free two-level AND-OR circuit for the handshake, which was also implemented on chip. Bormann and Cheung, like Muttersbach et al., use a modular wrapper, but with only the different ports as modules. Njølstad et al. describe a wrapper that also supports locally rate-adapted dynamic voltage scaling for the module.
Figure 15. Asynchronous wrapper and synchronous module.
Figure 15 shows an asynchronous wrapper and a synchronous module. Based on the rate requirement, the local power supply voltage can be scaled down so that the synchronous module processes data at the rate at which data arrive. Scaling the power supply voltage also scales the speed, i.e., the speed of the circuit depends on the power supply voltage (VDD). The synchronous module can be generated by a state-of-the-art synthesis tool and inserted into the wrapper, and the wrapper then converts between the voltage outside the wrapper and the voltage required by the synchronous module. From the designer's point of view, a simple and structured way to introduce the asynchronous wrapper into the system architecture is needed. The asynchronous interface should be modular to allow simple control of its features; modularity also allows the designer to choose which modules to include in the wrapper.
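Rate-adapted voltage scaling can be sketched as choosing the lowest supply voltage that still sustains the incoming data rate. The example below assumes the alpha-power-law delay model, in which the maximum frequency scales as (Vdd - Vt)^alpha / Vdd. The threshold voltage, the exponent, the list of available voltage levels and the function names are all hypothetical and are not taken from any of the wrappers cited above.

```python
def max_frequency(vdd, f_nominal, vdd_nominal, vt=0.3, alpha=1.5):
    """Estimated maximum clock frequency at supply voltage vdd,
    relative to a known operating point (f_nominal at vdd_nominal)."""
    def speed(v):
        return (v - vt) ** alpha / v
    return f_nominal * (speed(vdd) / speed(vdd_nominal))

def lowest_sufficient_vdd(required_rate, levels, f_nominal, vdd_nominal):
    """Return the lowest available voltage whose estimated maximum
    frequency meets required_rate, or None if no level is fast enough."""
    for vdd in sorted(levels):
        if max_frequency(vdd, f_nominal, vdd_nominal) >= required_rate:
            return vdd
    return None

# Assumed operating point and voltage levels, for illustration only.
LEVELS = [0.6, 0.8, 1.0, 1.2]
F_NOMINAL = 1.0e9    # Hz at the nominal voltage
VDD_NOMINAL = 1.2    # V

chosen = lowest_sufficient_vdd(0.5e9, LEVELS, F_NOMINAL, VDD_NOMINAL)
print(f"supply voltage for a 500 MHz data rate: {chosen} V")
```

A real wrapper would derive the required rate from the handshake activity on its ports; here it is simply passed in as a number.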
To minimise metastability and global clock skew issues, the GALS methodology uses separate synchronous modules, each running at its own local clock speed and communicating through asynchronous handshakes, i.e., the wrapper stretches or pauses the clock until the synchronous block is ready. Each set of data entering or leaving a synchronous region is accompanied by a pair of handshaking signals.
Figure 17.a: Asynchronous wrapper block diagram
A synchronous block has to communicate with other blocks, but when each block has its own optimised clock, direct communication is difficult because of clock skew between the blocks. A locally synchronous block therefore communicates with other blocks asynchronously while operating synchronously internally. The asynchronous wrapper provides an interface that allows heterogeneous systems to be constructed safely from synchronous and asynchronous elements. The asynchronous wrapper consists of the input control, the output control and the clock generator. These components are discussed below.
Input control block: The input control block consists of the request and acknowledge elements together with the data input. Its signals follow a 2-phase or 4-phase handshaking protocol. It also exchanges request and grant signals with the clock generator in order to release the clock. The whole operation is carried out through handshake communication built from Muller C-elements and mutual exclusion (ME) elements.
Clock generator: The clock generator is entirely under the control of the input and output port controllers. When an input controller asserts its request signal, the clock generator responds with a grant signal. The clock is then applied to the synchronous block, which latches the input data and performs its task. To stop the clock again, the output port sends a request signal; the clock generator pauses the clock and returns a grant signal, after which the output data is transferred to the output port. All of these operations follow handshaking protocols, which may be 4-phase or 2-phase, dual-rail or single-rail.
Figure 17.b: clock generation of the circuit
Output port: The output port is similar to the input port, with request and grant signals performing the asynchronous handshake. It sends a request signal to the clock generator to stop the clock, and once the clock has been stopped the clock generator returns a grant signal to the output port. The output port stores the output data and handles communication with other blocks through asynchronous handshaking, sending the output data of the locally synchronous block over an asynchronous communication channel, FIFO buffers, or a direct point-to-point connection.
FPGA IMPLEMENTATION OF GALS WRAPPERS
GALS wrapper circuits operate asynchronously, whereas FPGA circuits are designed for synchronous systems. Therefore, some considerations must be taken into account when mapping wrapper circuits onto FPGAs. The asynchronous design methodology used in this research is the Speed Independent (SI) method, and Petrify is our synthesis tool. Fortunately, the netlist generated by Petrify is in complex gate format, which can be mapped directly onto FPGA LUTs.
Complex gate circuits should have the following attributes:
They must be hazard-free, at least for a single input transition.
All input-to-output paths must have the same delay.
In most commercial FPGAs, such as Xilinx Virtex, these assumptions are valid for LUTs. Hence, we have assumed that LUTs can be considered as complex gates for the implementation of port controllers. The benefit of this approach is that the asynchronous circuit synthesized by Petrify can be mapped directly into LUTs with minimum overhead. To satisfy the isochronic fork requirement, we can use mapping and placement implementation constraints. In a pausable clock based wrapper, the LS module can be mapped into LUT blocks easily by contemporary CAD tools, but some rules must be observed when designing the clock generator and port controllers. The detailed FPGA implementation of the pausable clock GALS wrappers is described elsewhere.
The asynchronous port controllers and the gate-signal synchronizer circuit are the parts of the gated clock based GALS wrapper module that need some extra effort to be mapped into LUTs. Gated clock based port controllers can be mapped into LUTs by forcing placement and relational routing constraints, in the same way as stated for pausable clock port controllers. Figure 16 shows the LUT-mapped gated clock based input port controller. In practice, the gate-synchronizer circuit shown in Figure 4 is not mappable into FPGA LUTs. Therefore, we generate the final gate-signal from the g-signal (the port controller's output) according to Figure 17.
Figure 16. LUT implementation of gated clock based input port controller
Figure 17. FPGA implementation of gate-synchronizer circuit
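To illustrate what treating a LUT as a complex gate means, the sketch below tabulates a Muller C-element, one of the handshake components mentioned earlier, as a 3-input truth table whose third input is the fed-back output. The input ordering and the bit packing of the table are assumptions made for this example; it is not the netlist produced by Petrify and it does not model the delay or hazard behaviour of a real LUT.

```python
def c_element_next(a, b, q):
    """Muller C-element: the output follows the inputs when they agree
    and holds its previous value q when they differ."""
    return (a & b) | (q & (a | b))

def lut_init(func, n_inputs):
    """Pack a truth table into an integer: bit i holds the output for
    input pattern i, with the first argument as the least significant bit."""
    init = 0
    for i in range(2 ** n_inputs):
        bits = [(i >> k) & 1 for k in range(n_inputs)]
        init |= func(*bits) << i
    return init

def lut_eval(init, *inputs):
    """Look up the output bit selected by the input pattern."""
    index = sum(bit << k for k, bit in enumerate(inputs))
    return (init >> index) & 1

C_INIT = lut_init(c_element_next, 3)

# Drive the LUT through one cycle, feeding the output back as input q.
q = 0
outputs = []
for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
    q = lut_eval(C_INIT, a, b, q)
    outputs.append(q)
print(hex(C_INIT), outputs)
```

The output rises only when both inputs are high and falls only when both are low, which is the state-holding behaviour that the feedback path through the LUT has to provide.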