Design Hint: Reduce the clock-tree power drag in your circuit implementation

In the modern era of high-speed VLSI design, a circuit's clock design plays a crucial role in determining chip performance and in facilitating timing and design convergence. Clock routing is important in the layout of a synchronous digital system because it influences the correctness, area, speed and power dissipation of the synthesized system.

Sharply increased requirements for high-performance, high-speed VLSI circuits have posed challenges for the design of high-speed clock networks, where minimizing clock delay and clock skew is a critical problem.

Buffers are widely used in the design of clock distribution networks. The power dissipated by the clock distribution network can be attributed to the charging and discharging of wiring and load capacitances through the interconnect and driver resistances, and to the static power dissipated, if any, by the buffers.
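A standard form of this relationship, consistent with the symbols defined next and offered as an assumption rather than as the authors' exact equation, is:

```latex
P = C V^{2} f + \sum_{i} P_{i}
```

The first term is the dynamic (switching) power of the clock network; the sum collects the static contributions.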

Here, Pi is the static power dissipated by the ith clocked component, C is the capacitance, and f and V are the frequency of operation and the voltage swing, respectively.
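As a numerical illustration of how these quantities combine, the following minimal sketch assumes the standard switching-power form C·V²·f plus the static terms. The function name and all numeric values are hypothetical and chosen only to show the arithmetic, not taken from any real design.

```python
def clock_network_power(capacitance_f, voltage_v, frequency_hz, static_powers_w=()):
    """Return clock-network power in watts.

    Assumes dynamic power = C * V^2 * f, plus the sum of the static
    power terms P_i of the clocked components.
    """
    dynamic = capacitance_f * voltage_v ** 2 * frequency_hz
    return dynamic + sum(static_powers_w)


# Illustrative values only: 100 pF of switched clock capacitance,
# 1.0 V swing, 500 MHz clock, no static contribution.
baseline = clock_network_power(100e-12, 1.0, 500e6)

# Same network after removing buffers and wire worth 10 pF (assumed).
trimmed = clock_network_power(90e-12, 1.0, 500e6)

saving_fraction = (baseline - trimmed) / baseline
print(f"baseline={baseline:.4f} W, trimmed={trimmed:.4f} W, "
      f"saving={saving_fraction:.1%}")
```

Because the dynamic term is linear in C, every picofarad of clock capacitance removed translates directly into a proportional power saving at a fixed voltage and frequency.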

Several methodologies have been adopted to minimize power consumption and clock skew. One of the more common techniques for reducing clock-tree power is clock gating, a well-known technique for reducing the dynamic power dissipation of a digital circuit. It saves power by shutting off the sequential elements and part of the clock network during idle states.

This design hint describes a way to reduce clock-tree power by using “an indigenous technique for identifying and removing the redundant clock-cells.” Apart from reducing circuit power, this methodology offers several other benefits, including:

1. Decreasing the cell count
2. Saving routing resources
3. Reducing the impact of on-chip variation (OCV)

Current strategy
The current clock tree synthesis strategy tries to build all leaf cells of a clock to the same latency and skew targets. This is done for two reasons:

1. Simple implementation.
2. Avoiding hold violations later in the design.

The drawback of this implementation is that it adds many extra clock buffers to the design. This redundant logic can be removed from the clock tree, reducing both power and cell count. A normally built clock tree is depicted in Figure 1 below.

Figure 1

Extra clock buffers not only occupy die area but also use up valuable routing resources. In addition, clock-tree power contributes nearly 40-45% of the total dynamic power in a chip, because clock nets toggle more often than any other nets in a design. Removing such instances saves routing resources, reduces dynamic power and reduces the amount of logic in the design.

A New Clocking Strategy
Quite a few of these buffers/inverters (shown in turquoise blue in Figure 2 below) can be removed while taking timing into account; that is, we need to make sure that the setup/hold timing through these clock logic cells is still met after their removal. Only then can they be termed redundant.

Figure 2.

The impact on design latency is also depicted. Comparing Figure 1 and Figure 2, we can see that in Figure 2 different groups of flops sit at different latency values. Table 1 below shows the difference in latency numbers between Figure 1 and Figure 2.

Table 1.

The strategy is based on identifying the clock buffers or inverter pairs at the last level. For example, in Figure 3 below, BUF2 is the last-level buffer. BUF1 is not included, since it has a buffer (BUF2) in its fanout.

Figure 3.

Similarly, the inverter pair is shown in Figure 4 below.

Figure 4.
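The last-level rule can be sketched in a few lines. The dictionary-based clock-tree model below is hypothetical, and its topology is only loosely modeled on the BUF1/BUF2 example (the flop names and the exact connectivity are assumptions). For brevity the sketch handles buffers only; an inverter pair could be treated as a single buffer-like unit in the same way.

```python
# Hypothetical clock-tree model: each cell maps to its kind and its fanout.
CLOCK_TREE = {
    "CLK":  {"kind": "source", "fanout": ["BUF1"]},
    "BUF1": {"kind": "buf",    "fanout": ["BUF2", "FF3"]},
    "BUF2": {"kind": "buf",    "fanout": ["FF1", "FF2"]},
    "FF1":  {"kind": "flop",   "fanout": []},
    "FF2":  {"kind": "flop",   "fanout": []},
    "FF3":  {"kind": "flop",   "fanout": []},
}


def last_level_buffers(tree):
    """Return the buffers that have no other buffer in their fanout."""
    found = []
    for name, cell in tree.items():
        if cell["kind"] != "buf":
            continue
        # A buffer is "last level" only if none of its loads is a buffer.
        if all(tree[load]["kind"] != "buf" for load in cell["fanout"]):
            found.append(name)
    return sorted(found)


print(last_level_buffers(CLOCK_TREE))
```

In this model BUF1 is skipped because BUF2 appears in its fanout, which mirrors the rule stated for Figure 3.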

The next step is to find the timing slack available (for both setup and hold) through the identified buffer or inverter pair. If sufficient slack is available, the buffer or inverter pair is removed from the design.
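One simplified way to express the slack check is sketched below. It rests on an assumption not spelled out in the article: removing a buffer of delay d makes the clock arrive d earlier at the downstream flops, which costs d of setup slack on paths those flops capture and d of hold slack on paths they launch. The class and field names are hypothetical, and a real flow would take these numbers from a static timing analysis tool.

```python
from dataclasses import dataclass


@dataclass
class BufferTiming:
    """Timing seen through one candidate buffer (all values in ns)."""
    delay: float                # delay through the candidate buffer
    worst_setup_capture: float  # worst setup slack on paths captured downstream
    worst_hold_launch: float    # worst hold slack on paths launched downstream


def is_redundant(timing, margin=0.0):
    """True if both slacks still meet the margin after losing the buffer delay.

    Simplification (assumed): paths whose launch and capture flops both sit
    below the buffer are unaffected and are not modeled here.
    """
    setup_after = timing.worst_setup_capture - timing.delay
    hold_after = timing.worst_hold_launch - timing.delay
    return setup_after >= margin and hold_after >= margin


# Illustrative numbers only.
print(is_redundant(BufferTiming(delay=0.05, worst_setup_capture=0.20,
                                worst_hold_launch=0.10)))
print(is_redundant(BufferTiming(delay=0.05, worst_setup_capture=0.20,
                                worst_hold_launch=0.03)))
```

The optional margin parameter leaves room for a guard band, for example to cover on-chip variation.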

After the identified buffers or inverter pairs are removed (for example, BUF2 in Figure 3), the algorithm can be run again, since BUF1 now becomes the last-level buffer. Such iterations are repeated 3-4 times to get a completely optimized clock tree.

Figure 5. Implementation flowchart.
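The iterative flow can be sketched as a loop that repeatedly finds last-level buffers, keeps those with enough slack, and reconnects their loads to the upstream driver. This is a minimal illustration under stated assumptions, not the authors' implementation: the tree model, the `has_slack` callback and the cell names are all hypothetical, and a real flow would re-run timing analysis after each change.

```python
def prune_clock_tree(tree, has_slack, max_iterations=4):
    """Iteratively remove last-level buffers that has_slack() approves.

    tree: dict of cell name -> {"kind": str, "fanout": list of names}
    has_slack: callable(name) -> bool, a stand-in for the setup/hold check
    Returns (pruned_tree, removed_names); the input tree is left untouched.
    """
    tree = {n: {"kind": c["kind"], "fanout": list(c["fanout"])}
            for n, c in tree.items()}
    removed = []
    for _ in range(max_iterations):
        candidates = [
            name for name, cell in tree.items()
            if cell["kind"] == "buf"
            and all(tree[load]["kind"] != "buf" for load in cell["fanout"])
            and has_slack(name)
        ]
        if not candidates:
            break  # nothing left to remove
        for name in candidates:
            loads = tree.pop(name)["fanout"]
            # Reconnect the removed buffer's loads to its driver.
            for cell in tree.values():
                if name in cell["fanout"]:
                    i = cell["fanout"].index(name)
                    cell["fanout"][i:i + 1] = loads
            removed.append(name)
    return tree, removed


example = {
    "CLK":  {"kind": "source", "fanout": ["BUF1"]},
    "BUF1": {"kind": "buf",    "fanout": ["BUF2", "FF3"]},
    "BUF2": {"kind": "buf",    "fanout": ["FF1", "FF2"]},
    "FF1":  {"kind": "flop",   "fanout": []},
    "FF2":  {"kind": "flop",   "fanout": []},
    "FF3":  {"kind": "flop",   "fanout": []},
}
pruned, removed = prune_clock_tree(example, has_slack=lambda name: True)
print(removed, pruned["CLK"]["fanout"])
```

With a callback that approves every buffer, BUF2 is removed in the first pass and BUF1 in the second, matching the description of BUF1 becoming the last-level buffer once BUF2 is gone.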

An advantage of this flow, shown in Figure 5 above, is that it can be run at various design stages, including:

1. Placement
2. Clock Tree
3. Routing
4. Cross-talk

Care should be taken in deciding where in the design cycle to apply this methodology: running it as early as possible may be best, or the design may be better served by running it further down the development cycle.

The main objective of this approach is to provide an improved clock tree synthesis methodology that reduces power consumption. To achieve this, the clock tree is built in a way that allows clock logic cells (buffers/inverters) that are not critical to design timing to be removed. The result is a modified low-power clock-tree netlist that satisfies the timing specifications.

Sunit Bansal is a senior design engineer at Freescale Semiconductor, focusing on physical design activities through timing closure (placement, routing, crosstalk, STA). He holds a bachelor's degree in electronics and communications from Delhi College of Engineering (Delhi, India).

Kapil Narula is working toward a master's degree at University of Carolina. Previously, he was a design engineer at Freescale Semiconductor, focusing on placement, routing and crosstalk. He completed a degree in electronics and electrical engineering at Thapar Institute of Engineering and Technology (Punjab, India).
