Optimizing clock tree distribution in SoCs with multiple clock sinks

Alberto Ferrara and Pierpaolo De Laurentiis, STMicroelectronics

March 10, 2013

In the design of high-performance, high-speed integrated circuits, the organization of the clock tree is fundamental to distributing the clock signals across the whole area of an integrated circuit or a predefined part of it.

In this article we describe a structure and a method for propagating clock signals to a multiplicity of clock sink nets in a system-on-chip (SoC) design. We include an improved buffering and wiring apparatus that allows reduction of the number of clock stages, the overall latency, the clock skew, and uncertainty.

The problem of clock distribution from the root (PLL) to the sinks (flip-flops) is addressed in two phases: (1) optimal top-level distribution and (2) local, or block-based, clock distribution. A method for integrating the two phases within an automated flow is also described.

The role of clock trees in complex SoCs
The growing complexity of integrated circuit design is leading to several requirements for bringing a layout to completion. Modern technology nodes (32nm and beyond) are challenging because their reduced physical geometries introduce uncertainties due to local variation (random effects) and the impact of parasitics in terms of wire capacitances and resistances.

These variations are typically called on-chip variation (OCV). Two classes of variation must be considered in design: global and local. Global, chip-to-chip variations cause performance differences among dies and are modeled as operating corners. Local, on-chip variations cause performance differences among transistors within the same die and are modeled as an added derating factor in skew calculations.

OCV derating is calculated as a certain percentage of the total insertion delay. Consequently, in order to optimize performance of the clock tree, designers need also to take into account structures that are inherently not prone to OCV, while minimizing the overall latency. In this context one of the things that must be considered is the impact of automated methods for the computer-aided creation of a layout of an optimized clock tree circuit. This is of crucial importance because existing software tools tend to arrange the clock tree unfavorably for OCV and latency.
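To make the link between insertion delay and OCV concrete, the following minimal Python sketch applies a flat derating percentage to the launch and capture clock branches of a setup check. The function name, its parameters, and the optional common-path argument are illustrative assumptions, not a model taken from any specific timing tool.

```python
def ocv_skew_penalty_ps(launch_latency_ps, capture_latency_ps, derate,
                        common_path_ps=0.0):
    """Extra skew (ps) that a flat OCV derate adds to a setup check.

    Assumed model: the launch branch is slowed by (1 + derate) and the
    capture branch is sped up by (1 - derate). Any clock segment shared
    by both branches (common_path_ps) is left underated.
    """
    launch_branch = launch_latency_ps - common_path_ps
    capture_branch = capture_latency_ps - common_path_ps

    late_launch = common_path_ps + launch_branch * (1.0 + derate)
    early_capture = common_path_ps + capture_branch * (1.0 - derate)

    nominal_skew = launch_latency_ps - capture_latency_ps
    return (late_launch - early_capture) - nominal_skew


if __name__ == "__main__":
    # Hypothetical numbers: a 5% derate on a long tree and on a short tree.
    for latency in (1000.0, 500.0):
        penalty = ocv_skew_penalty_ps(latency, latency, derate=0.05)
        print(f"latency {latency:.0f} ps -> OCV penalty {penalty:.1f} ps")
```

Under this model the penalty scales linearly with the derated (non-shared) latency, so halving the insertion delay halves the uncertainty that OCV adds, which is the motivation for minimizing overall latency.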

Standard clock tree synthesis (CTS) engines are driven by timing closure and, hence, are not PVT (process/voltage/temperature) variation aware. They fix setup/hold violations by adjusting the clock skew: adding, removing, and swapping buffers, or exploiting different clock wire lengths and metal levels. As a result, the skew sensitivity with respect to PVT variations cannot be kept low, since it has several contributors originating from different physical phenomena.

Figure 1: Example of standard CTS tree

One approach to overcoming the impairment of clock skew in chip design, particularly in high-speed designs, is the clock mesh, shown schematically in Figure 2. The major difference between this method and the standard approach to clock tree synthesis (CTS) is that at a certain level of the tree, the drivers’ outputs are connected to the same metal net, called a mesh net. Shorting several clock drivers in this way produces an averaging and spatial smoothing effect, which reduces the clock skew among the different clock drivers.

Figure 2

Clock mesh technology produces a much lower clock skew than a conventional clock tree. Unfortunately, one issue is how to take full advantage of this technique in the standard design flow. Typically a full set of analog simulations is needed to evaluate the residual clock uncertainty on the mesh net before continuing with the timing analysis. A correct skew evaluation needs the layout to be frozen, and every alteration of the pre-mesh structure requires this out-of-the-flow characterization to be run again, making the overall design flow very long.

Another problem with a clock mesh, especially when conceived at top level, is the large amount of power consumed, which means that a dynamic power drop is likely.

Other techniques found in the literature, such as PLL/DLL de-skewing, are not suited for high precision, low uncertainty clock distribution, mainly due to added jitter of PLL/DLL circuitry.

Low uncertainty clock tree structure and method
The approach we have developed makes use of a low uncertainty clock tree for clock propagation in complex digital chip design. Many challenges have to be dealt with, such as very high frequency, low skew, and low insertion delay, in order to close many millions of timing paths robustly and keep the design cycle time short and predictable.

Figure 3 shows the typical floor plan of a large, complex chip, illustrating the two steps to accomplishing the overall clock distribution:

1st level clock tree (low uncertainty clock tree [LUCT])
This is a high-quality balanced tree from the top-level root clock net (PLL output) to an intermediate set of clock nets. Its purpose is to carry clock signals from the central PLL across the major part of the chip area. In a hierarchical floor plan it may bring clock signals to the input clock pins of blocks.

2nd level clock tree
This has to join the LUCT to the remaining part of the design. In a block this may be done with standard CTS CAE tools. However, when the flops are at the same hierarchical level as the LUCT, it is done after assigning each flop to one of the LUCT leaves.
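The grouping of top-level flops to LUCT leaves can be pictured with a short Python sketch. The article does not state the grouping criterion, so the nearest-leaf rule by Manhattan distance used here is an assumption, as are the function name and the instance names in the example.

```python
from collections import defaultdict


def group_flops_by_leaf(flops, leaves):
    """Assign each flop to its nearest LUCT leaf.

    flops  : dict mapping flop name -> (x, y) placement
    leaves : dict mapping leaf name -> (x, y) placement
    Distance is Manhattan, matching rectilinear routing (an assumption).
    Returns a dict mapping leaf name -> list of flop names.
    """
    groups = defaultdict(list)
    for flop, (fx, fy) in flops.items():
        nearest = min(
            leaves,
            key=lambda leaf: abs(fx - leaves[leaf][0]) + abs(fy - leaves[leaf][1]),
        )
        groups[nearest].append(flop)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical placements, in arbitrary layout units.
    leaves = {"leaf_sw": (0, 0), "leaf_se": (100, 0), "leaf_nw": (0, 100)}
    flops = {"ff_a": (10, 10), "ff_b": (90, 15), "ff_c": (12, 80)}
    print(group_flops_by_leaf(flops, leaves))
```

Each resulting group could then be handed to a local clock tree step rooted at its leaf; a production flow would presumably also cap the load per leaf, which this sketch does not attempt.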

Figure 3

Figure 3 shows that the top-level clock tree distribution can be defined regardless of how the clock distribution at subchip level is implemented. This methodology gives the designer the most flexibility in terms of multi-user or third-party macro (re)usability. Furthermore, this technique also allows the connection of registers placed at top level.

The design methodology of the LUCT can be divided into four main tasks:
  • Multi-driven net-based topology
  • Top level clock planning
  • Algorithm for LUCT
  • Automatic flow for timing analysis

Multi-driven net (MDN)
The multi-driven net, or MDN, is the basic structure of the presented design method. From a circuit-level point of view it is a strong buffer/inverter obtained by connecting m buffers in parallel (inverting or non-inverting, see Figure 4). These buffers can be taken from the standard clock library cells, making it unnecessary to create a brand new characterization or library/cell database.

From a layout point of view, there are two features to pay attention to:
  • The layout of the buffers/inverters, which are placed on different power rows to minimize power noise effects (Figure 5)
  • The wiring of the clock signals (Figure 6), which is embedded in the power/ground grid in order to exploit the high-level, low-resistance metal layers and the shielding of the existing power/ground wires, making possible a low-delay, crosstalk-immune structure.

MDN topology improves the maximum length of allowable clock wire, enhancing the distance/delay performance of standard clock tree routines. SPICE simulations of the different topologies (m=1, 2, ... , 5) are previously run to obtain the max clock wire length compatible
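The effect of the multiplicity m on the allowable wire length can be illustrated with a first-order Elmore delay model. This is only a stand-in for the SPICE characterization mentioned above: the lumped-driver assumption, the function names, and every numeric value (driver resistance, wire resistance and capacitance per micron, load, delay budget) are hypothetical.

```python
import math

# Illustrative, made-up technology values (not from any real library).
R_DRV_OHM = 1000.0        # output resistance of one clock buffer
C_LOAD_FF = 50.0          # lumped load at the far end of the wire
R_WIRE_OHM_PER_UM = 0.05  # thick upper-metal wire resistance
C_WIRE_FF_PER_UM = 0.2    # wire capacitance
OHM_FF_TO_PS = 1e-3       # 1 ohm * 1 fF = 1e-15 s = 1e-3 ps


def elmore_delay_ps(m, length_um):
    """50% delay of m paralleled buffers driving a distributed RC wire."""
    r_eff = R_DRV_OHM / m  # m identical drivers in parallel
    r_w = R_WIRE_OHM_PER_UM * length_um
    c_w = C_WIRE_FF_PER_UM * length_um
    delay = 0.69 * r_eff * (c_w + C_LOAD_FF) + 0.38 * r_w * c_w + 0.69 * r_w * C_LOAD_FF
    return delay * OHM_FF_TO_PS


def max_wire_length_um(m, delay_budget_ps):
    """Longest wire whose Elmore delay stays within the budget (0.0 if none)."""
    r_eff = R_DRV_OHM / m
    # Delay is quadratic in length: a*L^2 + b*L + c = 0 at the budget.
    a = 0.38 * R_WIRE_OHM_PER_UM * C_WIRE_FF_PER_UM * OHM_FF_TO_PS
    b = 0.69 * (r_eff * C_WIRE_FF_PER_UM + R_WIRE_OHM_PER_UM * C_LOAD_FF) * OHM_FF_TO_PS
    c = 0.69 * r_eff * C_LOAD_FF * OHM_FF_TO_PS - delay_budget_ps
    if c >= 0.0:
        return 0.0  # the load alone already exceeds the budget
    return (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)


if __name__ == "__main__":
    for m in range(1, 6):
        print(f"m={m}: max length {max_wire_length_um(m, 100.0):.0f} um")
```

In this simplified model the allowable length grows with m because the effective driver resistance falls as 1/m, while the wire's own RC term eventually dominates and limits the gain.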

The MDN buffers are characterized in terms of OCV performance based on their usage in the custom high-performance place-and-route algorithm. Main advantages are gained from:
  • Big transistor size, leading to reduced local variation
  • Use only within a very well balanced tree (in terms of cell load and wiring), leading to a global mismatch of about zero
  • Balance of transition times on the different levels, leading to a global mismatch of almost zero
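The reduction of local variation with driver strength can be sketched with a small Monte Carlo experiment in Python. The model is an assumption made for illustration only: each of the m paralleled buffers has an independent, normally distributed drive strength, and the stage delay is taken as inversely proportional to the total drive. The function name and the 5% per-buffer spread are hypothetical.

```python
import random
import statistics


def relative_delay_spread(m, sigma_drive=0.05, samples=20000, seed=1):
    """Standard deviation of the normalized delay of m paralleled buffers.

    Each buffer's drive strength is drawn independently from N(1, sigma_drive).
    Delay is modeled as m / total_drive, so the nominal delay is 1.0 for any m.
    """
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    delays = []
    for _ in range(samples):
        total_drive = sum(rng.gauss(1.0, sigma_drive) for _ in range(m))
        delays.append(m / total_drive)
    return statistics.pstdev(delays)


if __name__ == "__main__":
    for m in (1, 2, 5):
        print(f"m={m}: relative delay sigma = {relative_delay_spread(m):.4f}")
```

Because independent random variations average out, the spread in this model shrinks roughly as one over the square root of m, which is consistent with a larger effective transistor size showing less local variation.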

Figure 4: Multi Driven Topology for m=1, m=2 and m=5, respectively

Figure 5: Multiple inverter driver, with multiplicity 5, laid out in different power/ground rows

Figure 6: Higher and less resistive metals are used to propagate the clock (see also Figure 7)

Figure 7: Typical CMOS metal stack-up showing the difference between lower metals and higher, thicker metals
