Optimizing clock tree distribution in SoCs with multiple clock sinks - Embedded.com

In the design of high-performance, high-speed integrated circuits, clock tree organization is fundamental to distributing the clock signals across the whole area of an integrated circuit or a predefined part of it.

In this article we describe a structure and a method for propagating clock signals to a multiplicity of clock sink nets in a system-on-chip (SoC) design. We include an improved buffering and wiring apparatus that allows reduction of the number of clock stages, the overall latency, the clock skew, and uncertainty.

The problem of clock distribution from the root (PLL) to the sinks (flip-flops) is addressed in two phases: (1) optimal top level distribution and (2) local or block-based clock distribution. A method for integrating the two phases within an automation system is also described.

The role of clock trees in complex SoCs
The growing complexity of integrated circuit design is leading to several requirements for bringing a layout to completion. Modern technology nodes (32nm and beyond) are challenging because their reduced physical geometries introduce uncertainties due to local variation (random effects) and the impact of parasitics in terms of wire capacitances and resistances.

These variations are typically called on-chip variation (OCV). Two classes of variation must be considered in design: global and local. Global chip-to-chip variations cause performance differences among dies and are modeled as operating corners. Local on-chip variations cause performance differences among transistors within the same die and are modeled as an added derating factor in skew calculations.

OCV derating is calculated as a certain percentage of the total insertion delay. Consequently, to optimize the performance of the clock tree, designers also need to consider structures that are inherently not prone to OCV, while minimizing the overall latency. In this context, one thing that must be considered is the impact of automated methods for the computer-aided layout of an optimized clock tree circuit. This is crucial because existing software tools tend to arrange the clock tree unfavorably for OCV and latency.
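
To make the link between latency and OCV concrete, here is a minimal sketch that assumes a flat derating percentage applied to the whole insertion delay, as described above. The function name and the numbers are hypothetical and do not come from any timing tool.

```python
def ocv_uncertainty_ps(insertion_delay_ps: float, derate_pct: float) -> float:
    """Uncertainty attributed to local variation, modeled as a flat
    percentage of the clock insertion delay (simplifying assumption)."""
    if insertion_delay_ps < 0 or derate_pct < 0:
        raise ValueError("delay and derate must be non-negative")
    return insertion_delay_ps * derate_pct / 100.0


# Hypothetical comparison: the same 8% derate applied to two tree latencies.
for latency_ps in (1200.0, 600.0):
    print(latency_ps, ocv_uncertainty_ps(latency_ps, 8.0))
```

Under this linear model, halving the insertion delay halves the derating-induced uncertainty, which is why latency reduction and OCV robustness are treated together.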

Standard Clock Tree Synthesis engines are driven by timing closure and, hence, are not PVT (process/voltage/temperature) variation aware. They are used to fix setup/hold violations by adjusting the clock skew, adding, removing, and swapping buffers, or exploiting different clock wire lengths and levels and so on. As a result, the skew sensitivity with respect to PVT variations cannot be kept low, since it has several contributors originating from different physical phenomena.

Figure 1: Example of standard CTS tree

One approach to overcoming the impairment of clock skew in chip design, mostly in high speed design, is the clock mesh, as schematized in Figure 2. The major difference between this method and the standard approach to clock tree synthesis (CTS) is that at a certain level of the tree, the drivers’ outputs are connected to the same metal net, called a mesh net. Shorting several clock drivers in this way produces an averaging and spatial smoothing effect, which reduces the clock skew among the different clock drivers.

Figure 2

Clock mesh technology produces a much lower clock skew than a conventional clock tree. Unfortunately, one issue is how to take full advantage of this technique in the standard design flow. Typically a full set of analog simulations is needed to evaluate the residual clock uncertainty on the mesh net before continuing with the timing analysis. A correct skew evaluation needs the layout to be frozen, and every alteration of the pre-mesh structure forces this out-of-the-flow characterization to be run again, making the overall design flow very long.

Another problem with a clock mesh, especially when conceived at top level, is the large amount of power consumed, which means that a dynamic power drop is likely.

Other techniques found in the literature, such as PLL/DLL de-skewing, are not suited for high precision, low uncertainty clock distribution, mainly due to added jitter of PLL/DLL circuitry.

Low uncertainty clock tree structure and method
The approach we have developed makes use of a low uncertainty clock tree for clock propagation in complex digital chip design. Many challenges have to be dealt with, such as very high frequency, low skew, and low insertion delay, in order to close many millions of timing paths robustly and make the design cycle time short and predictable.

Figure 3 shows the typical floor plan of a large, complex chip, illustrating the two steps to accomplishing the overall clock distribution:

1st level clock tree (low uncertainty clock tree [LUCT])
This is a high-quality balanced tree from the top-level root clock net (PLL output) to an intermediate set of clock nets. Its purpose is to carry clock signals from the central PLL across the major part of the chip area. In a hierarchical floor plan it may bring clock signals to the input clock pins of blocks.

2nd level clock tree
This has to join the LUCT to the remaining part of the design. In a block this may be done with standard CTS CAE tools. However, when the flops are at the same hierarchical level as the LUCT, it is done after assigning each flop to one of the LUCT leaves.

Figure 3

Figure 3 shows that the top level clock tree distribution can be defined regardless of how the clock distribution at subchip level is implemented. This methodology gives the designer the most flexibility in terms of multi-user or third-party macro (re)usability. Furthermore, this technique also allows the connection of registers placed at top level.

The design methodology of the LUCT can be divided into four main tasks:

  • Multi-driven net-based topology
  • Top level clock planning
  • Algorithm for LUCT
  • Automatic flow for timing analysis

Multi-driven net (MDN)
Multi-driven net, or MDN, is the basic structure of the presented design method. From a circuit level point of view it is a strong buffer/inverter obtained by connecting m buffers in parallel (inverting or non-inverting, see Figure 4). These buffers can be taken from the standard clock library cells, making it unnecessary to create a brand new characterization or library/cell database.

From a layout point of view, there are two features to pay attention to:

  • The layout of buffers/inverters, placed on different power rows to minimize power noise effects (Figure 5)
  • The wiring of clock signals (Figure 6), embedded in the power/ground grid in order to exploit high level, low resistance metal layers and the shielding of the existing power/ground lines, which makes possible a low delay, crosstalk-immune structure.

MDN topology improves the maximum length of allowable clock wire, enhancing the distance/delay performance of standard clock tree routines. SPICE simulations of the different topologies (m=1, 2, … , 5) are previously run to obtain the max clock wire length compatible
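
As a rough illustration of why paralleling drivers extends the usable wire length, the sketch below replaces the SPICE runs with a first-order Elmore delay model. All parameter values, the function names, and the delay-budget criterion are assumptions made for illustration; they do not come from the article or from any cell library.

```python
def elmore_delay_ps(length_um, m, r_drv_ohm=2000.0, c_self_ff=2.0,
                    r_wire_ohm_per_um=0.1, c_wire_ff_per_um=0.2,
                    c_load_ff=20.0):
    """First-order Elmore delay of m parallel drivers driving one wire.

    All default values are hypothetical placeholders.
    """
    r_eff = r_drv_ohm / m            # parallel drivers divide the drive resistance
    c_par = m * c_self_ff            # but each adds its own output capacitance
    c_wire = c_wire_ff_per_um * length_um
    r_wire = r_wire_ohm_per_um * length_um
    delay_fs = (r_eff * (c_par + c_wire + c_load_ff)
                + r_wire * (c_wire / 2.0 + c_load_ff))
    return delay_fs / 1000.0         # ohm * fF gives fs; convert to ps


def max_wire_length_um(m, budget_ps, hi_um=20000.0, tol_um=1.0):
    """Longest wire whose modeled delay stays within budget_ps (bisection)."""
    if elmore_delay_ps(0.0, m) > budget_ps:
        return 0.0
    if elmore_delay_ps(hi_um, m) <= budget_ps:
        return hi_um
    lo, hi = 0.0, hi_um
    while hi - lo > tol_um:
        mid = (lo + hi) / 2.0
        if elmore_delay_ps(mid, m) <= budget_ps:
            lo = mid                 # still within budget: try a longer wire
        else:
            hi = mid
    return lo


# Sweep the multiplicities mentioned in the text for an assumed 100 ps budget.
for m in range(1, 6):
    print(m, round(max_wire_length_um(m, 100.0)))
```

With these placeholder values the allowable length grows with m, but the model also shows the limit of the approach: each added driver contributes self-loading, so the benefit would shrink for larger multiplicities.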

The MDN buffers are characterized in terms of OCV performance based on their usage in the custom high-performance place-and-route algorithm. Main advantages are gained from:

  • Big transistor size, leading to reduced local variation
  • Use only within a very balanced tree (in terms of cell load and wiring), leading to a global mismatch of about zero
  • Balance of transition time on different levels leading to a global mismatch of almost zero

Figure 4: Multi Driven Topology for m=1, m=2 and m=5, respectively

Figure 5: Multiple inverter driver, with multiplicity 5, is laid-out in different power/ground rows.

Figure 6: Higher, less resistive metals are used to propagate the clock (see also Figure 7)

Figure 7: Typical CMOS metal stack-up showing the difference between lower metals and higher, thicker metals

Top level clock planning

Management of a hierarchical top level
Top level clock planning for the approach we describe requires several steps:

1. Floorplan database loading including possible plurality of partitions and hierarchical levels

  • Position of PLL or external clock source
  • Position of blocks in Top level
    • Block may be already laid out
    • Block clock tree may still need to be synthesized, in which case the LUCT is built up to the centroid of the block and CTS is left to feed clock signals to all the flops of that block. The LUCT can be pushed into the hierarchy of the block if needed for timing analysis of the block
  • Position and type of blockages
    • Placement blockage
    • Routing blockage

2. Definition of a set of leaf points for LUCT such that

  • The PLL output clock pin position (or external clock pin) is recorded as starting point of the LUCT
  • For each block that is already laid out and frozen from a design standpoint, the input clock pin position is added as leaf point of the LUCT
  • For each block to be synthesized, the centroid of the block is added as a leaf of the LUCT. If the block is too big, it is possible to further partition it into several areas and add one leaf point for each of those areas (Figure 8b)
  • For each portion of the Top Level Logic that contains a significant number of registers to be fed, a leaf point is added in a centroid fashion
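
The leaf-point rules above can be sketched as follows. This is a minimal illustration under stated assumptions: blocks are axis-aligned rectangles, "too big" is modeled by a hypothetical `max_side` threshold, and the `Block` type and function names are invented for the example.

```python
from dataclasses import dataclass
from math import ceil
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Block:
    name: str
    bbox: Tuple[float, float, float, float]   # (x0, y0, x1, y1)
    clock_pin: Optional[Point] = None          # set only for frozen blocks


def leaf_points(block: Block, max_side: float) -> List[Point]:
    """Leaf points contributed by one block.

    Frozen block: its input clock pin. Block to be synthesized: its
    centroid, or one centroid per sub-area if it exceeds max_side.
    """
    if block.clock_pin is not None:
        return [block.clock_pin]
    x0, y0, x1, y1 = block.bbox
    nx = max(1, ceil((x1 - x0) / max_side))    # sub-areas along x
    ny = max(1, ceil((y1 - y0) / max_side))    # sub-areas along y
    dx, dy = (x1 - x0) / nx, (y1 - y0) / ny
    return [(x0 + (i + 0.5) * dx, y0 + (j + 0.5) * dy)
            for i in range(nx) for j in range(ny)]


def centroid(points: List[Point]) -> Point:
    """Centroid of a group of top level registers."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


frozen = Block("ip_macro", (0, 0, 10, 10), clock_pin=(0.0, 5.0))
large = Block("cpu", (0, 0, 400, 200))
print(leaf_points(frozen, 200.0), leaf_points(large, 200.0))
```

A real flow would read pin positions and block outlines from the floorplan database loaded in step 1; the in-memory objects here only stand in for that data.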

3. Routing for LUCT
To optimize insertion delay and skew performance of the LUCT, it is important to note that the LUCT is allowed to feed through blocks whenever it is possible and beneficial to do so. Feed-through can be done with or without buffering, depending on the possible routing blockages (Figure 8a). Figure 9 shows different phases of the management of a hierarchical top level.

Figure 8a: Buffered feed-through; Figure 8b: Center of a block as leaf of the LUCT

Figure 9: Management of a hierarchical top level

Algorithm for low uncertainty clock tree
Input Data: Given a set of clock leaf points, clock root, a set of obstacles, timing and slew rate constraints, a library of drivers and layout rules for buffering and wiring according to MDN topology
Goal: implement an automatic flow for generating a top-level clock tree with the following requirements:

  • Blockage avoiding buffering: buffer insertion not allowed inside any obstacles
  • Same depth of level for any leaf point
  • Clock latency optimization
  • Clock skew optimization
  • Process variation optimization

The clock leaves and clock root points are defined as follows:

  • Clock leaves are a set of n points:
    • the clock pins of any blocks or the center of the blocks
    • other clock points at top level
  • Clock root point is the clock source pin (for example the output of the PLL).

An ST-proprietary algorithm is available for clock tree topology generation. It is based on balanced path lengths and equalized, homogeneous wire routing. It consists of the following steps:

  • Centroid Calculation
  • Searching the farthest point
  • Searching nearest neighbor
  • Forming pairs (S)
  • Merging segment equalization by length
  • Merging point determination
  • Driver Insertion
  • Wire Routing
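
Since the algorithm itself is proprietary and not detailed here, the following is only a generic bottom-up pairing heuristic that follows the first step names (centroid, farthest point, nearest neighbor, pairing, merging point). It is an assumption-laden sketch, not the ST method: it uses Euclidean midpoints as merging points, ignores obstacles, and omits segment equalization, driver insertion, and wire routing. All names are hypothetical.

```python
from dataclasses import dataclass, field
from math import hypot
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass(eq=False)                 # compare nodes by identity, not by value
class Node:
    pos: Point
    children: List["Node"] = field(default_factory=list)


def dist(p: Point, q: Point) -> float:
    return hypot(p[0] - q[0], p[1] - q[1])


def merge_level(nodes: List[Node]) -> List[Node]:
    """Pair up the nodes of one level and return their parent nodes."""
    n = len(nodes)
    c = (sum(a.pos[0] for a in nodes) / n, sum(a.pos[1] for a in nodes) / n)
    pending = list(nodes)
    parents = []
    while pending:
        far = max(pending, key=lambda a: dist(a.pos, c))   # farthest from centroid
        pending.remove(far)
        if not pending:
            # Odd node out: give it a single-child parent so that every
            # leaf still sees the same number of levels.
            parents.append(Node(far.pos, [far]))
            break
        near = min(pending, key=lambda a: dist(a.pos, far.pos))
        pending.remove(near)
        mid = ((far.pos[0] + near.pos[0]) / 2.0, (far.pos[1] + near.pos[1]) / 2.0)
        parents.append(Node(mid, [far, near]))             # merging point
    return parents


def build_tree(leaves: List[Point]) -> Node:
    if not leaves:
        raise ValueError("at least one leaf point is required")
    level = [Node(p) for p in leaves]
    while len(level) > 1:
        level = merge_level(level)
    return level[0]


def leaf_depths(node: Node, depth: int = 0) -> List[int]:
    if not node.children:
        return [depth]
    return [d for ch in node.children for d in leaf_depths(ch, depth + 1)]


tree = build_tree([(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)])
print(tree.pos, leaf_depths(tree))
```

Pairing from the outside in keeps the two branches of each merge of similar length, and the single-child parent preserves the "same depth of level for any leaf point" requirement when a level holds an odd number of nodes.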

Once the clock tree structure is generated, the layout implementation is achieved by scripted routines in place and route (P&R) CAD tool suites. Layout configuration is attained by the use of custom rules for high performance routing/wiring, via placing and power-noise aware placement of multiple parallel drivers.

Automatic flow for timing analysis
The turnaround time (TAT) for complex chips has become a serious issue for developers, and can be discouraging. But this can be mitigated by creating an automatic flow to report timing from the laid-out LUCT and write constraints for the remaining part of the design to be developed.

This automatic flow allows the user to analyze the exact properties of the built clock tree by means of SPICE simulation and to write constraints to be used in the subsequent top level timing analysis (10, high precision flow).

The flow can also be inserted directly into a fully automated system for timing analysis that exploits AOCV (Advanced On-Chip Variation) techniques. These allow the calculation of ad hoc optimized OCV values for the LUCT clock branches (fully automated flow), which is a very good trade-off between accuracy and TAT.
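
As a hedged illustration of the AOCV idea, the sketch below applies a depth-dependent derate instead of one flat percentage: deeper paths get a smaller factor because random stage-to-stage variations partly cancel. The table values and function names are hypothetical, and the distance dimension that AOCV tables normally also carry is left out.

```python
from bisect import bisect_right
from typing import List

# Hypothetical late-derate table: (path depth in stages, derate factor).
DEPTH_DERATE = [(1, 1.12), (2, 1.09), (4, 1.07), (8, 1.05), (16, 1.04)]


def aocv_derate(depth: int) -> float:
    """Derate for a path of the given depth.

    Uses the table entry at the nearest lower depth, which errs on the
    pessimistic side (no interpolation between entries).
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    depths = [d for d, _ in DEPTH_DERATE]
    idx = bisect_right(depths, depth) - 1
    return DEPTH_DERATE[idx][1]


def derated_path_delay_ps(stage_delays_ps: List[float]) -> float:
    """Late (pessimistic) delay of a clock branch after derating."""
    return sum(stage_delays_ps) * aocv_derate(len(stage_delays_ps))


# A four-stage branch with 50 ps per stage (placeholder numbers).
print(derated_path_delay_ps([50.0] * 4))
```

A flat-OCV analysis would apply the depth-1 factor to every branch; looking the factor up per branch is what removes pessimism from the deeper paths of the tree.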

Pierpaolo De Laurentiis received a degree in electronic engineering from the University of L'Aquila, L'Aquila, Italy. He has been working at STMicroelectronics since 2000, in analog/full-custom design, mainly in the high-speed interface field. Since 2010, he has been working in design methodology for digital applications, leading a team especially involved with timing analysis and clock distribution.

Alberto Ferrara received his degree in electronics engineering, signals and control systems, from University Politecnico of Milan, Italy, in 2006. Since then, he has been with STMicroelectronics, working in high-speed macro design and signal integrity analysis. In 2010, he joined the design methodology team, mainly involved in low uncertainty clock distribution architectures.
