Looking for new SRAM options in embedded ASIC and SOC designs

Static RAM memory blocks based on traditional six-transistor (6T) storage cells have been the workhorse of developers of the ASIC/SoC implementations used in many embedded designs, since such memory structures typically fit right into the mainstream CMOS process flow and don't require any additional process steps.

As shown in Figure 1a below, the basic cross-coupled latch and active load elements form the 6T memory cell, and that cell can be used in memory arrays ranging in capacity from a few bits to multiple megabits.

The memory arrays can be designed to meet many different performance requirements depending on whether the designer opts to use a CMOS process optimized for high performance or low power. High-performance processes can yield SRAM blocks that have access times well below 5 ns in a 130 nm process, while low-power processes typically yield memory blocks that offer access times of 10 ns or slower.

Figure 1a: Typical six-transistor static RAM memory cell.

The static nature of the memory cell keeps the amount of support circuitry to a minimum; the designer needs only address decoding and enable signals to drive the decoder, sensing, and timing circuitry.

As feature sizes shrink with each more advanced process node, static RAMs built using the traditional six-transistor memory cells can deliver shorter access times and smaller cell size.

However, as feature sizes shrink, leakage currents and sensitivity to soft errors increase, and designers may have to add circuitry to reduce leakage and provide error-checking and correction capabilities to "scrub" the memory for soft errors.

Limitations of current 6T SoC RAM cells
However, the large size of the 6T cell, due to the six transistors used to form the latch and high-impedance loads, may limit the number of bits that can be economically implemented in the memory array.

That limitation is mostly due to the area consumed by the memory block and cell leakage at the technology process node (130, 90, or 65 nm) used to implement the chip design. As the total area of the memory arrays grows as a percentage of the overall chip area, the size, and thus the cost, of the chip will increase as well.

The leakage current may also exceed the total power budget or limit the use of 6T cells in portable devices. A larger or higher-leakage chip may not end up meeting the targeted price point for the application and thus may not be an economical solution.

Figure 1b: Typical single-transistor/single-capacitor dynamic memory storage cell.

1T alternatives to 6T RAM cells
There is an alternative for applications that require large amounts of on-chip storage (typically more than 256 kbits) but don't require the absolute fastest access time. The solution consists of memory arrays that work like SRAMs but are based on a one-transistor/one-capacitor (1-T) memory cell such as that used in dynamic RAMs (Figure 1b, above).

Such memory arrays can deliver two to three times the density in the same chip area as a 6T-based memory array. Simple dynamic RAM arrays can be used when embedded memory requirements exceed several megabits, but such arrays require that the system controller and logic be aware of the dynamic nature of the memory and take an active role in providing the refresh control and timing signals.

The alternative to embedding a simple DRAM memory block is to wrap the DRAM array with its own controller to make it appear like the simple-to-use SRAM array. By combining the high-density 1-T storage cells with some support logic that provides the refresh signals, the dynamic nature of the memory cells is hidden from the ASIC/SoC designer, and designers can treat the memory block as if it were a static RAM when implementing their ASIC and SoC solutions (Figure 2, below).
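
A minimal behavioral sketch in C shows the idea (the array organization and the arbitration policy below are illustrative assumptions, not a description of any vendor's macro): the wrapper services host reads and writes exactly as an SRAM would, and whenever the host is idle it steals the cycle to refresh the next row.

```c
/* Behavioral sketch of a 1-T DRAM array wrapped to behave like an SRAM.
 * The sizes and arbitration policy are illustrative assumptions, not a
 * description of any vendor's macro.                                   */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define ROWS 256                   /* hypothetical array: 256 rows       */
#define COLS 128                   /* ... of 128 bytes each              */

typedef struct {
    uint8_t  cell[ROWS][COLS];     /* the 1-T storage array              */
    uint32_t age[ROWS];            /* cycles since each row's refresh    */
    uint32_t next_row;             /* round-robin refresh pointer        */
} sram1t_t;

/* One clock of the wrapper: a host access (if any) wins the cycle;
 * otherwise the controller refreshes the next row transparently.       */
static void sram1t_cycle(sram1t_t *m, int host_req, int write,
                         uint32_t addr, uint8_t *data)
{
    for (uint32_t r = 0; r < ROWS; r++)
        m->age[r]++;                        /* time passes for every row  */

    if (host_req) {                         /* looks like a plain SRAM    */
        uint32_t row = addr / COLS, col = addr % COLS;
        if (write) m->cell[row][col] = *data;
        else       *data = m->cell[row][col];
        m->age[row] = 0;                    /* the access restores the row */
    } else {                                /* idle cycle: hidden refresh */
        m->age[m->next_row] = 0;
        m->next_row = (m->next_row + 1) % ROWS;
    }
}

int main(void)
{
    sram1t_t m;
    memset(&m, 0, sizeof m);

    uint8_t d = 0x5A;
    sram1t_cycle(&m, 1, 1, 1000, &d);       /* host write, SRAM-style     */
    sram1t_cycle(&m, 0, 0, 0, NULL);        /* idle cycle -> row refresh  */
    sram1t_cycle(&m, 1, 0, 1000, &d);       /* host read                  */
    printf("read back 0x%02X\n", d);
    return 0;
}
```

To the logic on the other side of the interface, nothing in this model looks different from a synchronous SRAM; the refresh bookkeeping is entirely internal.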

Some companies and foundries have developed 1-T cells that require additional mask layers in addition to the standard CMOS layers. Such an approach increases the wafer cost and is foundry-specific, thus limiting fabrication to a specific foundry. To justify the extra wafer processing cost, the total DRAM array size used in a chip must typically be more than 50% of the die area. Also, most of the offered DRAM macros are hard macros with limited size, aspect ratio, and interfaces.

Figure 2: The addition of control and interface support logic around a DRAM memory array makes the array appear to operate like a static RAM, thus delivering improved memory density.

What's required for SoC design is a more cost-effective IP macro that can easily be processed in any fab or transferred from one fab to another for cost or capacity reasons. That macro should also offer more flexibility to the ASIC designer when it comes to layout and configuration.

Such an approach, called 'one-transistor SRAM', is available for several foundries as licensable intellectual property. One such compiler-driven method is available in bulk CMOS with no additional mask steps, for 15%-20% lower wafer cost and faster time to market.

The resulting memory block interface looks just like a static RAM to the rest of the system, but achieves about two to three times the density (bits per unit area) compared with memory arrays based on the 6-T cells (after averaging in the support-circuitry overhead as part of the area calculation). The larger the memory array, the smaller the fraction of the overall area required by the support circuitry and the more area-efficient the memory block will be.
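
As a rough illustration of how that support-circuitry overhead is amortized, the short calculation below uses invented cell areas and a made-up fixed controller overhead (they are not process data) to show the density advantage approaching the raw cell-area ratio as the array grows:

```c
/* Toy calculation of effective density when a fixed support/controller
 * overhead is amortized over a 1-T array.  The cell areas and overhead
 * figure are invented for illustration, not process data.              */
#include <stdio.h>

int main(void)
{
    double cell_6t = 2.5;          /* um^2 per 6T bit (assumed)           */
    double cell_1t = 0.8;          /* um^2 per 1T bit (assumed)           */
    double ovh_1t  = 50000.0;      /* um^2 of refresh/control overhead    */

    for (double kbits = 64; kbits <= 4096; kbits *= 4) {
        double bits = kbits * 1024.0;
        double a6   = bits * cell_6t;               /* 6T array area      */
        double a1   = bits * cell_1t + ovh_1t;      /* 1T array + control */
        printf("%5.0f kbit: 6T %9.0f um^2, 1T %9.0f um^2, ratio %.2f\n",
               kbits, a6, a1, a6 / a1);
    }
    return 0;   /* ratio climbs toward cell_6t/cell_1t for large arrays  */
}
```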

To create the desired memory array, memory compiler tools such as MemQuest are available that allow designers to configure cooler, faster, or denser coolSRAM-1T configurations that are portable across foundries and technology nodes (Figure 3, below), thus avoiding non-recurring engineering fees for manual array implementation.

The compiler also enables customers to use the most optimum core size, interface, and aspect ratio with the shortest time to market, and provides designers with electrical, physical, simulation (Verilog and VHDL), test, and synthesis views of the memory array it compiles.
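
The snippet below is only a hypothetical sketch of the kind of knobs such a compiler exposes; it is not MemQuest's actual interface or syntax, and the parameter names (words, bits, col_mux, banks, target) are invented for illustration.

```c
/* Hypothetical sketch of memory-compiler parameters; NOT MemQuest's
 * actual interface or syntax.  All names and values are illustrative.  */
#include <stdio.h>

typedef enum { OPT_POWER, OPT_SPEED, OPT_AREA } opt_target_t;

typedef struct {
    unsigned     words;      /* e.g. 131072 for a 128-kword array         */
    unsigned     bits;       /* word width, e.g. 8                        */
    unsigned     col_mux;    /* column-mux factor, sets the aspect ratio  */
    unsigned     banks;      /* physical banks, for interleaving          */
    opt_target_t target;     /* cooler, faster, or denser instance        */
} mem_cfg_t;

int main(void)
{
    /* A 1-Mbit, 128k x 8 low-power instance (illustrative values). */
    mem_cfg_t cfg = { 131072, 8, 8, 1, OPT_POWER };
    printf("%u x %u-bit instance = %u kbits\n",
           cfg.words, cfg.bits, cfg.words * cfg.bits / 1024);
    return 0;
}
```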

Figure 3: The portable coolSRAM-1T was designed for extremely low-power operation through the use of adaptive circuit sizing, virtual grounding, adaptive back biasing, and other circuit techniques to lower leakage current. Furthermore, in the coolSRAM-1T cell structure, attention has been paid to minimizing junction and sub-threshold leakage current.

In a 1-Mbit memory array instantiation, for example, a coolSRAM-1T configuration has a leakage current of a few microamps at room temperature and typical corner specs for supply voltage and clock rate (Figure 3, above).

At a typical refresh rate of 100 kHz or less with a 128-kword-by-8-bit organization, the 1-Mbit coolSRAM-1T array has an idle power with data retention comparable to that of a similar-capacity SRAM. (A 1-Mbit instance of the coolSRAM-6T occupies an area of about 2.6 square millimeters and consumes less than 100 microwatts per megahertz when the memory block is fabricated in a 130 nm G process from TSMC.)

Although the SRAM-1T functions like an SRAM, it does have DRAM characteristics on the inside: at room temperature, when implemented in a 130 nm process, the memory cell can retain data for tens of milliseconds. The supporting refresh control logic transparently provides the refresh and will adjust the refresh period based on the temperature.
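
A sketch of that refresh-rate adjustment follows; the base retention figure and the rule that retention roughly halves for every 10 degrees C of temperature rise are assumptions for illustration, not characterized silicon data. The controller budgets a safety-margined fraction of the retention time and must visit every row within that budget.

```c
/* Illustrative temperature-compensated refresh scheduling.  The base
 * retention time and its temperature scaling are assumptions for the
 * sketch, not characterized silicon data.                               */
#include <math.h>
#include <stdio.h>

#define ROWS             1024      /* hypothetical number of rows         */
#define RETENTION_MS_25C 40.0      /* "tens of milliseconds" at 25 C      */
#define MARGIN           0.5       /* refresh at half the retention time  */

/* Refresh-clock frequency needed so every row is visited in time. */
static double refresh_clock_hz(double temp_c)
{
    double retention_ms = RETENTION_MS_25C *
                          pow(0.5, (temp_c - 25.0) / 10.0);   /* assumed  */
    double budget_s     = retention_ms * MARGIN / 1000.0;
    return ROWS / budget_s;                /* rows to refresh per second  */
}

int main(void)
{
    for (double t = 25.0; t <= 85.0; t += 20.0)
        printf("%3.0f C: refresh clock >= %8.0f Hz\n",
               t, refresh_clock_hz(t));
    return 0;
}
```

With these assumed numbers, the room-temperature answer lands in the same range as the 100-kHz-or-less refresh rate quoted above; as temperature (and leakage) rises, the controller shortens the refresh period accordingly.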

Designers can also opt to bypass the refresh controller in the memory array and provide their own refresh signals from the SoC logic if they want the SoC to manage the refresh. This can potentially save some dynamic power on the SoC, since the system logic can run the refresh on an "as-needed" basis rather than the "automatic" basis used by the SRAM-1T's embedded refresh logic.

The memory cells in the SRAM-1T instance also support sleep and standby modes. During sleep mode, the clock to a large percentage of the memory array is suppressed to drastically reduce power consumption.

When the array is "awakened," data must be reloaded into the memory cells. During standby mode, the memory retains data by using a low-frequency refresh operation that dissipates minimal power. When brought back to active mode, the memory is ready for use; data does not have to be reloaded into the memory array.

Designers can also configure the memory array to refresh in various row sizes (256, 512, 1024, or 2048 bits), or even refresh multiple rows in parallel. This allows the designer to provide selective refresh to only a portion of the array to keep critical data "alive" while powering down the rest of the array.
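
The arithmetic below is a small sketch of that selective-refresh option; the array organization, refresh row size, and size of the "critical" region are invented values. Only the rows holding critical data are visited during a standby refresh pass, and the rest of the array is simply allowed to decay.

```c
/* Illustrative standby refresh of only a critical region of the array.
 * The organization, row size, and critical-region size are invented.   */
#include <stdio.h>

#define ARRAY_BITS    (1024 * 1024)   /* 1-Mbit instance                  */
#define ROW_BITS      1024            /* configured refresh row size      */
#define CRITICAL_BITS (128 * 1024)    /* region to keep "alive"           */
#define PARALLEL_ROWS 2               /* rows refreshed per operation     */

int main(void)
{
    int rows_total    = ARRAY_BITS / ROW_BITS;
    int rows_critical = CRITICAL_BITS / ROW_BITS;
    int ops_per_pass  = (rows_critical + PARALLEL_ROWS - 1) / PARALLEL_ROWS;

    printf("%d of %d rows refreshed per pass (%d refresh operations)\n",
           rows_critical, rows_total, ops_per_pass);
    return 0;   /* the remaining rows are left unrefreshed to save power  */
}
```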

With any memory array there is always the chance that manufacturing variations will result in a bad bit or two in the memory array. Rather than discard the chip, our designers added both column and row redundancy schemes to enhance yield.

A built-in self-repair capability, used in conjunction with one-time-programmable coolOTP memory, can be employed to repair the memory array if bits fail once the chip has been shipped. Optionally available is a built-in self-test capability that can be added to the memory IP block with no performance degradation.

Figure 4: In a typical SoC design, wide internal memory buses can be used to rapidly transfer time-critical data for graphics and DSP operations.

When the basic performance of the memory array doesn't meet the system needs, there are some architectural techniques designers can use to achieve higher performance from the memory array. However, these techniques have a price: they impact chip power, size, and complexity, so a careful tradeoff analysis must be done to determine the optimum combination of memory array and chip architecture to achieve the desired performance and cost goals.

One available technique for chip architects would be to use a wide-word architecture, with the memory organized to deliver a 128-, 256-, or even 1024-bit-wide data word internally that is then multiplexed down to the desired word size (Figure 4, above).

This technique can double or quadruple the apparent clock rate, thus reducing the effective access time, and it can substantially reduce the power consumption. The penalty in this case might be the area impact on the IP design due to the demultiplexing logic needed to reduce the wide word down to the appropriate-sized words for the rest of the SoC to use.
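
A behavioral sketch of the wide-word approach follows (the 256-bit internal width, 32-bit host width, and sequential access pattern are illustrative choices): the array is cycled once to latch a wide internal word, and the mux then serves several consecutive host words from that latch, so the array itself runs at a fraction of the host rate.

```c
/* Behavioral sketch of a wide internal word multiplexed down to the
 * host word size.  The widths and access pattern are illustrative.     */
#include <stdint.h>
#include <stdio.h>

#define HOST_BITS 32
#define WIDE_BITS 256
#define RATIO     (WIDE_BITS / HOST_BITS)      /* 8 host words per fetch  */

static uint32_t array[1024];                   /* backing store (toy)     */
static uint32_t wide_latch[RATIO];             /* last wide word fetched  */
static int      latched_line = -1;
static int      array_cycles = 0;

static uint32_t wide_read(uint32_t host_addr)
{
    int line = host_addr / RATIO;
    if (line != latched_line) {                /* cycle the array once    */
        for (int i = 0; i < RATIO; i++)
            wide_latch[i] = array[line * RATIO + i];
        latched_line = line;
        array_cycles++;
    }
    return wide_latch[host_addr % RATIO];      /* mux selects host word   */
}

int main(void)
{
    for (uint32_t a = 0; a < 1024; a++) array[a] = a;

    uint32_t sum = 0;
    for (uint32_t a = 0; a < 64; a++) sum += wide_read(a);

    printf("64 sequential host reads, %d array cycles (sum=%u)\n",
           array_cycles, sum);                 /* 8 array cycles, not 64  */
    return 0;
}
```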

Figure 5a: Multiple memory instances (banks) can be interleaved by adding some additional control and timing circuits to double, triple, quadruple, etc. (depending on the number of banks) the data rate to the host processor.

Another option would be to split the memory into multiple instances (banks) and set up a memory controller to alternately access the instances in consecutive cycles so that some of the access time is hidden by switching between the banks (Figure 5a, above).

In a non-interleaved system, the memory subsystem must operate at the system clock speed, and that may slow down the system if the memory accesses can't keep pace with the clock (Figure 5b, below).

Figure 5b: In non-interleaved systems, the memory-bank access time limits the system clock speed when accessing the memory array.

However, in the interleaved memory approach, the clock frequency can be doubled, tripled, quadrupled, etc., depending on the number of banks. System complexity, though, increases considerably when more than two banks are interleaved.

In the case of a dual-bank system, the clock frequency can be double the maximum speed that each memory bank can handle, but since each instance is cycled at half of the clock frequency, the individual bank doesn't see the change in clock speed (Figure 5c, below).

Figure 5c: In an interleaved multibank system, the clock can run at a multiple (clock x number of banks) of the non-interleaved clock rate.

Rather, some global logic surrounding the memory banks runs at double the memory speed and steers the address information to each of the two banks on alternate clock cycles. This global logic can be shared among the multiple banks, thus saving area and power.

Additional logic at the data input/output port multiplexes or demultiplexes the data, delivering data at double the data rate to the host system and delivering data to the banks at half the incoming rate. The effective throughput of the memory subsystem is doubled, yet the active power is lower than that of a single block with twice the storage capacity.

Although this approach could cut the access time by close to 50%, it does come with the cost penalty of additional support circuitry and design/timing complexity. In this approach, the data access from the memory is typically delayed by one cycle (single-cycle latency access) and the access is quasi-random: the system cannot access the same internal bank every cycle.
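
A toy timing model of that quasi-random limitation for a two-bank system is sketched below (bank selection by the address LSB and a two-system-clock bank cycle time are assumptions of the model): requests that alternate banks issue every system clock, while back-to-back requests to the same bank force a stall.

```c
/* Toy timing model of a two-bank interleaved memory.  The system clock
 * runs at twice the bank rate; the bank-select rule and cycle counts
 * are assumptions of the sketch.                                        */
#include <stdio.h>

#define BANKS 2

static int busy_until[BANKS];     /* system-clock cycle when bank frees   */

static int access_cycle(int *clk, unsigned addr)
{
    int bank = addr & (BANKS - 1);          /* interleave on address LSB  */
    if (*clk < busy_until[bank])
        *clk = busy_until[bank];            /* stall: bank still cycling  */
    int issue = *clk;
    busy_until[bank] = issue + BANKS;       /* each bank needs 2 sys clks */
    (*clk)++;                               /* next request, next cycle   */
    return issue;
}

int main(void)
{
    unsigned alternating[] = { 0, 1, 2, 3 };   /* banks 0,1,0,1: no stall */
    unsigned same_bank[]   = { 0, 2, 4, 6 };   /* bank 0 every time       */
    int clk = 0;

    for (int i = 0; i < 4; i++)
        printf("alternating addr %u issues at cycle %d\n",
               alternating[i], access_cycle(&clk, alternating[i]));

    clk = 0; busy_until[0] = busy_until[1] = 0;
    for (int i = 0; i < 4; i++)
        printf("same-bank   addr %u issues at cycle %d\n",
               same_bank[i], access_cycle(&clk, same_bank[i]));
    return 0;
}
```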

Cyrus Afghahi, PhD, is CEO and co-founder, and Farzad Zarrinfar is president of Novelics. Prior to co-founding Novelics in 2005, Dr. Afghahi was Technical Director for the Office of the CTO at Broadcom Corporation. Prior to Novelics, Zarrinfar was Vice President of Worldwide Sales for Strategic Accounts at ARC International and is a board member for the GSPX DSP conference.
