The ITRS process roadmap and nextgen embedded multicore SoC design - Embedded.com

The ITRS process roadmap and nextgen embedded multicore SoC design

Driven by general macro trends such as Internet everywhere, IP everywhere and Seamless Mobility, the International Technology Roadmap for Semiconductors (ITRS), in its 15-year assessment of semiconductor technology requirements, projects that as technologies and structures push the limits of Moore's Law and productivity, new semiconductor approaches to scaling and new functionality on- and off-chip will be required. Figure 1, below, shows macro trends for cellular technology.

The semiconductor technologies that will be required fall into three broad categories: “Moore” (Geometric Scaling), “More Moore” (Equivalent Scaling) and “More than Moore” (Functional Diversification). All three will have a significant impact on the embedded networking space, with new system-on-chip architectures that make extensive use of:

1) Multi-Core (MC),
2) Cache hierarchy,
3) On-chip fabric,
4) On demand Accelerator Engine (AE), and
5) Connectivity,

all engineered to provide a scalable, software-based Multi-Core/Accelerator Engine SoC (SOC-MC/AE) solution that targets a wide range of applications, from ultra low-end to high-end, and that preserves and extends the user experience through new services.

Figure 1. Macro trends for cellular technology.

The three “Moore's”
As technologies and structures push the limits of Moore's Law and productivity, the ITRS initiated the concept of “More Than Moore,” which first appeared in the 2005 ITRS publication and calls for the integration of functionality that does not scale. This is mostly analog functionality, but it also includes passives, high voltage, sensors, actuators and enablement.

During the ITRS summer conference, an overall definition was introduced that groups three aspects of the “Moore” concept:

Moore: Geometric Scaling
More of Moore: Equivalent Scaling
More Than Moore: Functional Diversification

“Moore's Law” is mostly focused on geometric scaling: the continued shrinking of the horizontal and vertical physical feature sizes of on-chip logic and memory in order to improve density (cost-per-function reduction), performance (speed, power) and reliability for applications and end customers.

“More of Moore” is about equivalent scaling, which occurs in conjunction with, and also enables, continued geometric scaling. It covers non-geometrical process techniques that affect the electrical performance of the chip. The third element, “More Than Moore,” is about functional diversification.

“More Than Moore” refers to the incorporation into devices of functionalities that do not necessarily scale according to “Moore's Law,” but that provide additional value to the end customer in different ways.

The “More Than Moore” approach typically allows non-digital functionalities (e.g., RF communication, power control, passive components, sensors, actuators, third-party IP/enablement) to migrate to a system board-level, package-level (SiP) or chip-level (SoC) potential solution.

There is an increasing tendency to put more functions on a chip that do not scale according to the same pattern [as defined in Moore's Law]. This is functional diversification rather than scaling, but it is part of the same business and the same technology.

The combination of “Moore's Law” and “More Than Moore” enables the creation of system-on-a-chip and system-in-a-package designs and, as such, adds value to systems rather than just integrating more of the same functions on a chip.

Functional diversification in SoC design
The ITU-R is currently studying user demand predictions for future systems, such as the amount of traffic from the year 2010 onwards, in order to calculate the spectrum bandwidth required for the future development of IMT-2000 and IMT-Advanced.

The IMT-2000 (International Mobile Telecommunications) systems are 3rd generation mobile systems, which provide access to a wide range of telecommunication services supported by the fixed telecommunication networks (e.g., PSTN/ISDN/IP), and to other services that are specific to mobile users. Among the key features of IMT-2000 are:

1) Capability for multimedia applications within a wide range of services and terminals
2) High degree of commonality of design worldwide
3) Compatibility of services within IMT-2000 and with the fixed networks
4) High quality
5) Worldwide roaming capability, and
6) Small terminal suitable for worldwide use

The next 5-15 years will also mark trends towards:

1) Scalable networks that deliver rich multimedia content at broadband speed anywhere, anytime and on any device;
2) Markets in which the consumer will play a major role in creating rich multimedia content;
3) Emergence of advanced IP-based applications and services that drive high-bandwidth scalable networks;
4) Complex multi-processing platforms equipped with multi-core/multi-threading and accelerators that support advanced applications and services;
5) Advancement in process technology from 65nm through 45nm, 32nm and 22nm to sub-10nm technology;
6) Scalable encryption and antivirus everywhere in the network;
7) Home networking that becomes a complex network converging data communications and entertainment;
8) Seamless mobility in the home, in the office/vertical market and on the road

In contrast with PC & Server Applications, and due to the fundamental difference between core speeds and memory/IO latencies, today's embedded processor architectures are unable to deliver meaningful performance for the connected computing scenarios outlined earlier.

Nearly every commercially available integrated general-purpose processor shipping in volume today is designed using a single-threaded architecture, which is performance and application limited by today's standards. As applications become more and more network-centric, this legacy processor design approach fails to address the throughput requirements of today's converging compute and networking paradigm.

This evolving packet-oriented environment is characterized by high memory access latencies, which are not effectively managed by conventional processor architectures. This weakness can severely impact processor performance and workload efficiency. When a memory access cannot be serviced immediately and no additional instructions are ready to be executed, conventional processors stall and waste valuable processing cycles.
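The cost of such stalls can be estimated with a back-of-the-envelope model. The sketch below is illustrative only: the cycle counts are made-up assumptions rather than measurements, and `core_utilization` is a hypothetical helper. It assumes each thread alternates between a burst of compute cycles and a memory access of fixed latency, and that a core can switch to another ready thread at no cost.

```python
def core_utilization(compute_cycles: int, stall_cycles: int, threads: int = 1) -> float:
    """Fraction of cycles a core spends doing useful work.

    Assumes each thread runs `compute_cycles`, then waits `stall_cycles`
    for memory, and that switching between ready threads is free.
    """
    if compute_cycles <= 0 or stall_cycles < 0 or threads < 1:
        raise ValueError("invalid parameters")
    busy = threads * compute_cycles          # work available per period
    period = compute_cycles + stall_cycles   # one compute-then-wait round
    return min(1.0, busy / period)


# Hypothetical numbers: 20 cycles of work for every 80-cycle memory access.
single = core_utilization(20, 80)             # one thread: core mostly idle
multi = core_utilization(20, 80, threads=4)   # four threads hide more latency
print(f"single-threaded: {single:.0%}, four threads: {multi:.0%}")
```

Under these assumed numbers, a single-threaded core is idle most of the time, which is the behaviour that multi-core and multi-threaded designs try to avoid.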

SOC-PE Consumer & SOC-MC/AE Networking Architectures

Adding “More of Moore” to the mix (Table 1, below) provides a converged, integrated heterogeneous platform (Figure 2, below) that makes possible the creation of a scalable, intelligent, compact value-add ecosystem. This SoC-PE/Platform implementation, based on the scaling achieved by the use of the three Moore's, is becoming an important paradigm moving forward.

Table 1. SOC-MC/AE Networking Platform and Moore's classification

Early in 2005, the ITRS introduced the SOC-PE Architecture Template, where a PE is a processor customized for a specific function. It targets portable and wireless applications such as smart media-enabled phones or digital camera chips, but also high-performance computing and enterprise applications.

To complement this SOC-PE architecture, a Multi-Core/Accelerator Engine SoC Architecture template is defined to address the networking embedded space, as shown in Figure 2 below. As shown, the MC/AE SOC networking platform contains the necessary building blocks to:

1) Support Multi-Core (MC) for high processing performance within a 30-watt power envelope

2) Support an unprecedented Tri-level Cache Hierarchy, with back-side L2 caches, multiple L3 shared caches and multiple memory controllers

3) Support high-speed inter-connectivity

4) Introduce a scalable On-chip Fabric for concurrent, non-blocking, hardware-based, 100% cache-coherent platform connectivity, which scales to more than 32 cores and supports heterogeneous cores

5) Eliminate shared bus contention and support dramatically higher address issue bandwidth to “feed” multiple cores

6) Include an On-demand Acceleration Engine (AE) that offers performance advantages over pure core processing cycles, enables lower power implementations and reduces silicon area and cost

7) Support a Hybrid Simulation Environment combining cycle accuracy and functional accuracy, which eases software development, performance prediction and optimization

8) Provide Network/System Enablement & Ecosystem support, looking into software partitioning and virtualization that leverage the multi-core hardware architecture

Figure 2. A Multi-Core/Accelerator Engine SoC Architecture template defined to address embedded networking apps

The MC/AE SOC network platform contains the necessary building blocks to provide a scalable, software-based solution. It addresses a wide range of applications, from ultra low-end to high-end, that preserve and extend the user experience through new services.

Multi-Core (MC). Core frequency in a wide multi-core product will be targeted at over 1 GHz. This platform targets the highest instructions-per-cycle (IPC) and the highest frequency for a given watt per area.

The MCs are also designed to offload repetitive and compute-intensive operations to high-performance acceleration blocks, freeing processing cycles for higher throughput or new services and applications.

Each MC core in the platform will have its own L2 backside cache. A backside cache is connected to the CPU through a direct channel, enabling extremely high application performance.

It allows the cache to match the full speed of the CPU, resulting in latency improvements of well over 50 percent compared with “shared bus/shared-cache” architectures. The L2 backside cache also enables tuning the contents of the cache between instruction and data according to different application needs, easing partitioning and improving performance by drastically reducing CPU stalls.

In addition, the L2 backside cache reduces traffic on the on-chip fabric and main memory, which reduces latencies and improves bandwidth for other users of the fabric and system memory.

Multithreading and multiprocessing are closely related. Indeed, one could argue that the difference is one of degree: whereas multiprocessors share only memory and/or connectivity, multithreaded processors share those but also share instruction fetch and issue logic, and potentially other processor resources.

In a single multithreaded processor, the various threads compete for issue slots and other resources, which limits parallelism. Some “multithreaded” programming and architectural models assume that new threads are assigned to distinct processors, to execute fully in parallel.

Cache Hierarchy. Recognizing the limitations of existing processors that rely on a shared cache model, a new approach calls for incorporating a three-tiered cache hierarchy into the MC Networking Platform. L1 cache is retained on the core.

As previously mentioned, L2 cache is attached to the cores as a backside implementation that can significantly improve performance. Each core has its own back-side L2 cache, which provides:

1) An aggregate bandwidth that could never be sustained by a single shared cache

2) Latency improvements vs. a front-side (shared) cache

3) Tuning of policies by core(s) according to different working sets, for easier implementation of performance, isolation, priority and QoS

4) A more self-contained private cache (vs. a single shared cache) that can serve as a natural unit for resource management (e.g., powering off to save energy)

However, there are some tasks for which a shared cache is desirable, such as inter-processor communication and operating on shared data structures. For those instances, the platform also provides a multi-megabyte L3 cache. This high-bandwidth, shared cache maximizes hit rates while providing fast memory access for input/output (I/O) and accelerator blocks.
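The effect of a three-tiered hierarchy on latency can be approximated with the textbook average memory access time (AMAT) recurrence, in which each level contributes its hit latency plus its miss rate times the cost of the next level. The sketch below is a minimal illustration; all latencies and miss rates are invented placeholders, not figures for any real platform.

```python
def amat(levels, memory_latency):
    """Average memory access time in cycles.

    `levels` lists (hit_latency, miss_rate) pairs from L1 outward;
    a miss at the last level goes to main memory.
    """
    total = memory_latency
    # Work inward from main memory: cost(level) = hit + miss_rate * cost(next).
    for hit_latency, miss_rate in reversed(levels):
        total = hit_latency + miss_rate * total
    return total


# Hypothetical cycle counts and miss rates, for illustration only.
L1 = (2, 0.10)
L2_BACKSIDE = (10, 0.40)
L3_SHARED = (30, 0.50)
MEMORY = 200

two_level = amat([L1, L2_BACKSIDE], MEMORY)
three_level = amat([L1, L2_BACKSIDE, L3_SHARED], MEMORY)
print(f"L1+L2: {two_level:.1f} cycles, L1+L2+L3: {three_level:.1f} cycles")
```

With these assumed values the shared L3 lowers the average access time because it intercepts half of the L2 misses before they reach main memory.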

On-Chip Fabric. The on-chip fabric works in concert with the caching hierarchy to enable cache-coherent and concurrent accesses. The innovative backside cache implementation, combined with the fabric, is designed to enable data replication, modified intervention and full hardware coherence tracking.

The MC Networking Platform will employ a highly scalable and modular on-chip fabric, the result of multi-year research and development, which enables cache-coherent, concurrent, low-latency connectivity among cores.

Unlike a shared bus used as the interconnecting medium among cores, memory and peripherals, the on-chip fabric helps to reduce the bus arbitration and contention issues that other multi-core architectures face as more traffic is introduced into the system. It behaves like a mesh, allowing concurrent traffic to enter and exit the system from any point within the fabric rather than through a single point.

Inherently scalable, the fabric is designed to sustain multiple, fully coherent transactions every cycle and to expand easily to accommodate more cores. The on-chip fabric also supports the option of heterogeneous clustering, allowing a full portfolio of MCs, spanning a wide range of power and performance design points, to be mixed and matched in a product with full coherency among the cores.
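The difference between a single shared bus and a fabric that accepts concurrent traffic can be illustrated with a toy scheduling model. This is only a sketch under simplifying assumptions: every transfer takes one cycle, the bus carries one transfer per cycle, and the fabric is modeled as an idealized crossbar in which transfers proceed together as long as no two share a source or destination port. The port names are hypothetical, and the model ignores coherence traffic entirely.

```python
def bus_cycles(transfers):
    """A shared bus serializes everything: one transfer per cycle."""
    return len(transfers)


def fabric_cycles(transfers):
    """Idealized crossbar: transfers with distinct ports run concurrently."""
    pending = list(transfers)
    cycles = 0
    while pending:
        busy_src, busy_dst = set(), set()
        deferred = []
        for src, dst in pending:
            if src in busy_src or dst in busy_dst:
                deferred.append((src, dst))   # port conflict: wait a cycle
            else:
                busy_src.add(src)
                busy_dst.add(dst)
        pending = deferred
        cycles += 1
    return cycles


# Hypothetical traffic pattern between cores, L3 banks and memory controllers.
traffic = [
    ("core0", "l3_a"), ("core1", "l3_b"),
    ("core2", "mem0"), ("core3", "mem1"),
    ("core0", "mem0"), ("core1", "l3_a"),
]
print(bus_cycles(traffic), "cycles on a shared bus")
print(fabric_cycles(traffic), "cycles on the idealized fabric")
```

In this model the gap between the two grows as more independent source and destination pairs are added, which mirrors the scalability argument made above.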

Connectivity. The MC Networking Platform integrates an extensive set of networking and I/O resources to support its high-throughput architecture. This section focuses on these resources, which provide system designers with a wide range of choices for scalable, high-performance systems.

SOC-MC/AE Networking Platform Interfaces and building blocks
The SOC-MC/AE Networking Platform supports multiple interfaces, including RGMII, XGMII and an SPI-4.2 interface controller. Additional high-speed interfaces include PCI-X and serial RIO interfaces.

Peripherals Interface. Peripheral devices and ROMs are connected to the MC Networking Platform through the various ports of the Peripherals Interface. The ports are created with different combinations of a 32-bit Peripherals I/O Bus and the programmable General-Purpose Input/Output (GPIO) signals.

The MC Networking Platform has essential standard buses, such as standard I2C bus ports, each of which consists of two bidirectional bus lines: the Serial Data (SDA) line and the Serial Clock (SCL) line.

On-Demand Accelerator Engine (AE). On-demand acceleration provides Accelerator Engine (AE) technologies that take the MC networking architecture to a new level of performance and flexibility. An asynchronous, shared-resource architecture enables lower latency and multi-task handling without the overhead of thread switching.

On-demand application acceleration offers performance advantages over pure core processing cycles, enables lower power implementations and reduces silicon area, thus reducing cost. On-demand, high-performance Accelerator Engine (AE) technologies include:

1) Pattern matching for deep packet inspection and full content processing
2) Decompression/compression to unpack data for inspection and pack it for delivery
3) Crypto security for confidentiality, integrity and authentication
4) Table lookups for packet parsing and flow classification
5) Data path resource management to efficiently allocate on-chip resources
6) Packet distribution and queue management
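Flow classification and packet distribution, two of the functions listed above, are often explained in software terms as hashing a packet's 5-tuple and using the result to pick a per-core queue, so that all packets of one flow stay in order on one core. The sketch below shows that general idea only; it is not a description of how any particular accelerator engine works, and `FiveTuple`, `select_core` and `distribute` are hypothetical names.

```python
import zlib
from collections import namedtuple

FiveTuple = namedtuple("FiveTuple", "src_ip dst_ip src_port dst_port proto")


def select_core(flow: FiveTuple, num_cores: int) -> int:
    """Map a flow to a core with a deterministic hash of its 5-tuple."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_cores


def distribute(packets, num_cores):
    """Place each (flow, payload) packet on the queue of its flow's core."""
    queues = [[] for _ in range(num_cores)]
    for flow, payload in packets:
        queues[select_core(flow, num_cores)].append(payload)
    return queues


# Hypothetical traffic: two flows interleaved on the wire.
web = FiveTuple("10.0.0.1", "10.0.0.9", 40000, 80, "tcp")
dns = FiveTuple("10.0.0.2", "10.0.0.9", 53000, 53, "udp")
packets = [(web, "w1"), (dns, "d1"), (web, "w2"), (web, "w3"), (dns, "d2")]

for core, queue in enumerate(distribute(packets, num_cores=4)):
    print(f"core {core}: {queue}")
```

Hashing keeps per-flow ordering without shared state, at the cost of uneven load when a few flows dominate the traffic.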

Hybrid Simulation Environment
The SOC-MC/AE networking platform will require a full system simulation model: a hybrid that combines cycle-accurate modeling technology with functional modeling technology to ease software development, performance prediction and optimization of customer applications for the MC networking platform.

Using the hybrid simulation environment, which allows easy switching between functional and cycle-accurate models, developers will be able to migrate and partition operating systems, middleware and applications onto the virtualized MC networking platform for development, debugging and benchmarking, even prior to silicon availability.

The environment also enables safe and easy experimentation with partitioning, parallelizing and optimizing systems and applications. Software developers can perform “what if” scenarios and tune the performance for specific situations without real-world hardware constraints. The hybrid simulator provides a programmer's view of the hardware and features the following elements:

1) A fast, functional model for the MC networking platform
2) A detailed cycle-accurate model of the MC networking platform
3) A comprehensive package with infrastructure and tools for software development, code partitioning and debugging, profiling and visualization
4) Visibility into system state, both architectural and micro-architectural, including caches, registers and pipelines
5) Run-time control of executing software, including breakpointing, stepping and reverse execution
6) Ability to boot multiple operating systems

A major advantage of a hybrid simulator is its ability to dynamically switch back and forth between a high-speed functional mode and a more detailed cycle-accurate mode.

This allows software developers to quickly boot an operating system and execute code up to critical points, and then switch to the more detailed cycle-accurate mode to analyze specific areas of interest, with no more waiting days for results.
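The mode-switching idea can be sketched with a toy simulator that always executes instructions functionally but only accounts for timing while in cycle-accurate mode. Everything here is hypothetical: the `HybridSimulator` class, the opcode names and the per-opcode cycle costs are invented for illustration and do not correspond to any real simulation product.

```python
class HybridSimulator:
    """Toy model: fast functional mode, detailed cycle-accurate mode."""

    MODES = ("functional", "cycle_accurate")

    def __init__(self, cycle_cost):
        self.cycle_cost = cycle_cost   # opcode -> cycles (assumed values)
        self.mode = "functional"
        self.instructions = 0          # counted in both modes
        self.cycles = 0                # counted only in cycle-accurate mode

    def switch(self, mode):
        if mode not in self.MODES:
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode

    def run(self, program):
        for opcode in program:
            self.instructions += 1
            if self.mode == "cycle_accurate":
                self.cycles += self.cycle_cost[opcode]


# Hypothetical workload: a long boot phase, then a short region of interest.
sim = HybridSimulator({"alu": 1, "load": 3, "store": 2})
sim.run(["alu"] * 1000)              # fast-forward through boot
sim.switch("cycle_accurate")
sim.run(["load", "alu", "store"])    # measure only the code of interest
print(sim.instructions, "instructions,", sim.cycles, "measured cycles")
```

The design choice illustrated is that timing detail is paid for only where it is needed, which is what makes fast-forwarding through an operating system boot practical.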

As a development platform for multi-core systems, the hybrid simulation environment is designed to enable extensive flexibility and experimentation in a non-invasive environment: no instrumentation is needed in the operating system or application. Software developers are able to decrease bring-up time for the target system while improving the overall quality of their code.

The MC/AE Enablement Ecosystem (EE)
MC/AE Networking platforms will require software engineers to spend significantly more time thinking about software architecture. Exploiting the performance potential of MC processors means embracing parallel processing, which can be a challenge given the long and successful history of single-core systems that are largely self-synchronizing.

Networking applications offer coarse-grained parallelism in the form of packet processing, and the interactions between a networking data path and the control plane are sufficiently decoupled to create an additional level of parallelism.

While this immediate parallelism is easy to envision, things get interesting when the performance requirements of a data path flow exceed a single CPU's capabilities, or when a single core can't provide sufficient control plane responsiveness. Load balancing and mixed asymmetric/symmetric multi-processing environments on the same device are challenges that the MC Networking Platform is designed to address.

While software architects are thinking about distribution of tasks, the processing densities offered by the MC Networking Platform will cause hardware architects to think about consolidation and re-partitioning of functions that have been distributed across discrete CPUs or modules.

These decisions will interact strongly with the introduction of new services and capabilities in the system. For both software and hardware architectures, there is a need for a great deal of flexibility in a multi-core processor and for good mechanisms to help facilitate experimentation with future architectures.

The SoC-MC/AE networking platform implements cores, each with its own private L2 cache, also known as a backside cache. In addition, the platform is equipped with an on-demand accelerator engine that can be application specific.

While the Multi-core Platform is designed with aggressive performance targets, ease of use has also figured prominently in the platform definition. One of the significant obstacles in multi-core implementations today is programming efficiency and debugging. The two most likely scenarios (shown in Figure 3 below) are:

Scenario 1: The number of cores is normalized to 1 core in 2007, and system performance is normalized to 1 core in 2007.

In this scenario, system performance at 45nm is 3.6x the performance at 65nm, which requires 3.7 cores against 1 core at 65nm. Similarly, at 32nm, system performance is 13.5x with 7.5 cores, compared to 1 core at 65nm. The graph shows that performance is linear.

Scenario 2: The number of cores is normalized to 4 cores in 2007, and system performance is normalized to 4 cores in 2007.

In this scenario, system performance at 45nm is 14.7x the performance at 65nm, which requires 10.9 cores against 4 cores at 65nm. Similarly, at 32nm, system performance is 54x with 30 cores, compared to 4 cores at 65nm. The graph in Figure 3 below shows that performance is linear.
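One way to read these figures side by side is to divide each quoted performance multiple by the quoted core count. The short sketch below does only that arithmetic on the numbers given in the two scenarios; the `perf_per_core` helper is hypothetical, and the ratio is an informal reading aid rather than a metric defined by the ITRS.

```python
# Figures quoted in the text: (process node, cores, performance multiple).
SCENARIOS = {
    "scenario 1 (1-core baseline)": [("45nm", 3.7, 3.6), ("32nm", 7.5, 13.5)],
    "scenario 2 (4-core baseline)": [("45nm", 10.9, 14.7), ("32nm", 30.0, 54.0)],
}


def perf_per_core(cores, perf_multiple):
    """Quoted performance multiple divided by quoted core count."""
    return perf_multiple / cores


for name, points in SCENARIOS.items():
    for node, cores, perf in points:
        ratio = perf_per_core(cores, perf)
        print(f"{name}, {node}: {perf}x with {cores} cores -> {ratio:.2f}x per core")
```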

Figure 3. Two likely scenarios for the evolution of multicore SoCs as process technologies move from 65 down to 32 nanometer and below geometries.

The SOC-MC/AE Platform Value Proposition
Tomorrow's networking needs can no longer be met by increasing the operating frequencies of single-core architectures. Adding cores (MCs) will improve performance (Geometric Scaling).

But thermal management challenges in the embedded space are overwhelming the performance improvements achievable by increasing CPU frequency. Hence the need to look at the challenge from the SOC Platform perspective.

There may be contention for bus bandwidth and memories, scalability problems and, perhaps even worse, unused processing cycles due to a lack of programming visibility.

Adding Accelerator Engines (AEs) will continue to add incremental improvement to performance (Equivalent Scaling) in the context of the SOC-MC/AE networking platform. But leveraging the hardware requires greater investment in software enablement and the simulation environment (Functional Diversification).

Thus, the SOC-MC/AE Networking Platform is designed not only to provide superior performance and energy efficiency, but also to help make the transition to multi-core processors as quick and as painless as possible with an industry-leading enablement ecosystem.

Multi-Core (MC), Accelerator Engine (AE) and Simulation/Enablement/Ecosystem (SEE) are the three ingredients that will change the landscape of networking and deliver scalable, sustainable performance to meet next-generation advanced applications and services.

Fawzi Behmann is the Chair of the Marketing Committee of Power.org, the open community driving collaborative innovation around Power Architecture technology. Fawzi is working with member companies to advance the roadmap of the Power architecture and ecosystem. He also chairs the networking System Drivers Working Group at the ITRS (International Technology Roadmap for Semiconductors) and is presently contributing to the definition of a networking platform that will address a new class of advanced applications and services in the coming 10-15 years. He has also been the Director of Strategic Marketing for the Networking Systems Division within Freescale's Networking and Multimedia Group.

