Level: Intermediate
Sam Siewert (Sam.Siewert@Colorado.edu), Adjunct Professor, University of Colorado
04 Oct 2005
A system-on-a-chip (SoC) can provide a single-chip solution, lower power usage, better performance, more frugal use of board real estate, simpler integration, and lower part counts. Compared to multichip solutions, the SoC has huge advantages, but mistakes in sizing on-chip resources require re-spinning the ASIC, which is costly. This article introduces approaches for SoC design from a resource perspective. The SoC design concept has appeal in a broad range of computing applications, from supercomputing to embedded systems.
This article is the first in the SoC drawer series. The
series aims to provide the system architect with a starting point and
some tips to make system-on-a-chip (SoC) design easier. The goal for
an SoC is typically a single-chip solution; therefore, properly
sizing memory, I/O, and central processing unit (CPU) resources from the outset is critical.
By comparison, multichip solutions often include approaches for
resource sizing risk mitigation. For example, memory controllers for
external memory devices can support a range of parts and sizes. You can also add coprocessors to multichip solutions.
Resource enhancement always has associated cost, but for SoCs, the
cost is higher. This first article provides an overview of design
from a resource perspective; subsequent articles will drill down and
focus on specific methods to support this resource approach. The
SoC drawer series is intended to arm the architect with tools
and methods to get resource sizing right.
The emergence of SoC design and SoC-based architectures
What is an SoC? For the purposes of this series, I consider an ASIC to be in
this category if it includes CPU, memory, I/O, and an interconnection
between the three. As Wikipedia notes, "System-on-a-chip ... is an
idea of integrating all components of a computer system into a single
chip." The SoC has been talked about, marketed, and accepted in the
new millennium, especially for embedded applications. More recently,
with announcements of high-performance SoC designs such as the Cell
chip, and use of these chips in consumer products including the Sony
PlayStation 3 (PS3) and Microsoft® Xbox 360, it has become clear that SoC
designs will have broad impact. (For more information, see the press on the Cell
architecture this year in the Resources
section below.)
The initial use of SoC design has been focused on reducing part count
for embedded applications and tightening the integration of
processing, memory, and I/O resources with lower overall power
consumption and a smaller footprint. Along with IBM, Sony, and
Toshiba's unveiling of the Cell chip for the PS3, described in detail
at ISSCC this past February, other major announcements have shown
that SoC design has become a pervasive underpinning of many
architecture roadmaps for the future. For example, IBM has also
prototyped a blade server using Cell chips, though Cell experts like
Arnd Bergmann say that Cell-like architectures will remain in the
embedded and supercomputing application domains and won't likely show
up on home or office desktops. (See Resources for an interview with Bergmann.) What is really exciting about SoC
architecture is that supercomputing and embedded computing may become
the cutting edge of computer architecture. For supercomputing this is
nothing new, but embedded systems have often followed rather than led
architecture.
SoC architecture brings embedded and supercomputing closer together
The IBM Blue Gene®/W and Blue Gene/L recently put the U.S. back
in the lead in supercomputing, ahead of the Japanese Earth
Simulator, built by NEC and heralded as the fastest in 2002. Each
Blue Gene node is best described as an SoC, given the integration of
processing, cache, and interconnection networks with routing in a
single ASIC. Similarly, by the definition of an SoC, the recently
unveiled Cell chip embedded in the PS3 could even be considered a
multiSoC ASIC. Each Cell Synergistic Processor Element (SPE) in
fact integrates 256KB of local store memory, processing, direct memory access (DMA), a memory management unit (MMU),
and a bus interface, with eight SPEs in all integrated with the Power
Architecture™ technology-based Power Processor Element (PPE) in a single ASIC.
Any architect involved in embedded or supercomputing is most likely
already working with SoC designs or will be soon. While SoC design
might not presently be well suited to general-purpose computing
(GPC), the roadmap for the future of GPC also holds multiprocessor
designs on a single chip, and therefore includes aspects of SoC
design. Given the reconfigurability, broad range of I/O devices, and
whims of the GPC market, it's unlikely that general-purpose computers
themselves will be designed as SoCs, but clearly they will contain
chips and chipsets that are SoCs. It's not too risky to predict that
most ASICs will be SoCs at some point, and that the SoC is a natural
stage in the evolution of higher and higher integration. The degree
of SoC-ness in a design is based upon the ability to stand alone and
to provide services without requiring support from external chipsets
like external memory devices.
Figure 1. System software and firmware view: The resource cube
Figure 1 depicts the challenge facing firmware and software engineers
implementing software services on an SoC (or any system, for that
matter). The origin and the volume inside the green subspace define a
resource-rich situation where problems are easily solved with cycles,
bandwidth, and megabytes. The red subspace within the resource cube
defines a resource-constrained situation where significant effort
will be required to tune the system in order to meet timing and
throughput requirements for services. The resource cube does not
include additional dimensions (resources) such as power, pin count,
layout space, or cost. It only portrays the firmware/software view,
given that trade-offs have already been made in the hardware resource
space to define these three basic software resources.
Research on SoC architectures has led to interesting emergent
software and hardware management concepts like dynamic voltage scaling,
where software can modulate CPU clock rate and power consumption
based upon current computational needs. Likewise, hardware might
modulate the CPU to control heating and notify the software layer
that the clock rate has been reduced to handle the overheating. SoC
design might be expanding the hardware and software interface, but the
hardware design decisions defining this space are still not a bad
place to begin high-level architectural definition.
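The dynamic voltage scaling idea above can be sketched in a few lines. This is a minimal illustration, not a description of any particular SoC: the function names, the list of clock rates, and the voltages are all hypothetical. It assumes the workload is known in CPU cycles per period, and it uses the standard approximation that dynamic power scales with voltage squared times frequency.

```python
def pick_frequency(cycles_needed, period_s, freqs_hz):
    """Return the lowest available clock rate that finishes the work in time.

    cycles_needed: CPU cycles the workload needs each period.
    period_s: length of the period (the deadline) in seconds.
    freqs_hz: clock rates the hardware supports (hypothetical values).
    Returns None if even the fastest rate is too slow.
    """
    required = cycles_needed / period_s
    candidates = [f for f in sorted(freqs_hz) if f >= required]
    return candidates[0] if candidates else None


def relative_dynamic_power(freq_hz, volts, ref_freq_hz, ref_volts):
    """Dynamic power relative to a reference point, using P ~ V^2 * f."""
    return (volts / ref_volts) ** 2 * (freq_hz / ref_freq_hz)


# Hypothetical operating points: clock rate in Hz -> supply voltage.
operating_points = {200e6: 0.9, 400e6: 1.0, 800e6: 1.2}

# A service needing 3 million cycles every 10 ms requires at least 300 MHz.
freq = pick_frequency(3e6, 0.010, operating_points.keys())
power = relative_dynamic_power(freq, operating_points[freq], 800e6, 1.2)
print(freq, round(power, 3))
```

With these assumed operating points, the software layer would select the 400 MHz point rather than running flat out at 800 MHz, trading unused cycles for lower power.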
This introductory article looks at some basic hardware and software decisions that affect processing, memory, and I/O. Future articles drill into
individual aspects of the decision-making process and consider
hardware and software trade-offs for a given SoC architectural design
decision. In general, all computers provide services, which can
range from embedded services like digital control, to supercomputing
services like sequencing DNA. Processing is perhaps the most
carefully analyzed resource in most systems.
Processing resources and scheduling
How compute nodes or CPUs are scheduled depends upon the hardware
architecture and service requirements. Figure 2 provides a taxonomy
of scheduling methods for processing resources. It's not fully
exhaustive, but is fairly complete.
Figure 2. CPU scheduling taxonomy
SoC processors can host resource-management services
Since the 1970s, real-time systems have included the concept of an
admission policy for services (threads), where a new thread's
service requirements are analyzed relative to existing services to
determine if the new service will cause the existing services to miss
deadlines. Traditionally, this has been done offline, but a
dedicated processing resource could execute a rate monotonic
feasibility test online in an SoC design. Dynamic service admission
is a difficult problem to solve on a single CPU or non-SMT (simultaneous
multithreading) processor, since the test itself interferes with
running services.
Some of these CPU scheduling taxonomy methods are mostly of historic
interest, such as mainframe batch policies like Shortest Job Next
(SJN). Methods such as asymmetric off-loading and dynamic Least
Laxity First (LLF) or Earliest Deadline First (EDF) are, however, of great
interest to modern SoC services for media applications, including
video, audio, and game engines. Describing them all is far beyond the scope of this
article, but this taxonomy provides a context
for future articles discussing methods currently applied to SoC design.
As already noted, SoC design tends to blur traditional hardware and software interface lines, so an SoC architect might want to consider
hardware-supported scheduling for a policy such as EDF. In EDF, the
thread with the earliest deadline is executed until a thread enters
the system with an even earlier deadline. This policy is often used
for soft real-time services like real-time rendering. Recently, simultaneous multithreading (SMT)
has emerged to provide hardware support for multiple threads of
execution. Understanding scheduling policies and mechanisms is
critical for the SoC architect.
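The EDF policy described above can be sketched as a small software simulation. This is only an illustration of the dispatch rule, not a hardware-supported scheduler: the job tuple format and the function name are hypothetical, time advances in whole units, and each job is assumed to need at least one unit of execution.

```python
import heapq


def edf_schedule(jobs):
    """Simulate preemptive EDF in unit time steps.

    jobs: list of (name, release, exec_time, deadline) tuples
    (a hypothetical format chosen for this sketch).
    Returns (timeline, missed): the job run in each time unit (None when
    idle) and the names of jobs that finished after their deadline.
    """
    pending = sorted(jobs, key=lambda j: j[1])   # ordered by release time
    ready = []                                   # min-heap keyed on deadline
    timeline, missed = [], []
    t, i = 0, 0
    while i < len(pending) or ready:
        # Admit every job released by time t.
        while i < len(pending) and pending[i][1] <= t:
            name, _, exec_time, deadline = pending[i]
            heapq.heappush(ready, [deadline, name, exec_time])
            i += 1
        if not ready:
            timeline.append(None)                # CPU idles for one unit
            t += 1
            continue
        job = ready[0]                           # earliest deadline runs
        timeline.append(job[1])
        job[2] -= 1                              # one unit of work done
        t += 1
        if job[2] <= 0:
            heapq.heappop(ready)
            if t > job[0]:
                missed.append(job[1])
    return timeline, missed


# Job B arrives at t=1 with an earlier deadline and preempts job A.
jobs = [("A", 0, 3, 10), ("B", 1, 2, 4), ("C", 2, 1, 12)]
timeline, missed = edf_schedule(jobs)
print(timeline)   # ['A', 'B', 'B', 'A', 'A', 'C']
print(missed)     # []
```

The ready queue is re-examined every time unit, so a newly released job with an earlier deadline takes over the processor immediately, which is the preemption behavior the policy calls for.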
The simple CPU resource utilization equations below are a starting
point for analysis of processing resources. For systems that have
requirements to provide services without real-time deadlines, called
best effort, Equation 1 provides an estimate of processing
demands for a set of periodic service requests. The non-real-time utility
bound for scheduling simply states that the work queued to a system
over time must be less than full utility; otherwise, the work queue
will grow. Equation 2 provides the same estimate of processing
demands and a basic feasibility test for a set of services that
determines if sufficient processing margin exists for these services
to safely meet their required completion deadlines.
The real-time utility bound (Equation 2) was first presented by
Liu and Layland (see Resources) and is based
upon the observation that most real-time systems provide a set of
periodic services. In the basic test that Equation 2 provides
(the rate monotonic least upper bound), each service deadline
is assumed to be equal to its release period, so instances don't
overlap in time. In both equations, Ci is the execution time of
service i and Ti is its release period.
Equation 1. Non-real-time scheduling utilization bound
$$\sum_{i=1}^{n} \frac{C_i}{T_i} < 1.0$$
Equation 2. Real-time scheduling sufficient feasibility bound
$$\sum_{i=1}^{n} \frac{C_i}{T_i} < n \left( 2^{1/n} - 1 \right)$$
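As a rough illustration of how the two bounds might be checked in practice, the sketch below computes total utilization for a set of periodic services and compares it with each bound. The function names and the example service set are hypothetical; each service is given as a pair of execution time and release period in the same time unit.

```python
def utilization(services):
    """Total CPU utilization: the sum of C / T over (C, T) pairs."""
    return sum(c / t for c, t in services)


def rm_least_upper_bound(n):
    """Rate monotonic least upper bound n * (2^(1/n) - 1) for n services."""
    return n * (2 ** (1.0 / n) - 1)


def best_effort_ok(services):
    """Equation 1: the work queue stays bounded only if utilization < 1.0."""
    return utilization(services) < 1.0


def rm_feasible(services):
    """Equation 2: sufficient (not necessary) test, deadline equals period."""
    if not services:
        return True
    return utilization(services) < rm_least_upper_bound(len(services))


# Hypothetical service set: (execution time, period) in milliseconds.
services = [(1, 4), (2, 8), (3, 20)]
print(round(utilization(services), 3))                 # 0.65
print(round(rm_least_upper_bound(len(services)), 3))   # 0.78
print(best_effort_ok(services), rm_feasible(services)) # True True
```

Because Equation 2 is only a sufficient test, a service set that fails it is not necessarily infeasible; it simply needs one of the more precise analyses to decide.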
Since the introduction of the simple upper bound in Equation 2, much
more precise real-time feasibility tests have been derived, at the
cost of more computational complexity (see Resources for additional reading). In general, scheduling processing
resources requires decision logic for the next thread to execute
(dispatch), a policy for that decision (for example, priority or EDF), and a
test for feasibility. Feasibility analysis determines
whether sufficient resources exist to keep up with a workload or to
meet workload service deadlines. SoC architectures often include
multiple processors, symmetric or asymmetric processing, and SMT.
Scheduling mechanisms for dispatch, policy, and feasibility are an
important aspect of the design.
I/O interconnections
Since SoCs include processing, memory, and I/O on-chip by definition,
and most often include multiple processors, the I/O interconnection
on-chip is fundamental to the design. Figure 3 provides an overview
(taxonomy) of the many different schemes for interconnecting
processing elements, memory, and I/O devices for any system,
including SoCs. Each interconnection architecture has advantages and
disadvantages based upon cost, complexity of implementation,
complexity of usage by firmware and software, and performance. The
IBM Cell architecture, for instance, places multiple processor cores
on a single chip, so its interconnection scheme is critical to the
design.
Figure 3. Interconnection network taxonomy
Static interconnection topologies
- Point-to-point: One-to-one connection of nodes in a chain, N nodes,
N-1 connections, N-1 hops worst case
- Ring: One-to-one connection with the last node connected back to
the first, N nodes, N connections, N/2 hops to any node worst case
- Hub: One central node connected to N-1 nodes, N nodes, N-1
connections, two hops from any node to another
- Tree: One root node connected to M sub-nodes for N nodes,
N-1 connections, log(N) (base M) or fewer hops from the root to any node
- Square mesh: Each node connected to up to four nearest
neighbors, N nodes, 2*N - 2*square-root(N) connections,
2*(square-root(N) - 1) hops worst case
- Fully connected: Each node connected to all others, N
nodes, N(N-1)/2 connections, one hop all cases
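One way to check connection counts and worst-case hop counts for static topologies like these is to build each network as a list of bidirectional links and measure it directly. The sketch below is an assumption-laden illustration: the builder function names are hypothetical, the tree is omitted for brevity, and the square mesh builder assumes N is a perfect square with no wraparound links.

```python
from collections import deque
from math import isqrt


def chain(n):
    """Point-to-point: node i linked to node i + 1."""
    return [(i, i + 1) for i in range(n - 1)]


def ring(n):
    """Chain plus a link from the last node back to the first."""
    return chain(n) + [(n - 1, 0)]


def hub(n):
    """Node 0 is the central node linked to all others."""
    return [(0, i) for i in range(1, n)]


def square_mesh(n):
    """Grid without wraparound; assumes n is a perfect square."""
    side = isqrt(n)
    links = []
    for r in range(side):
        for c in range(side):
            node = r * side + c
            if c + 1 < side:
                links.append((node, node + 1))      # right neighbor
            if r + 1 < side:
                links.append((node, node + side))   # neighbor below
    return links


def fully_connected(n):
    return [(i, j) for i in range(n) for j in range(i + 1, n)]


def worst_case_hops(n, links):
    """Longest shortest path, found by breadth-first search from every node."""
    neighbors = {i: [] for i in range(n)}
    for a, b in links:
        neighbors[a].append(b)
        neighbors[b].append(a)
    worst = 0
    for start in range(n):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        worst = max(worst, max(dist.values()))
    return worst


n = 16
for name, build in [("point-to-point", chain), ("ring", ring), ("hub", hub),
                    ("square mesh", square_mesh),
                    ("fully connected", fully_connected)]:
    links = build(n)
    print(name, len(links), worst_case_hops(n, links))
```

Running the loop for a node count of interest gives a quick comparison of wiring cost (links) against latency (hops), which is the trade-off the architect is weighing.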
Dynamic interconnection networks
- Bus: Arbitrated transactions for read and write (split
transaction read, posted write optimizations)
- Blocking switch: Some active circuits prevent others from
becoming active
- Non-blocking switch: All circuits may be active
simultaneously
SoC memory
A modern GPC memory hierarchy can be complex: it might include
registers, L1 instruction, L1 data, L2 and L3 unified caches, on-chip
SRAM, off-chip DDR, and virtual memory backed by random access
storage. By comparison, an SoC design must fit the entire memory
system on-chip. So, most often, an SoC will include L1 instruction
and data caches with tightly coupled fast access SRAM. The SoC
design might be complicated by the inclusion of multiple processors,
especially for message passing or data sharing.
SoC cache considerations typically simplify the GPC cache design to
reduce levels and in some cases fully eliminate cache. For multiprocessor (MP) SoCs,
cache coherency is an issue that can be solved with hardware
protocols or software-managed caches. If a cache allows for DMAs, it
is said to be a push cache, a feature that can greatly
accelerate store and forward designs. The following list includes
key architectural decisions that must be made regarding an SoC cache
design:
- Will each processor incorporate a traditional GPC Harvard
architecture for cache?
- How will cache coherency be maintained for DMA I/O interfaces?
- If multiple processors will use cache, how will coherency be
maintained between processors sharing memory?
- How will cache be implemented and maintained?
- Will you be using a traditional GPC set-associative hardware
design, or perhaps a simpler direct-mapped cache? Or will you take a
software management approach to manage data in fast access memory
buffers?
Carefully consider the list of cache design options below from a hardware and software viewpoint:
- Hierarchical Harvard architecture: Code and data are
stored and cached separately, as opposed to a unified cache that
caches both code and data.
- Cache coherency: If data can be DMA'd (transferred through DMA) or imported to
memory under cache, the affected cache lines must be invalidated
before the data is read; if data can be exported, it must first be
flushed from cache to memory.
- MP cache coherency: MP designs that share data will have
cache coherency issues that can be solved by protocols implemented in
hardware or software such as MOESI.
- Push cache: A cache that can be DMA'd into and out of.
- Direct mapped cache: A cache where every memory line has only
one possible destination in the cache when loaded from main memory.
- N-way set associative cache: A cache where every memory line
has N possible destinations when loaded from main memory, with the
line to be replaced chosen according to a policy such as Least
Recently Used (LRU).
- Software cache: Software implements traditional cache
miss, hit, flush, invalidate, load operations using a low latency
memory.
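To make the difference between direct mapped and N-way set associative designs concrete, here is a toy model that only counts hits and misses. It is a sketch under stated assumptions, not a model of any real cache: the class name, the 32-byte line size, and the address-to-set mapping (line number modulo number of sets) are choices made for illustration, and setting the number of ways to one gives the direct mapped case.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy N-way set associative cache with LRU replacement.

    ways=1 behaves as a direct mapped cache. No data is stored;
    only hits and misses are tracked.
    """

    def __init__(self, num_sets, ways, line_size=32):
        self.num_sets = num_sets
        self.ways = ways
        self.line_size = line_size
        # One LRU-ordered dict of tags per set (oldest entry first).
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        line = address // self.line_size    # memory line number
        index = line % self.num_sets        # set this line maps to
        tag = line // self.num_sets         # identifies the line in its set
        entries = self.sets[index]
        if tag in entries:
            entries.move_to_end(tag)        # mark as most recently used
            self.hits += 1
            return True
        if len(entries) >= self.ways:
            entries.popitem(last=False)     # evict least recently used
        entries[tag] = True
        self.misses += 1
        return False


# Two addresses that collide in the same set, accessed alternately.
pattern = [0, 256] * 4

direct = SetAssociativeCache(num_sets=8, ways=1)
two_way = SetAssociativeCache(num_sets=4, ways=2)   # same total capacity
for addr in pattern:
    direct.access(addr)
    two_way.access(addr)

print(direct.hits, direct.misses)     # 0 8
print(two_way.hits, two_way.misses)   # 6 2
```

With equal capacity, the direct mapped cache thrashes on this pattern because both lines compete for one destination, while the two-way design keeps both resident after the first two misses.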
Due to the complexity of hierarchical cache designs, SoCs often
include simpler approaches to speed memory access, such as dual or
multiported memory that allows DMA transfers and multiprocessor access to
shared memory. In general, memory access latency is the most
likely cause of significant inefficiency in SoC processing, so
most designs either incorporate L1 caches or minimize latency
with tightly coupled on-chip SRAM.
Complicating an SoC design with multilevel cache such as a unified
L2 might not be worth the cost compared to inclusion of fast access
memory scratch pads and multiported memory buffers. The following
list provides an overview of some of the memory design decisions the
SoC architect must consider:
- Dual or multiported memory: Blocks can be DMA'd into or
out of while a processor simultaneously accesses
different blocks
- Content addressable memory (CAM): The equivalent of an
associative array where a data value presented to the CAM returns its
storage address(es)
- Tightly coupled memory: Memory with very low latency
access for a given processor
- Symmetric multiprocessing (SMP): Global shared memory with common access time for all
processors in an MP architecture
- Non-Uniform Memory Access (NUMA): Memory banks with faster access to some processor
nodes compared to others in an MP architecture
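The content addressable memory entry in the list above can be illustrated with a small software model. A hardware CAM compares all cells in parallel; this sketch only mimics the lookup behavior with a reverse index, and the class and method names are hypothetical.

```python
from collections import defaultdict

_EMPTY = object()   # sentinel marking a cell that has never been written


class ContentAddressableMemory:
    """Toy CAM: look up the address(es) that hold a given data value."""

    def __init__(self, size):
        self.cells = [_EMPTY] * size
        self.index = defaultdict(set)   # value -> set of addresses

    def write(self, address, value):
        old = self.cells[address]
        if old is not _EMPTY:
            self.index[old].discard(address)   # drop the stale mapping
        self.cells[address] = value
        self.index[value].add(address)

    def search(self, value):
        """Return the sorted addresses storing value (empty list if none)."""
        return sorted(self.index.get(value, ()))


cam = ContentAddressableMemory(8)
cam.write(2, 0xAB)
cam.write(5, 0xAB)
print(cam.search(0xAB))   # [2, 5]
cam.write(5, 0xCD)        # overwrite address 5
print(cam.search(0xAB))   # [2]
print(cam.search(0xEE))   # []
```

The search takes a data value and returns where it lives, the reverse of an ordinary memory read, which is why CAMs suit lookups such as routing tables and cache tag matching.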
Future directions for SoC discussion
Future discussion in this series will expand upon the resource view
presented here with specific design examples and discussion of
design concepts, including:
- Design methods
- Processing resource analysis, policies, and management
- I/O resource analysis, policies, and management
- Memory resource analysis, policies, and management
- Hardware and software resource trade-offs
- SoC debugging
- SoC EDA and verification
- SoC design case studies, including Cell, Blue Gene, and
others
Perhaps SoC design is really not that much different from system
design in general; however, the risks and rewards related to a single-chip solution are both greater. This series therefore explores
disciplined analysis and design for both hardware and software, since
the cost of improperly sizing resources in an SoC may be much higher
than a similar error in multichip designs. Future articles in
this series will examine single resources and trade-offs between
hardware and software complexity and cost based upon design decisions
that size SoC resources. Right-sizing resources in SoC design is
critical for these newly emerging single-chip architectures.
Resources
Learn
- The Wikipedia definition of a system-on-a-chip is helpful to understand what this emergent
design concept really means.
- This IBM Research Blue Gene project Web page provides an excellent
overview of Blue Gene architecture and news.
- Arnd Bergmann describes the Cell architecture in an interview on
developerWorks, "Arnd Bergmann on Cell" (June 2005).
- Writing code for the Cell chip is not quite like writing code for a
GPC. The Cell compiler helps abstract the specifics, though,
so efficient code can be generated from high-level languages like C.
- Chapter eight of Highly Parallel Computing, G.S. Almasi and A.
Gottlieb (Benjamin/Cummings Publishing, 1989) provides a good
overview of interconnection networks. Likewise, Optical Networks: A Practical Perspective, R.
Ramaswami and K. Sivarajan (Morgan Kaufmann Publishers, 2002)
provides a nice review of blocking and non-blocking switches.
- The IBM POWER5™ includes an SMT engine, allowing multiple
threads of execution to execute more efficiently on each processor.
- Thread scheduling policies specialized for soft real-time, including
EDF (Earliest Deadline First)
and LLF (Least Laxity First), are examples of custom SoC processor
scheduling that might be considered for continuous media applications
like gaming systems. For digital control and hard real-time SoC
applications, the hard real-time rate monotonic policy is safer.
Most GPCs provide a form of RR (Round Robin) timesliced scheduling
for fairness and responsiveness expected by users. SoC designers might
want to consider hardware support for more specialized scheduling and
thread control.
- SoCs incorporating multiple processors will often be asymmetric,
whereby processors are dedicated to specific services once and for
all; often the NUMA memory model is used to speed up each
processor, since interaction between processors is only occasional.
For GPCs with multiple processors, the level of memory sharing and
synchronization needed is hard to predict, so most often multiprocessor GPCs are based on an SMP architecture.
- Many SoCs have a flatter memory with TCM (Tightly Coupled Memory)
that is on-chip, with low latency access so that the memory and
processor core speeds are matched or closely matched. Off-chip
memory or high latency access memory will slow down a processor
significantly without cache. Many SoCs include L1 (Level 1 single
cycle access) instruction and data cache on the order of 16 to 128KB. When multiple processors are incorporated, each with its own
cache, the coherency of data shared between cores through global
on-chip memory becomes an issue. The SoC designer should consider a cache coherency
protocol such as MOESI, MOSI, MESI, or MSI.
- The Wikipedia definition of rate monotonic scheduling is
a good place to start to gain an understanding of real-time
processing resource analysis, but the serious designer should consult
the original paper and more current and precise rate monotonic
analysis methods. Liu and Layland's original paper, "Scheduling algorithms for
multi-programming in a hard real-time environment" (Journal of
the Association for Computing Machinery, 1973) is one of the most
frequently cited original works on real-time processor scheduling.
The book Meeting Deadlines in Hard Real-Time Systems: The Rate
Monotonic Approach, L. Briand and D. Roy (IEEE Computer
Society Press, 1997) provides a more comprehensive overview of
current rate monotonic methods of analysis.
Get products and technologies
- The Cell chip will be embedded in the PlayStation 3 gaming system. Competing
systems, including the new Microsoft Xbox 360 with 3.2GHz PowerPC®-based
chip and the Nintendo Revolution, will also be Power Architecture
technology-based.
- In general, SoCs can be designed as custom ASICs using soft cores
such as the Tensilica Configurable
SoC, or using a combination of hard cores such as the PowerPC 405
found in the Virtex-II Pro 4
Reconfigurable SoC and custom FPGA surrounding logic. Tensilica-configurable SoCs include instruction set extensibility with TIE and
VLIW FLIX. Definition of new VLIWs (Very Long Instruction Words)
is one way to accelerate common firmware computations.
About the author
Dr. Sam Siewert is an embedded system design and firmware engineer
who has worked in the aerospace, telecommunications, and storage
industries. He also teaches at the University of Colorado at Boulder
part-time in the Embedded Systems Certification Program, which he
co-founded. His research interests include autonomic computing,
firmware/hardware co-design, microprocessor/SoC architecture, and embedded
real-time systems.