SoC drawer: SoC design for hardware acceleration, Part 1

Building today's software to become tomorrow's hardware engine

System-on-chip (SoC) designs offer the opportunity to migrate functionality initially implemented in software and firmware into hardware acceleration engines and state machines. Reconfigurable SoCs based on processors in FPGA fabric, such as the PowerPC® 405 in the Xilinx Virtex-4, provide a platform for rapid migration of functionality from PowerPC software and firmware to the FPGA logic. Configurable application-specific integrated circuit (ASIC) SoCs can be optimized similarly over product revisions as SoC ASIC roadmap configurations are defined. This article examines methods for software design, specification, and implementation that will simplify future efforts to offload software functionality to hardware. Basic video and image processing algorithms provide working example algorithms for this article and the next.

Sam Siewert (Sam.Siewert@Colorado.edu), Adjunct Professor, University of Colorado

Dr. Sam Siewert is an embedded system design and firmware engineer who has worked in the aerospace, telecommunications, and storage industries. He also teaches at the University of Colorado at Boulder part-time in the Embedded Systems Certification Program, which he co-founded. His research interests include autonomic computing, firmware/hardware co-design, microprocessor/SoC architecture, and embedded real-time systems.



06 June 2006

SoCs, both as configurable ASICs and as reconfigurable computing platforms, have become common in less than a decade. An SoC is simply a chip on which an entire system is designed into a single part or a reconfigurable hybrid part. Reconfigurable SoCs offer the ultimate flexibility to implement hybrid services that blur the traditional line between hardware and software.

This article is the first part of a two-article series, which examines strategies for designing algorithms that will be initially implemented in software but ultimately migrated to a hardware field-programmable gate array (FPGA; a reconfigurable SoC) or ASIC SoC implementation. New protocols, novel algorithms, and still immature algorithms for products are often implemented in software to mitigate risk and provide for simple field upgrade. When such algorithms are mature and well understood, moving them into hardware can help reduce SoC cost, accelerate performance dramatically, and free up resources for new features.

From software to hardware

Historically, hardware and software interfaces have been carefully designed to strictly separate services into software threads and hardware state machines. Some more extensible methods have included microprogrammable engines that extend the register and memory-mapped I/O interfaces of state machines to include very simple instruction sequences. For example, video encoders provide capture of NTSC or HDTV analog inputs into a wide range of compressed and uncompressed digital formats that can be programmed with microcode. Likewise, the direct memory access (DMA) engines that move encoded video data are often microprogrammable as well, so that stream buffering and synchronization with processing can be programmed. The Conexant Bt878A NTSC encoder provides just this type of sequencing with programmable video capture formats, DMA transfer, and host interrupt assertion. (Find out more about the Bt878 in Resources below.) This ASIC has evolved since its introduction as the Brooktree Bt848: additional sequencer capabilities have been added, including MPEG compression. Video encoder chips, which have a history of automating what was previously done with software, provide a good illustration of applications that can be pushed down into hardware.

Ideally, in the future, the hardware/software interface will allow system designers to fluidly move functions and services between hardware and software implementation without requiring a major re-engineering effort. How? Quite simply, by incorporating more formal specification for services and functions in system design languages that lend themselves to soft, FPGA, or hard realizations. The key is automating as much code generation, compilation, and synthesis for hardware as possible, and leaving much of development to the tool chains rather than the artistry of the programmer or hardware engineer. Significant research and effort to define new languages that extend hardware definition languages (HDLs) to interface to test bench procedural languages, and extension of procedural languages for hardware description like SystemC, is pushing in this direction. While these language extensions and attempts to integrate design continue to evolve, this article suggests methods that make it easier to transition algorithms between well-known languages like C and Verilog.

Some processing constructs that can be implemented in both hardware and software specification languages are:

  • Logical expressions: Simple C expressions and combinational logic in HDL

  • Functional state machines: Mealy/Moore state machines, clocked in HDL, and run to completion in software. The distinct difference between the hardware state machine and software is the timing on transitions, with hardware requiring a single clock edge and software often requiring multiple clocks. Careful design of software state machines, with all processing in a state rather than on the transition, helps minimize the differences.

  • Execution sequences: Instruction-driven clocked state machines in HDL and command-driven interfaces in software
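To make the state machine construct above concrete, here is a minimal C sketch of a run-to-completion machine in which each call to a step function stands in for one clock edge. All of the work is done inside the case for the current state, and the transition only selects the next state. The machine itself (an accumulator with hypothetical names such as fsm_step and ST_ACCUMULATE) is purely illustrative and is not taken from any particular design.

```c
#include <stdint.h>

/* Hypothetical three-state machine; names and behavior are illustrative only. */
typedef enum { ST_IDLE, ST_ACCUMULATE, ST_DONE } fsm_state_t;

typedef struct {
    fsm_state_t state;  /* current-state register */
    uint32_t    sum;    /* datapath register */
} fsm_t;

typedef struct {
    int     start;      /* request a new accumulation */
    int     last;       /* current sample is the final one */
    uint8_t sample;     /* input data */
} fsm_inputs_t;

void fsm_reset(fsm_t *m)
{
    m->state = ST_IDLE;
    m->sum = 0;
}

/* One call models one clock edge: do the state's work, then pick the next state. */
void fsm_step(fsm_t *m, const fsm_inputs_t *in)
{
    fsm_state_t next = m->state;

    switch (m->state) {
    case ST_IDLE:
        m->sum = 0;                      /* processing belongs to the state */
        if (in->start) next = ST_ACCUMULATE;
        break;
    case ST_ACCUMULATE:
        m->sum += in->sample;
        if (in->last) next = ST_DONE;
        break;
    case ST_DONE:
        /* hold the result until a new start request arrives */
        if (in->start) next = ST_IDLE;
        break;
    }

    m->state = next;                     /* the transition itself does no work */
}
```

Because the state register and the datapath register are explicit and updated in one place, a structure like this maps fairly directly onto a clocked always block in an HDL, which is the property the list above is after.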

While hardware implementations have the distinct advantage of true parallel execution, the emergence of simultaneous multithreading (SMT) to provide low-level parallelism for software sequences also provides true microparallelism and speed-up for software threads of execution. Combinational logic and expressions in software can be sped up with vector processing as well, using engines like the PowerPC's AltiVec. Viewed at this low and simplistic level, why is software specification so different from hardware?

Software designers have traditionally not had to worry about timing, except when it comes to synchronizing virtually parallel threads of execution and thread/interrupt interfaces. Hardware engineers, on the other hand, spend significant effort in timing verification, working with clock-cycle-accurate models and simulations of physical devices. One plausible explanation for the degree of separation between hardware and software specification is that software programmers have not exercised discipline in specification because hardware abstracts timing and for the most part has left software engineers with a simpler working model.

Any system designer who has worked on an FPGA project has most likely had the following experience: An algorithm first implemented in C is offloaded to hardware because it overloads a general purpose processor with high frequency service requests. In your experience, what was it about the C code that made it hard or easy to migrate into an FPGA state machine or sequencer? If such algorithms implemented in C were first designed by a software engineer thinking in terms of logical expressions, functional state machines, and clearly defined parallel and serialized execution sequences, then my bet is that the port from C code to Verilog and subsequent hardware implementation was not too difficult. However, if the C code instead included functions with multiple entry and exit points, no clear definition of what could be done in parallel versus serial synchronous sequences, numerous side effects, unclear state or execution context, and reliance upon operating system mechanisms for correctness, then the port was probably difficult.

Quantum programming

Quantum programming defines a software engineering process using UML statecharts and C/C++ state machine implementation that lends itself to migration from software run-to-completion threads to hardware state machines. See Resources for a link to more on quantum programming.

The idea of migrating general-purpose computing (GPC) Windows® or Linux® code directly to an FPGA seems preposterous. By comparison, the problem of migrating carefully designed firmware based upon extended finite state machine design is much more tractable. Firmware engineers write code that boots hardware and operating systems to make them usable by application programmers. The theory that this abstraction makes programming more accessible has played out well on the GPC platform. In fact, it has played out so well that an application programmer can now build applications in a virtual environment that are very difficult to bring back down into a hardware implementation. An emergent class of applications, for which hardware acceleration is critical and which need the ability to move between software and hardware with quick and even automated adaptation, includes:

  • Real-time media: Video and audio on demand, virtual presence, visualization, gaming, and education/training

  • Network protocols: Almost always first implemented with software/firmware for quick adaptation to changing standards, but ultimately migrated to hardware for acceleration and cost reduction

  • Digital signal processing: A well-established hardware acceleration domain with specialized instructions

  • Compression/decompression: A well-established off-load requiring constant re-engineering to move algorithms from software to hardware

  • Encryption/cryptanalysis: Another well-established off-load domain

  • Advanced human/computer interaction: Includes haptic feedback, speech to text/text to speech, natural language processing, voice recognition, computer vision, and virtual reality

  • Digital coding and stream transforms: For XML, VRML, video/audio streams, and emergent streams for immersive and real-time human-computer interaction

  • Data mining, search, and associative recall: Emerging wisdom in computer architectures is that rapid intelligent recall and information search with generalization require massive parallelism.

Most interesting future applications appear to require an end to the strict separation of hardware and software design and the convergence of design methods, or perhaps the definition of new forms of systems engineering that are hardware/software agnostic. These applications all share at least one common characteristic: they can be substantially sped up with parallelism and highly interconnected networks of compute engines.


Design methods for future hardware offload

As a working example, imagine a very simple futuristic application that recognizes the faces of employees and greets them, and stops visitors and asks them for identification and to wait for an escort. It's not clear at all that this is even remotely possible with a GPC approach running a software application -- at least, not with any measure of reliability, and not at a reasonable cost. Some computer vision researchers would argue that in fact this can't be done with GPC/software approaches at all; others believe that it will eventually be possible with improvements in GPC computing power.

Rather than get embroiled in the argument, let's look at the algorithms and their specification, and how these can be implemented in software so that they can be more easily offloaded to hardware for acceleration with parallel, highly interconnected execution. Obviously a few SoC drawer articles won't solve this problem, but I'll attack it to explore methods for specification that will allow us to establish some software algorithms. I'll implement these in an HDL and FPGA or SystemC simulator in the next article on this topic.

Listed in order of increasing degree of difficulty, here are some basic functions and services that you will eventually want to implement in hardware, but that we'll implement in software for now:

  • Analog-to-digital sensitivity and lighting aperture control: Camera analog-to-digital conversion and aperture settings can be adjusted in real time to prevent washout (digital overflow) from too much light or blackouts from too little.

  • Tilt/pan scanning: The tracking of objects, and, perhaps more importantly, the scanning of scenes, is critical. The human eye uses saccadic motion, as well as luminance-sensitive portions of the retina, to scan scenes with the color-sensitive fovea. (See Resources for more on human vision.)

  • Edge enhancement and detection: The first step in analysis of a scene is to find object boundaries.

  • Image segmentation and bounding box computations: The second step in analysis is to determine object extents and to parse the scene fully.

  • Stereo range estimation to segmented objects: You need to use two or more cameras and triangulation to estimate distances to parsed objects and relative separations of objects in three-dimensional space.

  • Object classification: You need to bin segmented objects based upon properties, including shape, color, context, and relative feature sizes -- in general, invariant properties that can assist with recognition.

  • Recognition: This involves the execution of models like a Bayesian belief network or hidden Markov model to group objects into meta-objects like faces, which are composed of a nose, eyes, ears, and so on.

  • Matching: This consists of associative recall of invariant features from a database to find potential matches for the face. Research for methods to implement content-addressable and associative memory at IBM® goes back to the 1960s. Much of the basic technology has found its way into modern set associative cache. The Hopfield ANN (artificial neural network) has also been shown to operate as an associative memory (see Resources for more on Hopfield and ANN).
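As one small illustration of how a function from the list above can be specified as a simple expression, here is a sketch of the stereo range estimate. It assumes the textbook case of two identical, parallel, rectified pinhole cameras, where range is focal length times baseline divided by disparity. The function name stereo_range and its parameters are hypothetical, and a real system would need calibration and correspondence matching that this sketch leaves out.

```c
/*
 * Range to an object from the disparity between two camera views.
 * Assumes parallel, rectified pinhole cameras (an idealization).
 *   focal_px     - focal length expressed in pixels
 *   baseline     - distance between the two camera centers
 *   disparity_px - horizontal shift of the object between the two images
 * Returns the range in the same units as baseline, or -1.0 when the
 * disparity is not positive (object at infinity or a bad match).
 */
double stereo_range(double focal_px, double baseline, double disparity_px)
{
    if (disparity_px <= 0.0)
        return -1.0;
    return (focal_px * baseline) / disparity_px;
}
```

A single multiply and divide like this is a logical expression in the sense used earlier in this article, so it is the kind of step that should move to combinational or pipelined hardware with little redesign.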

The last two tasks in the list above are by far the hardest to perform accurately. Our working example may seem a bit far reaching, but when you consider security applications and management of ever increasing Web-based multimedia content, the simple automation of face recognition has huge benefits to society and substantial markets as well as disturbing implications of Big Brother watching (see the GIO report in Resources). This is also a good motivating example for an emergent application that truly benefits from a hybrid reconfigurable SoC architecture.

The image in Figure 1 is a red dot hanging on a wall. The image itself is 320 by 240 pixels, in an RGB digital format, originally PPM, and is included here as JPEG. It would be trivial for a human observer to track the center of the dot and to mark the XY center on the image. For computer vision, this requires edge detection and image rastering to determine the centroid.

Figure 1. Example image prior to edge enhancement
Example image prior to edge enhancement

To make the job of finding edges and the centroid of the target simpler, a first pass applies the algorithm in Listing 1 in order to enhance edges and to filter noise in the image. Please note that the code in Listing 1 is not complete; it provides only the basic edge-enhancing kernel transformation of the image array.

This basic image transformation is the working example I'll use for the next installment in this series, when I'll show how to re-engineer this code first as a state machine for transforming a stream of video frames and then into SystemC to provide a state machine specification suitable for hardware offload.

Notice that the algorithm applies the edge enhancement kernel by looping through the image buffer and applying the point spread function (PSF) to a pixel and all of its nearest neighbors (see Resources for more on PSF). In the next installment, all of the file-based operations will be eliminated, and the image will instead be assumed to have been DMA transferred into memory by a video encoder like the Bt878. In the meantime, this quick and simple implementation will allow you to become familiar with the algorithm as simple procedural C code.
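The per-pixel step can also be written compactly as a loop over the 3x3 neighborhood. The following is a sketch rather than the article's listing: the function name apply_psf_3x3, the floating-point accumulator, and the clamp to the 8-bit range are assumptions made for illustration.

```c
#define IMG_W 320
#define IMG_H 240

/*
 * Apply a 3x3 kernel to the pixel at (row, col) of one 8-bit channel.
 * The caller must keep 1 <= row <= IMG_H-2 and 1 <= col <= IMG_W-2 so
 * that every neighbor lies inside the image.
 */
unsigned char apply_psf_3x3(const unsigned char *chan, const float psf[9],
                            int row, int col)
{
    float acc = 0.0f;
    int dr, dc;

    /* Weighted sum of the pixel and its eight nearest neighbors */
    for (dr = -1; dr <= 1; dr++)
        for (dc = -1; dc <= 1; dc++)
            acc += psf[(dr + 1) * 3 + (dc + 1)] *
                   (float)chan[(row + dr) * IMG_W + (col + dc)];

    /* Clamp to the 8-bit range, then round to the nearest integer */
    if (acc < 0.0f)   acc = 0.0f;
    if (acc > 255.0f) acc = 255.0f;
    return (unsigned char)(acc + 0.5f);
}
```

With a kernel whose weights sum to 1 (for example, 2.0 at the center and -0.125 for each neighbor), a flat region passes through unchanged while a pixel brighter than its neighbors is pushed brighter still, which is what enhances the edges.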

Listing 1. Simple edge enhancement
// This code assumes that image data has been loaded from a file format or from a video
// encoding stream source like the Bt878.  The function applies the edge enhancement 
// PSF described on page 402 of The Scientist and Engineer's Guide to
// Digital Signal Processing.
unsigned char R[76800];
unsigned char G[76800];
unsigned char B[76800];
unsigned char convR[76800];
unsigned char convG[76800];
unsigned char convB[76800];

#define K 1.0f
// The kernel weights are negative and fractional, so they must be floats
float PSF[9] = {-K/8.0f, -K/8.0f, -K/8.0f, -K/8.0f, K+1.0f, -K/8.0f, -K/8.0f, -K/8.0f, -K/8.0f};

void enhance_edges(void)
{
    int i, j;
    float temp;

    // Skip first and last rows and columns, which lack a full set of neighbors
    for(i=1; i<239; i++)
    {
        for(j=1; j<319; j++)
        {
            // Accumulate in a float, then clamp to 0..255 before storing
            temp=0.0f;
            temp += PSF[0] * (float)R[((i-1)*320)+j-1];
            temp += PSF[1] * (float)R[((i-1)*320)+j];
            temp += PSF[2] * (float)R[((i-1)*320)+j+1];
            temp += PSF[3] * (float)R[((i)*320)+j-1];
            temp += PSF[4] * (float)R[((i)*320)+j];
            temp += PSF[5] * (float)R[((i)*320)+j+1];
            temp += PSF[6] * (float)R[((i+1)*320)+j-1];
            temp += PSF[7] * (float)R[((i+1)*320)+j];
            temp += PSF[8] * (float)R[((i+1)*320)+j+1];
            if(temp<0.0f) temp=0.0f;
            if(temp>255.0f) temp=255.0f;
            convR[(i*320)+j]=(unsigned char)temp;

            temp=0.0f;
            temp += PSF[0] * (float)G[((i-1)*320)+j-1];
            temp += PSF[1] * (float)G[((i-1)*320)+j];
            temp += PSF[2] * (float)G[((i-1)*320)+j+1];
            temp += PSF[3] * (float)G[((i)*320)+j-1];
            temp += PSF[4] * (float)G[((i)*320)+j];
            temp += PSF[5] * (float)G[((i)*320)+j+1];
            temp += PSF[6] * (float)G[((i+1)*320)+j-1];
            temp += PSF[7] * (float)G[((i+1)*320)+j];
            temp += PSF[8] * (float)G[((i+1)*320)+j+1];
            if(temp<0.0f) temp=0.0f;
            if(temp>255.0f) temp=255.0f;
            convG[(i*320)+j]=(unsigned char)temp;

            temp=0.0f;
            temp += PSF[0] * (float)B[((i-1)*320)+j-1];
            temp += PSF[1] * (float)B[((i-1)*320)+j];
            temp += PSF[2] * (float)B[((i-1)*320)+j+1];
            temp += PSF[3] * (float)B[((i)*320)+j-1];
            temp += PSF[4] * (float)B[((i)*320)+j];
            temp += PSF[5] * (float)B[((i)*320)+j+1];
            temp += PSF[6] * (float)B[((i+1)*320)+j-1];
            temp += PSF[7] * (float)B[((i+1)*320)+j];
            temp += PSF[8] * (float)B[((i+1)*320)+j+1];
            if(temp<0.0f) temp=0.0f;
            if(temp>255.0f) temp=255.0f;
            convB[(i*320)+j]=(unsigned char)temp;
        }
    }
}

After this algorithm is applied, the image is transformed as Figure 2 shows. Applying this simple algorithm can use quite a bit of CPU on a GPC machine, since the code must walk through all 76,800 pixels in the image sequentially. Offloading this to an FPGA for parallel application of the kernel and filtering would accelerate the transform and free the processor for other work. We'll do this in the next SoC drawer article, specifying the hardware algorithm with SystemC.

Figure 2. After edge enhancement and filtering
After edge enhancement and filtering
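To put a rough number on the CPU cost of the edge enhancement pass, the sketch below counts multiply-accumulate operations for a 3x3 kernel. It counts only the interior pixels that the loops in Listing 1 actually visit (238 x 318 = 75,684 of the 76,800 pixels), and it ignores loads, stores, and index arithmetic, so treat the result as a lower bound. The function name kernel_macs_per_frame is hypothetical.

```c
/*
 * Multiply-accumulate count for one pass of a 3x3 kernel over the
 * interior pixels of an image (border rows and columns are skipped).
 */
unsigned long kernel_macs_per_frame(unsigned long width,
                                    unsigned long height,
                                    unsigned long channels)
{
    unsigned long interior;

    if (width < 3UL || height < 3UL)
        return 0UL;                      /* no pixel has a full neighborhood */

    interior = (width - 2UL) * (height - 2UL);
    return interior * 9UL * channels;    /* 9 kernel taps per pixel per channel */
}
```

For the 320 by 240 RGB image used here, kernel_macs_per_frame(320, 240, 3) gives 2,043,468 operations per frame. At a streaming rate of 30 frames per second, that is roughly 61 million multiply-accumulates per second for this one filter, all of which are independent of one another and therefore good candidates for parallel hardware.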

Finally, the enhanced image is rastered again to find the edges and the centroid of the target. This is also a costly operation in software on a GPC machine that can be accelerated with FPGA hardware.

Figure 3. Centroid determined with enhanced image
Centroid determined with enhanced image
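The article does not list the centroid code, so the following is only a sketch of one simple way to raster an enhanced channel and locate the target: threshold each pixel and average the coordinates of the pixels that pass. The function name find_centroid, the use of a single channel, and the fixed threshold are all assumptions made for illustration.

```c
/*
 * Raster one 8-bit channel and compute the centroid of all pixels
 * brighter than threshold. Returns the number of such pixels; *cx and
 * *cy are written only when at least one pixel qualifies.
 */
long find_centroid(const unsigned char *chan, int width, int height,
                   unsigned char threshold, double *cx, double *cy)
{
    long count = 0;
    double sum_x = 0.0, sum_y = 0.0;
    int row, col;

    for (row = 0; row < height; row++) {
        for (col = 0; col < width; col++) {
            if (chan[row * width + col] > threshold) {
                sum_x += (double)col;    /* accumulate X coordinates */
                sum_y += (double)row;    /* accumulate Y coordinates */
                count++;
            }
        }
    }

    if (count > 0) {
        *cx = sum_x / (double)count;
        *cy = sum_y / (double)count;
    }
    return count;
}
```

For the red dot example, one plausible use is to call this on the enhanced red channel, for instance find_centroid(convR, 320, 240, 200, &x, &y), where the threshold of 200 is an arbitrary value that would need tuning for real lighting. The compare and the three running sums are per-pixel operations with no dependence on pixel order, which is why this raster pass is also a reasonable offload candidate.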

Conclusion and a look ahead

Specific classes of applications can not only benefit from the parallel execution capabilities of hardware state machines and combinational logic, but arguably require such an implementation. The image processing examples presented in this article clearly show the benefit. More sophisticated applications such as facial recognition will need a hybrid hardware/software SoC system design.

The example presented here is not nearly as complex as face recognition, but clearly even simple tasks, such as segmentation of a scene into objects or tracking object centers, are computationally intensive and would benefit immensely from parallel execution. Beginning with this working example designed for C and a GPC system to transform files, the next SoC drawer installment looks at how you can implement this functionality as a service so that a POSIX thread can respond to a frame-ready event and compute the centroid at a streaming frame rate (typically 30 frames/sec). Then, you'll see how this can be accelerated with a reconfigurable hybrid SoC architecture.

Resources

Learn

  • Real-Time Embedded Components and Systems, Sam Siewert (Thomson Delmar, 2006): Get more resources on real-time computer vision systems and software, including a VxWorks version of the Linux BTTV Bt878 driver.
  • Bt878A video encoder: Conexant provides data sheets and detailed descriptions.
  • Quantum programming: State machine programming methods like this might offer a software engineering design approach more amenable to hardware automation down the road.
  • On Intelligence, Jeff Hawkins (Times Books, 2004): Describes intelligence as associative memory and the brain's unique ability to predict the future based upon neocortical memory.
  • Saccadic motion: Learn more about how the human eye works on Wikipedia.
  • Global Innovation Outlook: The social implications of technology and innovation are being explored through this IBM project.
  • "Associative memory with ordered retrieval," R. R. Seeber and A. B. Lindquist (IBM Journal of Research and Development, 1962): Very early work on associative memory devices.
  • The Scientist and Engineer's Guide to Digital Signal Processing: An excellent basic guide to the many algorithms used in computer vision processing.
  • Stanford University multi-camera array: Some very interesting ongoing research in arrays and image processing that employs large numbers of low-cost cameras. With the large number of cameras used for this type of work, it seems that an SoC with a multi-channel video encoder would be quite helpful.
  • Introduction to the Theory of Neural Computation, John Hertz, Anders Krogh, and Richard G. Palmer (Addison-Wesley, 1991): This book includes examples of Hopfield network associative memory and many resources, including example code.
  • "An introduction to neural networks," Andrew Blais and David Mertz (developerWorks, July 2001): A good start for those wishing to learn more about ANNs.
  • "Weave a neural net with Python," Andrew Blais (developerWorks, June 2004): A good article on Python-based ANNs.
  • Image Database Navigation and Visual Browsers: This site summarizes IBM's basic research on more intelligent Web browsers that will automatically catalog content and provide advanced searches. The utility of these types of browsers is growing as content grows on the World Wide Web and in home and corporate multimedia filesystems and databases.
