Stream Processors Announces Breakthrough Digital Signal Processor Architecture

Building on more than eight years of research at Stanford and MIT, startup reveals a new class of DSPs that makes parallel processing simple

Emerging from two years of commercial development and more than eight years of university research, Stream Processors, Inc. (SPI) today unveiled a breakthrough digital signal processor (DSP) architecture that removes the barriers to programming high-performance, massively parallel processors. Detailed in a paper, “A 512 GOPS Stream Processor for Signal, Image and Video Processing,” being presented at this year’s International Solid State Circuits Conference, SPI’s Stream Processor™ Architecture combines unmatched levels of DSP performance with a simple and efficient C-programming model. The approach has resulted in the development of the industry’s highest-performance family of DSPs, capable of delivering greater than an order of magnitude (more than 10 times) higher performance than current commercially available DSP solutions.

To place this level of processing performance in context, a single fully software-programmable SPI Stream Processor is capable of encoding H.264 high-definition 1080p video in real-time with enough processing power to perform customer-specific video enhancements, image tuning, and video content analysis. Achieving that level of performance using traditional DSPs could require as many as 15 chips, significantly increasing engineering effort, development time and overall project risk.

Making Parallelism Work

Once confined to the realm of supercomputers, the concept of parallel or multi-core processing, using more than one central processing unit (CPU) or processor core to increase computation speed, has long been seen as a way to achieve higher levels of performance. Recently, multi-core solutions such as the AMD dual-core Opteron and the Intel Core2 Duo have been shown to be effective when running large independent tasks at the operating system level.
However, multi-core architectures have not been successful at accelerating individual embedded DSP applications.

“The key problem has been that writing software to take full advantage of the increased processing power offered by parallelism has always been time-consuming and difficult,” said Will Strauss, president of the market research firm Forward Concepts. “While Intel and AMD have started to solve this problem in the personal computing and server markets with multi-core processors, the problem remains in embedded markets. These markets require an energy-efficient, programmable digital signal processor with the computational capacity of tens to hundreds of cores applied to individual tasks. By re-thinking the roles of the architecture, programming model and compiler tools, SPI has created a new class of DSPs that makes parallel processing practical.”

Prof. Bill Dally, co-founder, chairman and chief science officer for SPI, added, “When we began our research 12 years ago, we quickly realized that traditional architectures were running out of steam. A new approach was needed. Simply putting more cores on a chip doesn’t address the real issues of bandwidth, data locality, and ease of programming. Today’s demanding embedded applications like H.264 HD encoding and analytics, image processing, video surveillance, wireless communication, search, and encryption, all benefit from the performance gain and programming simplicity offered by SPI’s Stream Processor Architecture.”

The Stream Processor Architecture

At the heart of SPI’s Stream Processor Architecture is a high-performance data-parallel unit (DPU), which can sustain hundreds of billions of operations per second (hundreds of GOPS). Two industry-standard CPU cores are included to support the DPU: a system CPU runs Linux and handles I/O, while a second core runs the main DSP threads and offloads processing of compute-intensive kernel functions to the DPU.
A key feature of the architecture is its compiler-managed memory hierarchy, which leverages the data-parallelism and locality characteristics of signal processing applications. A simple C programming model allows specification of compute-intensive kernel functions that process streams of data records, enabling the compiler and hardware to efficiently manage on-chip memory and synchronize runtime direct-memory access (DMA). This approach eliminates the need for a cache and greatly increases predictability of throughput, simplifying the overall programming task.

The architecture exploits multiple levels of parallelism:

• task-level parallelism between the system processor, DSP processor and DPU
• data-level parallelism (DLP) with multiple lanes executing the same instructions on different data in parallel
• instruction-level parallelism (ILP) via a very long instruction word (VLIW) that drives multiple arithmetic logic units (ALUs) per lane
• sub-word single instruction multiple data (SIMD) in which each ALU can operate on multiple operands

On the DPU, a kernel function runs identically on every lane, with each lane processing different data. Built-in support for conditionals and high-speed inter-lane communications provides more versatility than conventional SIMD architectures. The single-threaded execution model provides inherent load-balancing, eliminating the need for code partitioning across multiple cores. Another advantage of SPI’s architecture is the ability to easily scale to higher levels of performance by adding more lanes without the need to restructure software.

Development Tools

SPI’s RapiDev tool suite supports a standard development and debug flow using C language tools running on a Windows or Linux platform. RapiDev leverages the predictability of SPI’s Stream Processor Architecture to provide a linear path to performance-optimized code.
The tool suite enables application source code compatibility across devices with different numbers of lanes and ALUs, providing greater scalability and portability.
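
To make the lane model described above more concrete, the following is a minimal sketch in plain C of how one kernel function might be applied to a stream of records on devices with different lane counts. It is an illustration only and does not use SPI's tools or APIs: the names scale_and_offset and run_kernel, the integer record type, and the lane counts are all hypothetical, and the inner loop merely emulates sequentially what the lanes of a DPU would do in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: the per-record computation that every lane would
 * run identically. It is not taken from SPI's programming model. */
static int scale_and_offset(int sample)
{
    return 3 * sample + 1;
}

/* Hypothetical emulation of a data-parallel unit with `lanes` lanes.
 * In each step, lane `lane` handles record `base + lane`. On parallel
 * hardware the inner loop would run concurrently; here it runs in order. */
static void run_kernel(const int *in, int *out, size_t n, size_t lanes)
{
    assert(lanes > 0); /* a zero lane count would never advance `base` */

    for (size_t base = 0; base < n; base += lanes) {
        for (size_t lane = 0; lane < lanes && base + lane < n; ++lane) {
            out[base + lane] = scale_and_offset(in[base + lane]);
        }
    }
}
```

Calling run_kernel on the same input with a lane count of 8 and then 16 produces identical output arrays. Only the lanes argument changes and the kernel source stays the same, which is the kind of source compatibility across lane counts that the release describes.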