Many advanced applications for high-performance embedded computing demand massive amounts of computing power. As radar systems and other sensor systems get more complicated, the computational requirements are becoming a bottleneck. Real-time, on-platform systems in applications such as persistent surveillance and electronic warfare have huge Gflops and/or Gbyte/s requirements, and strict power and size limits. Traditional CPU-based boards simply do not meet computational or size, weight and power (SWaP) constraints.
A graphics processing unit or GPU (also occasionally called a visual processing unit or VPU) is a specialized processor that offloads graphics rendering from the CPU and accelerates it. The highly parallel structure of GPUs makes them more effective than general-purpose CPUs for a range of complex algorithms. Historically, GPUs have been viewed as compelling, programmable floating-point graphics rendering engines designed specifically for personal computers, workstations and gaming consoles. In addition, embedded GPU solutions that meet the stringent requirements of high-performance signal processing have been scarce.
With recent architectural advancements, however, the algorithmic scope of GPUs has grown dramatically. Non-video applications such as signals intelligence (SIGINT), oil and gas exploration, security, signal processing and video transcoding are now addressed using GPUs with excellent results. GPUs excel at traditional signal processing algorithms (like the FFT). Industry benchmarks of GPUs in high-performance signal processing applications, such as those from the Georgia Tech Research Institute, have shown performance improvements of 20x and more over other processors. The ability of GPUs to process radar, infrared sensor and video data faster than a typical CPU makes them ideal for high-performance embedded computing applications where performance metrics are paramount.
Large performance differences exist between leading CPUs and GPUs, as shown in recent lab tests that measured theoretical peak performance, 1K FFT performance, and performance per watt for both on the Intel Core i7, the Freescale MPC8641D and the ATI RV770 GPU. The theoretical peak is not taken from an actual benchmark run; it is obtained by counting the number of floating-point additions and multiplications (in full precision) that can be completed during a period of time, usually the cycle time of the machine. In terms of theoretical peak performance, the Intel Core i7 reached 94 Gflops, the MPC8641D 20 Gflops, and the ATI RV770 GPU 1200 Gflops (Figure 1).
Theoretical peak performance comparison between the Intel Core i7, the Freescale MPC8641D and the ATI RV770 GPU.
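The counting method behind a theoretical peak can be sketched as the product of execution units, floating-point operations per unit per cycle, and clock rate. The unit counts and clock rates below are assumptions chosen because they reproduce the quoted peaks; the article itself does not state them, so treat this as an illustration rather than the configuration actually tested.

```python
def peak_gflops(units: int, flops_per_unit_per_cycle: int, clock_ghz: float) -> float:
    """Theoretical peak in Gflops: operations issued per cycle times cycles per second."""
    return units * flops_per_unit_per_cycle * clock_ghz


# Assumed configurations (not given in the article), picked to match its quoted peaks.
ASSUMED_CONFIGS = {
    # name: (units, flops per unit per cycle, clock in GHz)
    "Intel Core i7": (4, 8, 2.93),        # 4 cores, 4-wide SIMD add plus 4-wide SIMD multiply
    "Freescale MPC8641D": (2, 8, 1.25),   # 2 cores, 4-wide fused multiply-add
    "ATI RV770 GPU": (800, 2, 0.75),      # 800 stream processors, one multiply-add each
}

for name, (units, flops, clock) in ASSUMED_CONFIGS.items():
    print(f"{name}: {peak_gflops(units, flops, clock):.0f} Gflops")
```

Because the result depends only on issue width and clock, it says nothing about how much of that rate an algorithm can sustain once memory traffic is involved.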
While this is an interesting comparison, a more significant metric is the performance of each processing device on relevant algorithms. For embedded signal processing, particularly radar and signals intelligence systems, the 1K complex single-precision floating-point FFT is one of the most common and important benchmarks. This algorithm was measured at 65 Gflops on-chip (L1 to L1) using all four cores on the Core i7, 14 Gflops on-chip (L1 to L1) using both cores on the MPC8641D, and 305 Gflops on the RV770 GPU while streaming (DRAM to DRAM) (Figure 2).
1K complex single-precision floating-point FFT performance comparison between the three processors.
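FFT throughput is commonly converted to Gflops using a nominal count of 5·N·log2(N) floating-point operations for a complex FFT of length N. The sketch below assumes that convention (the article does not say which count was used) and shows what the quoted Gflops figures would imply in transforms per second.

```python
import math


def fft_flops(n: int) -> float:
    """Nominal flop count for one complex FFT of length n (assumed 5*n*log2(n) convention)."""
    return 5.0 * n * math.log2(n)


def fft_gflops(n: int, transforms_per_second: float) -> float:
    """Convert a measured transform rate into Gflops."""
    return fft_flops(n) * transforms_per_second / 1e9


def ffts_per_second(n: int, gflops: float) -> float:
    """Inverse conversion: transform rate implied by a Gflops figure."""
    return gflops * 1e9 / fft_flops(n)


N = 1024  # the 1K complex FFT used as the benchmark
quoted_gflops = {"Intel Core i7": 65, "Freescale MPC8641D": 14, "ATI RV770 GPU": 305}

for name, gflops in quoted_gflops.items():
    rate = ffts_per_second(N, gflops)
    print(f"{name}: {gflops} Gflops ~ {rate / 1e6:.2f} million 1K FFTs per second")
```

Under this convention a single 1K FFT counts as 51,200 operations, whatever the number of operations the implementation actually executes.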
A critical measurement for embedded applications, performance per watt is a measure of the energy efficiency of a particular computer architecture or computer hardware. Literally, it measures the rate of computation that can be delivered by a computer for every watt of power consumed. In these tests, performance per watt was calculated using 130W as the total chip power for each device. The Core i7 had 0.72 peak theoretical Gflops/watt, the MPC8641D had 0.50 Gflops/watt, and the RV770 GPU had a much higher 9.23 Gflops/watt using peak board power. This included power for the GPU, memory, interfaces and power conversion (Figure 3).
Peak theoretical Gflops/watt performance compared between the three processor types.
Similarly, computing the performance per watt for the 1K single-precision FFT resulted in 0.41 Gflops/watt for the MPC8641D, 0.5 Gflops/watt for the Core i7, and 2.35 Gflops/watt for the RV770 GPU. This is almost a 5x improvement in performance per watt over the Core i7, and almost a 6x improvement over the MPC8641D, even though the Core i7 is sourcing and sinking data from/to L1 cache and the GPU is streaming from/to off-chip DRAM (Figure 4).
1K single-precision FFT Gflops/watt compared.
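The performance-per-watt arithmetic can be checked with a short sketch. It uses only the Gflops figures and the 130 W power figure quoted in the article; the MPC8641D per-watt numbers are used as published rather than recomputed, and the function and variable names are illustrative.

```python
CHIP_POWER_W = 130.0  # power figure stated in the article


def gflops_per_watt(gflops: float, watts: float = CHIP_POWER_W) -> float:
    """Energy efficiency: rate of computation delivered per watt consumed."""
    return gflops / watts


# Gflops figures quoted in the article: (theoretical peak, 1K FFT)
quoted = {"Intel Core i7": (94, 65), "ATI RV770 GPU": (1200, 305)}

for name, (peak, fft) in quoted.items():
    print(f"{name}: peak {gflops_per_watt(peak):.2f}, "
          f"FFT {gflops_per_watt(fft):.2f} Gflops/watt")

# Ratios between the published FFT Gflops/watt figures.
rv770, core_i7, mpc8641d = 2.35, 0.50, 0.41
print(f"RV770 vs Core i7: {rv770 / core_i7:.1f}x")
print(f"RV770 vs MPC8641D: {rv770 / mpc8641d:.1f}x")
```

The two ratios come out at roughly 4.7 and 5.7, which is where the "almost 5x" and "almost 6x" comparisons come from.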
These performance differences are due to the basic structure and function of the processors. CPUs feature large caches and branch-prediction logic for decision-based code and more complex algorithms. They excel at flow control and disposition of data, for which they were principally architected. Conversely, GPUs are massively parallel array processors with limited branching performance. They operate on large amounts of data simultaneously because they have been architected to maximize arithmetic performance for graphical operations. They are compute- and memory-bandwidth-intensive machines with small amounts of cache, optimized for high throughput when computational kernels are applied to large datasets. This property makes GPUs very successful in compute-intensive signal and image processing systems.
The data referenced was generated using workstation cards in a benign environment. GPU performance was also measured in Mercury Computer Systems’ Sensor Stream Computing Platform (SSCP), a 6U VXS-based system that harnesses the processing power of GPUs for high-performance, data-parallel computing in rugged environments.
The SSCP can be configured with one or two VXS-GSC5200 boards, each of which can be configured with one or two GPU-based MXM modules, currently ATI 4870M or NVIDIA QUADRO 3600M. An air- VX6-200 Dual Dual-Core Xeon VXS Single-Board Computer is included in the SSCP for I/O and control. The peak theoretical performance of the SSCP system is shown in Figure 5.
Gflops performance of the SSCP increases linearly with an increase in clock speed.
A crucial feature of the SSCP is the ability to “tune” the power signature of the GPUs. This is particularly useful for on-platform applications, where peak algorithm performance per watt is extremely important. In the previous chart, the peak Gflops of the various SSCP configurations (single, dual and quad ATI 4870M, and single NVIDIA QUADRO 3600M) increase linearly with an increase in GPU clock rate. This is just one of the “knobs” users can use to optimize the performance per watt on a deployed system-by-system basis.
SSCP performance was also measured in terms of Gflops per chassis watt. The Gflops performance of the SSCP increases linearly with an increase in clock speed (Figure 5). Dividing the Gflops performance for a particular algorithm (1K complex single-precision FFT and fast convolution in this example) by the chassis power dissipated by a particular SSCP configuration yields a new metric, the Gflops per chassis watt. The power is full “draw-from-the-wall” watts; this includes fans, power supply inefficiencies and the x86 host processors, as well as the power for the GPU(s).
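As a rough sketch of how the clock-rate knob might be evaluated, the following computes Gflops per chassis watt across several clock settings. It assumes algorithm Gflops scale linearly with GPU clock, as described above; the reference Gflops, clock rates and wall-power readings are hypothetical placeholders, not SSCP measurements.

```python
def scaled_gflops(ref_gflops: float, ref_clock_mhz: float, clock_mhz: float) -> float:
    """Linear scaling of algorithm Gflops with GPU clock rate."""
    return ref_gflops * clock_mhz / ref_clock_mhz


def gflops_per_chassis_watt(gflops: float, chassis_watts: float) -> float:
    """Chassis watts are wall-draw watts: fans, supply losses, host CPUs and GPU(s)."""
    return gflops / chassis_watts


# Hypothetical values for illustration only.
REF_GFLOPS, REF_CLOCK_MHZ = 200.0, 500.0
wall_power_by_clock = {300: 250.0, 400: 275.0, 500: 360.0}  # clock in MHz -> wall watts

efficiency = {
    clock: gflops_per_chassis_watt(
        scaled_gflops(REF_GFLOPS, REF_CLOCK_MHZ, clock), watts
    )
    for clock, watts in wall_power_by_clock.items()
}
best_clock = max(efficiency, key=efficiency.get)

for clock, value in efficiency.items():
    print(f"{clock} MHz: {value:.3f} Gflops per chassis watt")
print(f"Most efficient setting: {best_clock} MHz")
```

With these placeholder readings the middle clock setting is the most efficient, which illustrates why the highest clock rate is not automatically the best choice when wall power rises faster than throughput.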
The ATI RV770 GPU significantly outperforms both the Intel Core i7 and the MPC8641D processors in terms of Gflops and Gflops/watt for peak and FFT/fast convolution metrics. Even more dramatically, the RV770 running Mercury image processing libraries shows improvements of two orders of magnitude over quad-core Xeon processors running IPP in areas such as image formation and analysis for persistent surveillance. Mercury’s hardware design permits rapid technical insertion for its GPU-based products.
The new generation of mobile defense platforms for C4ISR applications requires more optimization for size, weight and power (SWaP) than ever before. As seen throughout this article, GPUs are increasingly valuable components of almost all embedded systems that require fast, standards-based processing of display data. Powerful GPU-based solutions like those from Mercury address these challenges by combining expertise in embedded computing, innovation in high-performance signal and image processing for SWaP-constrained systems, and ruggedized designs. Thanks to a modular approach to deployed GPUs, Mercury is able to upgrade to new GPU technology rapidly. Mercury is currently validating new, state-of-the-art GPUs from both NVIDIA and ATI for its OpenVPX product line. These GPUs have already shown dramatically improved peak performance, performance per watt and algorithmic scope.
Mercury Computer Systems