The Evolution of DSP Processors
Presented to: CS152, University of California at Berkeley November 14, 1997 Jeff Bier Berkeley Design Technology, Inc. www.bdti.com bier@bdti.com (510) 665-1600
Copyright 1997 Berkeley Design Technology, Inc.
Page 1 of 35 11/17/97 1997 Berkeley Design Technology, Inc.
Outline
- DSP applications
- Digital filtering as a motivating problem
- The first generation of DSPs, with an example
- Comparison of DSP processors to general-purpose processors
- DSP evolution continues... later-generation DSPs and alternatives
- Conclusions
Who Cares?
- DSP is a key enabling technology for many types of electronic products
- DSP-intensive tasks are the performance bottleneck in many computer applications today
- Computational demands of DSP-intensive tasks are increasing very rapidly
- In many embedded applications, general-purpose microprocessors are not competitive with DSP-oriented processors today
- 1997 market for DSP processors: $3 billion
Example DSP Applications
- Digital cellular phones
- Automated inspection
- Vehicle collision avoidance
- Voice over Internet
- Motor control
- Consumer audio
- Voice mail
- Navigation equipment
- Audio production
- Videoconferencing
- Pagers
- Music synthesis, effects
- Satellite communications
- Seismic analysis
- Secure communications
- Tapeless answering machines
- Sonar
- Cordless phones
- Digital cameras
- Modems (POTS, ISDN, cable,...)
- Noise cancellation
- Medical ultrasound
- Patient monitoring
- Radar
Today's DSP "Killer Apps"
In terms of dollar volume, the biggest markets for DSP processors today include:
- Digital cellular telephony
- Pagers and other wireless systems
- Modems
- Disk drive servo control

Most demand good performance
All demand low cost
Trends are towards better support for these (and similar) major applications.
DSP Tasks for Microprocessors
- Speech and audio compression
- Filtering
- Modulation and demodulation
- Error correction coding and decoding
- Servo control
- Audio processing (e.g., surround sound, noise reduction, equalization, sample rate conversion, echo cancellation)
- Signaling (e.g., DTMF detection)
- Speech recognition
- Signal synthesis (e.g., music, speech synthesis)
What Do DSP Processors Need to Do Well?
Most DSP tasks require:
- Repetitive numeric computations
- Attention to numeric fidelity
- High memory bandwidth, mostly via array accesses
- Real-time processing

Processors must perform these tasks efficiently while minimizing:
- Cost
- Power
- Memory use
- Development time
FIR Filtering: A Motivating Problem
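[Figure: tapped delay line. The input x[n] passes through a chain of M unit delays (D), giving x[n-1], ..., x[n-M]. Each sample is multiplied by its coefficient h[0], h[1], ..., h[M] and the products are summed to form y[n]. One multiply-and-sum stage is marked "a tap".]

Each tap (M+1 taps total) nominally requires:
- Two data fetches
- Multiply
- Accumulate
- Memory write-back to update delay line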
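The per-tap work listed above can be sketched in C. This is a minimal illustration, not code from the presentation: the name fir_step, the 16-bit sample type, and the 64-bit accumulator are assumptions made for the example. Each loop pass performs the two data fetches, the multiply, the accumulate, and the write-back that shifts the delay line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: compute one output sample y[n] of an FIR filter.
 * h[0..ntaps-1] are the coefficients. On entry delay[k] holds x[n-1-k];
 * on return delay[k] holds x[n-k], ready for the next call. */
int64_t fir_step(const int16_t *h, int16_t *delay, size_t ntaps, int16_t x_n)
{
    assert(h != NULL && delay != NULL && ntaps > 0);
    int64_t acc = 0;
    /* Walk from the oldest sample to the newest so that each slot can be
     * overwritten by its younger neighbor once it has been used. */
    for (size_t k = ntaps - 1; k > 0; k--) {
        int16_t coeff = h[k];             /* data fetch 1: coefficient  */
        int16_t sample = delay[k - 1];    /* data fetch 2: x[n-k]       */
        acc += (int32_t)coeff * sample;   /* multiply and accumulate    */
        delay[k] = sample;                /* write-back: shift the line */
    }
    acc += (int32_t)h[0] * x_n;           /* newest tap uses the new input */
    delay[0] = x_n;
    return acc;
}
```

Calling fir_step once per input sample, with the same delay array each time, produces the filter output one sample at a time. On a processor without DSP features, every commented step in the loop body costs at least one instruction.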
FIR Filter on Von Neumann Architecture
loop: mov *r0,x0
      mov *r1,y0
      mpy x0,y0,a
      add a,b
      mov y0,*r2
      inc r0
      inc r1
      inc r2
      dec ctr
      tst ctr
      jnz loop

[Figure: Processor and Memory blocks]
Problems: Bus / memory bandwidth bottleneck, control code overhead
First Generation DSP (1982): Texas Instruments TMS32010
- 16-bit fixed-point
- Harvard architecture
- Accumulator
- Specialized instruction set
- 390 ns MAC time (228 ns today)

[Figure: block diagram with Program Memory, Data Memory, a Data or Program Bus, and a Data Path containing a Register, Mult., ALU, and Acc. (P register, shifters not shown)]
TMS32010 FIR Filter Code
Here X4, H4, etc. are direct (absolute) memory addresses:

    LT  X4    ; Load T with x(n-4)
    MPY H4    ; P = H4*X4
    LTD X3    ; Load T with x(n-3); x(n-4) = x(n-3);
              ; Acc = Acc + P
    MPY H3    ; P = H3*X3
    LTD X2
    MPY H2
    etc.
Two instructions per tap, but requires unrolling
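The listing works because LTD does three things in one instruction: it loads T, shifts the delay line by one slot, and adds the previous product into the accumulator. A rough C model of that register behavior follows. The struct, the function names, and the three-tap example are hypothetical, and the closing APAC (add P to the accumulator) is an assumption about what the listing's "etc." ends with.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative register model (names are hypothetical). */
typedef struct {
    int16_t t;    /* T register: multiplier input */
    int32_t p;    /* P register: product          */
    int32_t acc;  /* accumulator                  */
} dsp_regs;

/* LT: load T from data memory. */
static void lt(dsp_regs *r, const int16_t *mem, int addr) { r->t = mem[addr]; }

/* MPY: P = T * mem[addr]. */
static void mpy(dsp_regs *r, const int16_t *mem, int addr)
{
    r->p = (int32_t)r->t * mem[addr];
}

/* LTD: load T, copy the word to the next higher address (delay-line
 * shift), and add the previous product into the accumulator. */
static void ltd(dsp_regs *r, int16_t *mem, int addr)
{
    r->t = mem[addr];
    mem[addr + 1] = mem[addr];
    r->acc += r->p;
}

/* APAC: add the last product into the accumulator. */
static void apac(dsp_regs *r) { r->acc += r->p; }

/* Unrolled 3-tap filter in the style of the listing.
 * x[0..2] hold x(n), x(n-1), x(n-2); h[0..2] are the coefficients. */
int32_t fir3_unrolled(int16_t *x, const int16_t *h)
{
    assert(x != NULL && h != NULL);
    dsp_regs r = {0, 0, 0};
    lt(&r, x, 2);   mpy(&r, h, 2);  /* P = h2 * x(n-2)                  */
    ltd(&r, x, 1);  mpy(&r, h, 1);  /* shift, Acc += P, P = h1 * x(n-1) */
    ltd(&r, x, 0);  mpy(&r, h, 0);  /* shift, Acc += P, P = h0 * x(n)   */
    apac(&r);                       /* Acc += final product             */
    return r.acc;
}
```

After the call the samples have each moved one slot toward the old end of the line, so the caller only needs to store the next input in x[0] before filtering again.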
Features Common to Most DSP Processors
- Data path configured for DSP
- Specialized instruction set
- Multiple memory banks and buses
- Specialized addressing modes
- Specialized execution control
Data Path
DSP Processor
- Specialized hardware performs all key arithmetic operations in 1 cycle.
- Hardware support for managing numeric fidelity:
  - Shifters
  - Guard bits
  - Saturation

General-Purpose Processor
- Multiplies often take >1 cycle
- Shifts often take >1 cycle
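Guard bits and saturation can be illustrated with a short C sketch. It assumes 16-bit operands and models the guard bits with a 64-bit accumulator; the function names are hypothetical, and real DSP data paths provide these features in hardware rather than through explicit comparisons.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Saturate a wide accumulator to the 32-bit range instead of wrapping. */
int32_t saturate32(int64_t acc)
{
    if (acc > INT32_MAX) return INT32_MAX;
    if (acc < INT32_MIN) return INT32_MIN;
    return (int32_t)acc;
}

/* Dot product of 16-bit vectors. The accumulator is wider than a single
 * 32-bit product; those extra high-order bits play the role of guard
 * bits, so intermediate sums may exceed 32 bits without wrapping. The
 * result is saturated only once, at the end. */
int32_t dot16_sat(const int16_t *a, const int16_t *b, size_t n)
{
    assert(a != NULL && b != NULL);
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return saturate32(acc);
}
```

Without the wide accumulator, a sum that briefly exceeds the 32-bit range would wrap and corrupt the result even if later terms brought it back in range. Without saturation, an out-of-range final value would wrap to the opposite sign instead of clipping.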
Instruction Set

DSP Processor
- Specialized, complex instructions
- Multiple operations per instruction

    mac x0,y0,a   x:(r0)+,x0   y:(r4)+,y0

General-Purpose Processor
- General-purpose instructions
- Typically only one operation per instruction

    mov *r0,x0
    mov *r1,y0
    mpy x0,y0,a
    add a,b
    mov y0,*r2
    inc r0
    inc r1
Memory Architecture

DSP Processor
- Harvard architecture
- 2-4 memory accesses/cycle
- No caches; on-chip SRAM

General-Purpose Processor
- Von Neumann architecture
- Typically 1 access/cycle
- May use caches

[Figure: DSP processor with separate Program Memory and Data Memory; general-purpose processor with a single Memory]
Addressing

DSP Processor
- Dedicated address generation units
- Specialized addressing modes; e.g.:
  - Autoincrement
  - Modulo (circular)
  - Bit-reversed (for FFT)
- Good immediate data support

General-Purpose Processor
- Often, no separate address generation unit
- General-purpose addressing modes
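Modulo and bit-reversed addressing are easiest to see as index computations. The C sketch below shows what a processor without these modes has to compute in software; the function names and argument conventions are hypothetical. A DSP address generation unit produces the same indices as a side effect of the memory access.

```c
#include <assert.h>

/* Modulo (circular) post-increment: advance idx by step and wrap within
 * a buffer of len elements. Assumes idx < len and step <= len. */
unsigned modulo_inc(unsigned idx, unsigned step, unsigned len)
{
    assert(len > 0 && idx < len && step <= len);
    unsigned next = idx + step;
    return next >= len ? next - len : next;
}

/* Bit-reversed index for an FFT of size 2^bits: reverse the order of
 * the low `bits` bits of idx. */
unsigned bit_reverse(unsigned idx, unsigned bits)
{
    assert(bits > 0 && bits <= 16);
    unsigned rev = 0;
    for (unsigned i = 0; i < bits; i++) {
        rev = (rev << 1) | (idx & 1u);  /* move the lowest bit of idx into rev */
        idx >>= 1;
    }
    return rev;
}
```

With modulo addressing, a delay line can stay in place as a circular buffer while only the pointer moves, so the per-tap write-back is no longer needed. Bit-reversed indices give the input or output ordering of a radix-2 FFT.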
Execution Control
- Hardware support for fast looping
- Fast interrupts for I/O handling
- Real-time debugging support
Specialized Peripherals for DSPs
- Synchronous serial ports
- Parallel ports
- Timers
- On-chip A/D, D/A converters
- Host ports
- Bit I/O ports
- On-chip DMA controller
- Clock generators
On-chip peripherals often designed for background operation, even when core is powered down.
Second Generation DSPs (1987) Example: Motorola DSP56001
- 24-bit data, instructions
- 3 memory spaces (X, Y, P)
- Parallel moves
- Single- and multi-instruction hardware loops
- Modulo addressing
- 75 ns MAC (21 ns today)

    move #Xaddr,r0
    move #Haddr,r4
    rep  #Ntaps
    mac  x0,y0,a   x:(r0)+,x0   y:(r4)+,y0

[Figure: Data Path connected to P Memory, X Memory, and Y Memory]
Other second-generation processors: AT&T DSP16A, Analog Devices ADSP-2100, Texas Instruments TMS320C50
Low-Cost General-Purpose Processor vs. Low-Cost DSP
[Figure: bar chart of speed in BDTImarks (vertical axis 0 to 16) comparing a GPP (ARM7TDMI, 40 MIPS) with a DSP (TMS320C50, 40 MIPS); one bar carries the value label 8.]
Third Generation DSPs (1995) Examples: Motorola DSP56301, TI TMS320C541
- Enhanced conventional DSP architectures
- 3.0 or 3.3 volts
- More on-chip memory
- Application-specific function units in data path or as co-processors
- More sophisticated debugging and application development tools
- DSP cores (Pine and Oak from DSP Group, cDSP from TI)
- 20 ns MAC (10 ns today)
Architectural innovation mostly limited to adding application-specific function units and miscellaneous minor refinements.
Also, multiple processors/chip (TI TMS320C80, Motorola MC68356)
Fourth Generation (1997-1998) Examples: TI TMS320C6201, Intel Pentium with MMX
Today's top DSP performers adopt architectures far different from conventional DSP processor designs.
- Blazing clock speeds and superscalar architectures make some general-purpose processors, such as the PowerPC 604e, top floating-point performers, despite lack of many DSP features
- Multimedia SIMD extensions, such as MMX, offer strong fixed-point performance on general-purpose processors
  - But strong DSP tools for general-purpose processors are lacking
- VLIW-like architectures, such as that of the TI TMS320C6201, achieve top performance via high parallelism and increased clock speeds
VLIW
Very long instruction word (VLIW) architectures are garnering increased attention for DSP applications. Notable recent introductions include Texas Instruments' TMS320C62xx and Philips' TM1000. Major features:
- Multiple independent operations per cycle
- Packed into a single large "instruction" or "packet"
- More regular, orthogonal, RISC-like operations
- Large, uniform register sets

[Figure: block diagram with 16Kx32 On-Chip Program Memory feeding a Dispatch Unit (label: 256); Data Path 1 with Register File A and units L1, S1, M1, D1; Data Path 2 with Register File B and units L2, S2, M2, D2; 32Kx16 On-Chip Data Memory (labels: 32, 32).]
VLIW
Advantages:
- Increased performance
- More regular architectures
  - Potentially easier to program; better compiler targets
- Scalable?

Disadvantages:
- New kinds of programming/compiler complexity
- Code size bloat
  - High program memory bandwidth requirements
- High power consumption
General-Purpose Processors are Catching Up
Go where the cycles are...

General-purpose processors are increasingly adding DSP capabilities via a variety of mechanisms:
- Add single-instruction, multiple-data instruction set extensions (e.g., MMX Pentium)
- Integrate a fixed-point DSP processor-like data path and related resources with an existing µC/µP core (e.g., Hitachi SH-DSP)
- Add a DSP co-processor to an existing µC/µP core (e.g., ARM Piccolo)
- Create an all-new, hybrid architecture (e.g., Siemens TriCore)
The General-Purpose Processor Threat
High-performance general-purpose processors for PCs and workstations are increasingly suitable for some DSP applications.
E.g., Intel MMX Pentium, Motorola/IBM PowerPC 604e
These processors achieve excellent to outstanding floating- and/or fixed-point DSP performance via:
- Very high clock rates (200-500 MHz)
- Superscalar (multi-issue) architectures
- Single-cycle multiplication and arithmetic ops.
- Good memory bandwidth
- Branch prediction
- In some cases, single-instruction, multiple-data (SIMD) ops.
High-Performance General-Purpose Processors
Advantages:
- Strong DSP performance
- Already present in PCs
- Strong tool support for the major processors
- Cost-performance can rival that of floating-point DSPs

Disadvantages:
- Lack of execution timing predictability
- Difficulty of developing optimized DSP code
- Limited DSP-oriented tool support
- High power consumption
- Cost-performance does not approach that of fixed-point DSPs
Real-Time Suitability
The most important DSP applications are real-time applications.
Many of these are hard real-time applications: failure to meet a real-time deadline creates a serious malfunction.
High-performance GPPs make heavy use of dynamic features:
Caches, branch prediction, dynamic superscalar execution, data-dependent instruction execution times, etc.
These features result in timing behavior that appears to be stochastic.
Example of Optimization Challenge
Vector addition on PowerPC 604e:

@vec_add_loop:
    lfsu  fpTemp1,4(rAAddr)        # Load A data, ptr. update
    lfsu  fpTemp2,4(rBAddr)        # Load B data, ptr. update
    fadds fpSum,fpTemp1,fpTemp2    # Perform add operation
    stfsu fpSum,4(rCAddr)          # Store sum, ptr. update
    bdnz  @vec_add_loop            # loop
Q: How many instruction cycles per iteration?
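For reference, the loop above computes the following, written here as a C sketch with hypothetical names. It shows only what the five instructions do, not how long they take. The slide's question stays open because the cycle count depends on how the processor issues and overlaps the instructions, which the source code does not reveal.

```c
#include <assert.h>
#include <stddef.h>

/* c[i] = a[i] + b[i] for i in [0, n). One pass of the loop corresponds
 * to one trip through the five-instruction assembly loop. */
void vec_add(const float *a, const float *b, float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        float t1 = a[i];       /* lfsu:  load A data, pointer update */
        float t2 = b[i];       /* lfsu:  load B data, pointer update */
        float sum = t1 + t2;   /* fadds: single-precision add        */
        c[i] = sum;            /* stfsu: store sum, pointer update   */
    }                          /* bdnz:  decrement count and branch  */
}
```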
TMS320C6201 FIR Filter Inner Loop
LOOP:      ADD   .L1   A0,A3,A0    ; Sum0 += P0
||         ADD   .L2   B1,B7,B1    ; Sum1 += P1
||         MPYHL .M1X  A2,B2,A3    ; P0 = h(i)*s(i)
||         MPYLH .M2X  A2,B2,B7    ; P1 = h(i+1)*s(i+1)
||         LDW   .D2   *B4++,B2    ; h(i) & h(i+1)
||         LDW   .D1   *A7--,A2    ; s(i) & s(i+1)
|| [B0]    ADD   .S2   -1,B0,B0    ; Cond. dec loop counter
|| [B0]    B     .S1   LOOP        ; Cond. Branch to LOOP
Latencies:
Multiply: 2 cycles; load: 5 cycles; branch: 6 cycles
Predicated execution for all instructions.
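The arithmetic in this inner loop amounts to a dot product with two running sums, handling two taps per pass. The C sketch below is a simplified model under stated assumptions: the function name is hypothetical, both arrays are indexed in the same direction as the listing's comments suggest, and n is even. It does not model the packed 16-bit halves that each LDW fetches, the latencies, or the overlap of iterations that lets all eight instructions issue together.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Two-accumulator dot product: each pass handles taps i and i+1,
 * mirroring the two multiplies and two adds in the listing.
 * Assumes n is even and that the sums fit in 32 bits. */
int32_t fir_dual_mac(const int16_t *h, const int16_t *s, size_t n)
{
    assert(h != NULL && s != NULL && n % 2 == 0);
    int32_t sum0 = 0, sum1 = 0;
    for (size_t i = 0; i < n; i += 2) {
        int32_t p0 = (int32_t)h[i] * s[i];          /* P0 = h(i)*s(i)     */
        int32_t p1 = (int32_t)h[i + 1] * s[i + 1];  /* P1 = h(i+1)*s(i+1) */
        sum0 += p0;                                 /* Sum0 += P0         */
        sum1 += p1;                                 /* Sum1 += P1         */
    }
    return sum0 + sum1;  /* combine the partial sums after the loop */
}
```

Keeping two independent sums is what allows both multiply-add chains to proceed in parallel; a single accumulator would serialize them.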
MMX Pentium FIR Filter Inner Loop
loop1:
    pmaddwd mm0, COEFaddr[edi]      ; 4 MADs (reg, mem)
    paddd   mm7, mm2                ; Complete earlier accum.
    pmaddwd mm1, COEFaddr[edi+8]    ; next 4 MADs
    paddd   mm7, mm3                ; Complete earlier accum.
    movq    mm2, [esi+16]           ; Load next 4 data items
    movq    mm3, [esi+24]           ; Load next 4 data items
    paddd   mm7, mm0                ; Complete earlier accum.
    pmaddwd mm2, COEFaddr[edi+16]   ; Again, with feeling (unrolled to avoid load-related stall)
    paddd   mm7, mm1
    pmaddwd mm3, COEFaddr[edi+24]
    movq    mm0, [esi+32]
    movq    mm1, [esi+40]
    add     edi, 32                 ; Update coeff. ptr.
    add     esi, 32                 ; Update data ptr.
    dec     ecx                     ; Decrement loop count.
    jnz     loop1                   ; Branch to top of loop
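The loop above is built on two MMX instructions. PMADDWD multiplies four pairs of signed 16-bit values and sums adjacent products into two 32-bit results; PADDD adds two packed 32-bit values. The C sketch below models them on plain arrays. The function names are hypothetical, and the final addition of the two halves is an assumption about what happens after the loop, since the listing does not show it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of PMADDWD on one 64-bit operand pair: four signed 16-bit
 * multiplies, with adjacent products summed into two 32-bit results.
 * The single case where hardware wraps (both values of a pair equal to
 * -32768 in both operands) is not modeled here. */
void pmaddwd_model(const int16_t a[4], const int16_t b[4], int32_t out[2])
{
    assert(a != NULL && b != NULL && out != NULL);
    out[0] = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
    out[1] = (int32_t)a[2] * b[2] + (int32_t)a[3] * b[3];
}

/* Dot product in the style of the listing: four MACs per PMADDWD, with
 * a two-element accumulator playing the role of mm7.
 * Assumes n is a multiple of 4 and that the sums fit in 32 bits. */
int32_t dot_mmx_style(const int16_t *coef, const int16_t *data, size_t n)
{
    assert(coef != NULL && data != NULL && n % 4 == 0);
    int32_t acc[2] = {0, 0};
    for (size_t i = 0; i < n; i += 4) {
        int32_t prod[2];
        pmaddwd_model(&coef[i], &data[i], prod);
        acc[0] += prod[0];   /* PADDD, low half  */
        acc[1] += prod[1];   /* PADDD, high half */
    }
    return acc[0] + acc[1];  /* combine the halves after the loop */
}
```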
MMX Pentium FIR Filter Inner Loop
Latencies:
Multiply: 3 cycles (not 3 instructions)
Superscalar execution
- Up to two instructions/cycle
- Can pair one "simple" MMX instruction with another simple or complex MMX instruction or non-MMX integer instruction
- Complicated pairing rules

Branch prediction

Throughput: ~2 16-bit MACs/cycle
Processor DSP Speed
Bar chart data (vertical axis: BDTImarks, 0 to 80):

Generation   Year   Processor     MIPS   BDTImarks
First        1982   TMS320C10        5     0.5
Second       1987   DSP56001        13     4
Third        1995   TMS320C54x      50    13
Fourth       1997   TMS320C6201   1200    77
Fourth       1997   MMX Pentium    466    49
Conclusions
- DSP processor performance has increased by a factor of about 150x over the past 15 years (~40%/year)
- Processor architectures for DSP will be increasingly specialized for applications, especially communications applications
- General-purpose processors will become viable for many DSP applications
- Users of processors for DSP will have an expanding array of choices
- Selecting processors requires a careful, application-specific analysis
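The first conclusion can be checked with simple compounding. In the speed chart, the fastest fourth-generation score (77 BDTImarks) divided by the first-generation score (0.5) gives 154, and 40% growth per year compounded over 15 years gives a factor of roughly 155. A small C sketch of that arithmetic, with a hypothetical function name:

```c
#include <assert.h>

/* Total growth factor after `years` of compounding at `rate` per year
 * (rate = 0.40 means 40% per year). */
double compound_growth(double rate, int years)
{
    assert(years >= 0);
    double total = 1.0;
    for (int i = 0; i < years; i++)
        total *= 1.0 + rate;
    return total;
}
```

Calling compound_growth(0.40, 15) gives a factor close to the 150x quoted above.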
For More Information
http://www.bdti.com
    Collection of BDTI's papers on DSP processors, tools, and benchmarking.
http://www.eg3.com/dsp
    Links to other good DSP sites.
Microprocessor Report
    For info on newer DSP processors.
DSP Processor Fundamentals, BDTI
    Textbook on DSP processors
IEEE Spectrum, July 1996
    Article on DSP benchmarks
Embedded Systems Prog., October 1996
    Article on choosing a DSP processor

Or, Join BDTI...
We're hiring (see www.bdti.com)