DSP
Architectures
 A Digital Signal Processing System
Analog                                   D                    Analog
         Antialiasing    Sample                    Reconst.
Signal      Filter      and Hold   A/D   S   D/A    Filter
                                                              Signal
  in                                                           out
                                         P
A perspective of the Digital Signal Processing problem
                                Application areas
       Medical     Radar        Speech               Seismic        Image
                                           •••
                        Digital signal processing theory
          Theoretical
           problem             Basic functions
          modelling
                  Algorithms
                                                   Architechtures
                                   Processor
         Implementation         instruction sets
                               and/or hardware
                                   functions
                           Component technology
     DSP APPLICATION CHARACTERISTICS
APPLICATION REQUIREMENT   PROCESSOR ATTRIBUTES
REAL-TIME PROCESSING      HIGH SPEED, HIGH THROUGHPUT
LARGE ARRAY OF DATA       INSTRUCTIONS TO MOVE AND
                          PROCESS LARGE DATA ARRAYS
ALGORITHM INTENSIVE       FAST MATHEMATICAL
                          COMPUTATIONS, SINGLE CYCLE
                          DSP OPERATIONS (MACD)
SYSTEM FLEXIBILITY        GENERAL PURPOSE
                          PROGRAMMABILITY, EPROM,
                          MC/MP MODE
        Different approaches to hardware
                  implementation
1. HIGH SPEED GENERAL PURPOSE COMPUTERS
 Programmable        Expensive
 Can be configured for    Complex control
 different applications   I/O overheads
2. CUSTOM-DESIGNED VLSI COMPONENTS
 Efficient design   Application specific
 Large throughputs High development cost
3. GENERAL PURPOSE DIGITAL SIGNAL PROCESSORS
 Combine the Programmability & Control features of general
 purpose computers and the Architectural innovations of
 special purpose chips.
  GOALS: HIGH SPEED, LOW POWER AND LOW COST
   General purpose computers
1. Flexible
2.    Suitable for Internet and
  Multimedia application
3.    Software Intensive
4.    Slow for high speed application
5.    Too bulky
6.    Power hungry
    Why are conventional Processors not
            suitable for DSP?
      Caches are a waste of chip area
      Small register files force lots of memory
accesses
              - these are different from cache since
              these are program managed
      Complex instruction issue logic, branch
prediction,               speculation etc. are not
needed for DSP
      Not enough ALU function
             Data Processing vs
             Signal Processing
• General-purpose microprocessors are designed
  primarily for Data Processing.
   – The primary burden is Data Read/Write
• Digital Signal Processors are Microprocessors
  specifically designed for Signal Processing.
   – The primary burden is Mathematical operation
• DSP architecture therefore incorporates certain
  features not found in general-purpose P’s.
             DSP Requirements
• Emphasis is on mathematical operations rather than data
  manipulation operations like word processing, database
  management etc.
  – Design is optimized for DSP algorithms which   implement
    FIR filter, FFT generator etc.
• Processing is real-time, i.e. the input signal comes
  continuously, and the output signal is also produced
  continuously as the input is acquired.
• Dominant mathematical operation is Multiply and
  Accumulate (MAC), on separate inputs in parallel.
    Digital Signal Processor features
   Caters to high arithmetic demands
   Real time operation
   Analog input / output
   Large number of functional units for a
    given size
   Small control Logic
Typical MAC Execution Cycle
•   Obtain a sample of the Input signal
•   Move the input sample into the input buffer
•   Fetch the co-efficient from internal buffer
•   Multiply the input sample by the co-efficient
•   Add the product to the Accumulator
•   Move the output to the output buffer
•   Send it out as a sample of the output signal
              MAC Execution Hardware
                              Data Read
                Program        Address
                Counter
                              Data Write
                               Address
                          *
 Program/                                  Data
Coefficient                                Memory
  Memory
                CPU           ACC
   Architecture of Digital Signal
            Processors
• General-purpose processors are based on the Von
  Neumann architecture (single memory bank and
  processor accesses this memory bank thro’ single
  set of address and data lines)
• Harvard architecture commonly used in DSP
  processors
  – Separate Data and Program memories (two memory
    banks)
  – Separate Address Buses for Data and Program memories
   Additional Features in a DSP
            Processor
• Instruction Cache and Pipelined processor
  as in any modern microprocessor, but no
  Data Cache
• Separate ALU, Multiplier and Shifter,
  connected through multiple internal data
  buses, enabling fast MAC operations
        DSP, CISC and RISC
• DSP Processors can’t be called truly as
  CISC or RISC-type of processors
• Some features present in a RISC processor
  may exist. However, DSP processors are
  “tuned” towards operations encountered in
  signal processing applications
    DSP IMPLEMENTATION APPROACHES
Important desirable characteristics
   Adequate word length
   Fast multiply & accumulate
   High speed RAM
   Fast Coefficient table addressing
   Fast new sample fetch mechanism
DSP functions implemented with
           IC chips
Issues:
Speed Architectural features
Accuracy Register lengths and    floating
 point capability
Cost   Advances in VLSI techniques
      GENERAL PURPOSE DSP FEATURES
1. PARALLELISM:        Multiple Functional Units
   Multiple Buses
   Multiple Memories
2. PIPELINING
3. HARDWARE MULTIPLIERS AND OTHER         ARITHMETIC
   FUNCTIONS
4. ON-CHIP AND CACHE MEMORIES
5. A VARIETY OF ADDRESSING MODES
7. INSTRUCTIONS THAT PACK SEVERAL OPERATIONS
8. ZERO-OVERHEAD LOOPING
9. I/O FEATURES SUCH AS INTERRUPT, SERIAL      I/O, DMA
10.      OTHER CONTROL FUNCTIONS SUCH AS WAIT        STATES
         A second order FIR filter
x(n)                  x(n-1)                   x(n-2)
         Delay                     Delay
          h(0)                      h(1)                 h(2)
                                                        y(n)
                        +                       +
       y(n) = h(0)x(n) + h(1)x(n -1) + h(2)x(n-2)
          x(n+1)
            x(n)      Delay               h(0)
           x(n-1)                         h(1)
                      Delay
  ar1      x(n-2)               ar2       h(2)
                           MAC
                            y(n)
Organization of signal samples and filter coefficients
   for a second order FIR filter implementation
  An Nth order FIR filter implementation
A[0]                            X[n]
A[1]                            X[n-1]
A[2]                            X[n-2]
                 *
 ••                             ••
 ••              P              ••
 ••                             ••
A[N-1]                +
                               X[n-N+1]
                     ACC       y[n]
Coefficient                    Data
Memory                         Memory
              FIR Filter pseudo-code
  Load loop count
  Initialize coefficient and data addr      regs
  Zero Acc and P registers
LOOP: Pnew      = A[i] . X[n-i]
  Accnew = ACCold + Pold
  Decrement coefficient and data addr regs
  X[n-i]    X[n-i-1] {for next iteration}
  Decr loop count
  BNZ LOOP
  Acc        Y[n]
            A Typical DSP Architecture
 
                             PM Data           DM Data
                PM Address    Address           Address    DM Address
    Program
    Memory                   Generator         Generator                   Data
     (PM)                                                                Memory
                                                                          (DM)
     Instruc-                    Program Sequencer
     tions &                 Instruction Cache                             Data
    secondary   PM Data                                     DM Data
                                                                           only
       data
                                     Registers                   DMA Bus
                                                                            I/O
                                    Multiplier                           Controller
                                                                          (DMA)
                                         ALU
                                         Shifter
                                                                        Input/Output
               Salient Features
• REPEAT-MAC instruction
  -     Performs auto-increment of both coefficient and data
  pointers
  -     Frees up program memory bus for fetching coefficients
• Circular buffer
  -    to manage data movement at the end of every     output
  computation
• Handling precision
  -     Accumulator guard bits
  -     Saturation mode
  -     Shifters (both right and left shift)
     Types of multipliers used
• Array multipliers
• Multipliers based on modified Booth’s
  algorithm
Product Computation Unit of a
  simple multiplier for 4-bit
  unsigned numbers X and Y
Multiplying X and Y: summation
 unit of the simple multiplier
Combinational array for Booth’s
   algorithm – Basic cell B
Array Multiplier for 4x4-bit
numbers using basic cell B
                     Arithmetic
Fixed point Vs      Floating point
Array indices, Loop Wider dynamic range
counters etc. frees user from scaling concerns
         Less sensitive to error accumulation
Overflow/underflow 50% slower for same
management technology
Error budget for    Higher Cost
word length growth Normalize after each      operation
  Mantissa round off (some      accuracy is traded)
Fixed point does not always limit performance:
e.g., for dynamic range of 50 to 60 dB, 12 -bit
quantization (step size of -72 dB) is more than
adequate. For Hi-fi audio with 80 dB dynamic
range, 16 bits (-96 dB) are more than adequate
              Overflow Management
SHIFT
 Left shift removes redundant sign bit after 2’s complement
 multiplication
 Right shift down scales numbers as word growth is detected
Unbiased rounding
 Prevents accumulation of a small dc bias from outputs which
 fall just half way between adjacent rounded values
Saturation Logic
     Sets the contents of register to maximize the
     value if overflow occurs
Block Floating Point
     Scaling logic + exponent register: If overflow
     condition of any point is detected, the entire
     array is rescaled downwards and the scaling is
     stored in the block exponent register.
                 SHIFTERS
- Scales numbers to prevent overflow/underflow
- Conversion between fixed point and floating point
- Many bits must be shifted in a single cycle to preserve
  single cycle computational speed (Barrel Shifter)
- Logical shift assumes unsigned data and fills with
  zeroes left or right
- Arithmetic shift scales numbers upwards (left) or
  downward (right)
zero fills     sign extend
- Normalization/de-normalization for block floating point
                      Memory
Traditional µPs       :     register to register
                            (limited memory bandwidth)
DSPs : memory to memory
            (higher memory bandwidth)
      upto six memory fetches in an inst. cycle
      Parallel memory banks: small, fast and simple
      memories.
Internal Vs External
Pincount limitation
Speed penalty              Off-chip bussing
      Internal busses are multiplexed to the outside.
      Expand only one memory off-chip.
          MEMORY ORGANISATION - I
          BASIC HARVARD ARCHITECTURE
             PROGRAM     DATA
             MEMORY     MEMORY
  MODIFICATION #1                MODIFICATION#2
PROGRAM                                  MULTI-PORT
              DATA          PROGRAM
  /DATA                                    DATA
             MEMORY         MEMORY
MEMORY                                    MEMORY
                    MEMORY ORGANISATION - II
                            MODIFICATION #3
                            PROGRAM/
              PROGRA                          DATA
                              DATA
              M CACHE                        MEMORY
                             MEMORY
                           MODIFICATION #4
             PROGRA          DATA             DATA
               M            MEMORY           MEMORY
             MEMORY
                           MODIFICATION #5
   I/O    PROGRAM                     DATA             DATA     DATA
                         DATA
PROGRAM   MEMORY                     MEMORY           MEMORY   MEMORY
                        MEMORY
MEMORY
            Addressing modes
Parallel memory   inst. must specify upto 3 memory
                  accesses
Number of bits required is very large
More memory, wider busses, more memory cycle
Solution:   register - indirect addressing modes
            Many addresses in one word inst.
            Only a few bits are required since register
            bank is small ; parallel hardware to update
            registers containing memory addresses.
      Addressing Modes (contd.)
•   Short immediate addressing mode
•   Short direct addressing mode
•   Memory mapped addressing mode
•   Circular buffer addressing mode
•   Bit-reversed addressing mode
           Instruction Level Parallelism
VLIW architecture
• Each instruction specifies several operations to
  be done in parallel
• Advantages      : Simple hardware
                      compilers can spot ILP
  easily
• Disadvantages : Little compatibilty between
                           generations
                     Explicit NOPs bloat code
       Super scalar architecture
• Hardware responsible for finding ILP in a
  sequential program
• Advantage    : Compatibility between
                    generations
• Disadvantage : Very complex hardware
 Explicitly Parallel Instruction Computing
                   (EPIC)
• Combines VLIW and super scalar
  architectures
• Instructions are grouped into 3 operating
  blocks and a template block
• Template block tells hardware if
  instructions can be executed in parallel
• Also gives information whether the block
  can be executed in parallel
          ILP versus Power
Increasing instructions / cycle
 Requires fewer cycles to execute a task
 Uses longer clock for same performance
 Uses lower supply voltage
 And hence uses less power
 However, too many functional units and too
 many transitions per clock cycle increase
 power consumption.
              Low Power architecture
 Power consumed by additional circuits vs. ability to
  lower clock rate while maintaining performance
 Circuits must be highly used
 Move complexity into software
 Voltage scaling : Reduce Vdd
 Clock gating     : Turn off clock when chip
          is not in use ( applies to
          sub-modules of chip also)
         VLIW is more suitable than super scalar
    for   low power
          - VLIW is smaller for same number of
    functional units
          - Compiler is better at finding parallelism
    than hardware
         Put multiple processors on chip rather than
          lots of functional units in one processor
         Helps in running independent tasks
       Improvement of Speed by
             Pipelining
• Processor speed can be enhanced by having separate
  hardware units for the different functional blocks,
  with buffers between the successive units.
  – The number of unit operations into which the instruction
    cycle of a processor can be divided for this purpose
    defines the number of stages in the pipeline.
  – A processor having an n-stage pipeline would have up to
    n instructions simultaneously being processed by the
    different functional units of the processor.
• Effective processor speed increases ideally by a
  factor equal to the number of pipelining stages.
A Four-stage Pipeline
    Data Dependency in Pipelining
If the input data for an instruction depends on the
outcome of the previous instruction, the Write cycle of
the previous instruction has to be over before the
Operate cycle of the next instruction can start. The
pipeline effectively idles through one instruction,
creating a bubble in the pipeline which persists for
several instructions.
     F1     D1     O1    W1
             F2     D2   idle      O2   W2
                    F3   idle      D3   O3   W3
                          Bubble
                           ends
                                   F4   D4   O4    W4
                           here
     Example of dependency
• A  3 + A; B  4 x A
  Can’t perform these two in parallel
• Another case: A = B + A; B = A – B; A =
  A – B (swapping without temp) ; examine
  how you can handle this.
Branch Dependency in Pipelining
A Branch instruction can cause a pipeline stall if the branch
is taken, as the next instruction has to be aborted in that
case. If I1 is an unconditional branch instruction, the next
Fetch cycle (F2) can start after D1. But if I1 is a conditional
branch instruction, F2 has to wait until O1 for the decision as
to whether the branch will be taken or not.
  F1     D1      O1   W1       branch instruction
        F2      D2     O2     W2 executed if branch is not taken
                                                 executed for
                F2      D2    O2     W2
                                             unconditional branch
                       F2     D2       O2    W2       for conditional
                                                     branch, if taken
        Pipeline in ADSP 219x
              Processors
6-Stage Instruction Pipeline with Single-cycle
  Computational Units:
• Look-Ahead: places the address of an instruction that is
  going to be executed several cycles down the road, on
  the program-memory address (PMA) bus
• Pre-fetch: Pre-fetches an instruction if the instruction
  is already in the instruction cache
• Fetch: Fetches the instruction that was “looked ahead”
  2 cycles ago
• Address-decode: Decoding of the DAG operand fields
  in the opcode in this cycle
• Decode: The second stage of the instruction decoding
  process, where the rest of the opcode is decoded
• Execute: Instruction is executed, status updated, results
  written to destinations
          Causes for Pipeline Stalls
 Memory block conflicts: If both instruction and data are to be
   fetched from the same block of memory, a stall is
   automatically inserted
 DAG usage immediately (or within 2 cycles) after
   initialization. e.g.
                                I2 = 0x1234;
                         AX0 = DM(I2,M2);
 Bus conflicts: Instructions which use the PMA/PMD buses for
   data transfer may cause bus conflict. e.g.
                                PM(I5,M7)=M3;
     Avoiding DAG-related Pipeline
                Stalls
• Note that
 –        I2 = 0x1234;
 –           I3 = 0x0001;
 –           I1 = 0x002;
 –        AX0 = DM(I2,M2);
will NOT cause a stall.
• Also, note that switching DAG register
banks (primary  secondary) immediately
before using them will NOT cause a stall.
                       TMS320C25 KEY FEATURES
                +5 v        GND
                                     DATA
                                               100 ns INSTRUCTION CYCLE TIME
INTERRUPTS    288 x 16    256 x 16
             DATA RAM      DATA/               128K-WORDS TOTAL MEMORY
                                      16
                         PROGRAM
                                      MULTI-    SPACE
                                      PROCESSOR
                                      INTERFACE
                                               THREE PARALLEL SHIFTERS
             4K x 16 PROGRAM ROM
                                     SERIAL    133 GENERAL PURPOSE AND DSP
                                     INTERFACE
                                                INSTRUCTIONS
                32-BIT ALU/ACC
                                               S/W UPWARD COMPATIBLE WITH
                                     ADDRESS   PREVIOUS FAMILY MEMBERS
                   16 x 16
                 MULTIPLIER          16        1.8u CMOS: 68-PIN PLCC / PGA
   TMS320C25 GENERAL PURPOSE FEATURES
                  COMPREHENSIVE INSTRUCTION SET-
                  133 INSTRUCTIONS INCLUDING
                         - NUMERICAL (34)
X=X-Y                    - LOGICAL (15)
                         - MEMORY MANAGEMENT (33)
                         - BRANCHES (20)
                         - PROGRAM/MODE CONTROL (31)
                  EXTENDED-PRECISION ARITHMETIC
                  SERIAL PORT (DOUBLE BUFFERED,
 BIT 16=0         STATIC)
             NO
                  MULTIPROCESSOR INTERFACES
                  (CONCURRENT DMA, GLOBAL DATA
                  MEMORY)
       YES
                  BLOCK MOVES (UP TO 10 M WORDS/SEC)
                  ON-CHIP TIMER
OUTPUT X
                  THREE EXTERNAL MASKABLE
                  INTERRUPTS
                  POWERDOWN MODE
TMS320C25 ALU
          DESIGN & OPERATION
       32-BIT ALU & ACCUMULATOR
       CARRY BIT FOR EXTENDED
       PRECISION
       OVERFLOW DETECTION &
       SATURATION
       SIGN EXTENSION OPTION
       0-16 BIT PARALLEL SHIFTER
       FOR LOADS AND ARITHMETIC
       OPS
       SHIFTERS ON PRODUCT
       REGISTER OUTPUT DATA
       0-7 BIT PARALLEL SHIFTER FOR
       ACCUMULATOR STORES
TMS320C25 - MULTIPLY INSTRUCTIONS II
                MAC   MPY data memory * program
                      memory & add past P-Reg to
                      ACC
                MACD MPY data memory * program
                     memory, add past P-Reg to
                     ACC, & move data memory
                SQRA Square data memory value &
                     add past P-Reg to ACC
                SQRS Square data memory value &
                     sub past P-Reg from ACC
            TMS320C25 SPECIAL PURPOSE FEATURES
Xi                                                SINGLE-CYCLE
      Z-1     Z-1     Z   -1
                                  Z -1
                                                  MULTIPLY/ACCUMULATE
                                                  MULTIPLY/
                                                  ACCUMULATE USING
                                                  EXTERNAL PROGRAM
                                                  MEMORY
                                                  REPEAT INSTRUCTION
                                             Yi
                                                 ADAPTIVE FILTERING
                                                  INSTRUCTIONS
     0-16 BIT SCALING SHIFTER (SIGNED            BIT-REVERSED
     OR UNSIGNED)                                 ADDRESSING
     OVERFLOW MANAGEMENT                         AUTOMATIC DATA-
           -SATURATION MODE                       MOVE IN MEMORY (Z-1)
           -BRANCH ON OVERFLOW
           -PRODUCT RIGHT SHIFT
TMS320C25 - HIGHER PERFORMANCE AT LESS CODE SPACE
xn
     Z-1       Z-1               Z-1                 Z-1
           x           x                 x                  x
                                                                      Yn
                                                                
                             N
                     Yn =   b     K   X(n-K)   TMS320C25
                            K=0
                                                RPTK 49
                                                 MACD
                                                3 WORDS PROG MEMORY
                                                53 CYCLES
         TMS320C25 ADDRESSING MODES
 IMMEDIATE ADDRESSING     - BOTH LONG AND SHORTProgram memory
  CONSTANTS - EXAMPLES:                          ADDK           5
  ADDK    5
                                                     ADDLK
  ADLK > 1325
                                                         1325
 DIRECT ADDRESSING         - SAME AS TMS320C1X BUT DP
  IS 9 BITS   - 512 “BANKS” OF 128 WORDS       - USED         From
  OFTEN FOR LONG SEQUENCES          OF IN-LINE CODEDP      instruction
                                                9 BITS      7 BITS
 INDIRECT ADDRESSING           - B AUXILIARY
  REGISTERS   - USED OFTEN IN PROGRAM LOOPS WITH
                                              OPERAND ADDRESS
  AUTO INC/DEC OPTIONS
    Addressing Mode (contd.)
• Circular buffer addressing mode
• Bit-reversed addressing mode
TMS320C25 AUXILIARY REGISTER INSTRUCTIONS
                        LAR    Load aux-reg w/data
                        LARK   Load AR w/8-bit constant
                        LRLK   Load AR w/16-bit constant
                        MAR    Modify auxiliary register
                        SAR    Store auxiliary register
                        ADRK Add 8-bit constant to AR
                        SBRK   Sub 8-bit constant from
                               AR
                        LARP   Load auxiliary register
                               pointer
TMS320C25 ON-CHIP MEMORY
                MEMORY ORGANIZATION
                4K WORDS ON-CHIP
                MASKED ROM
                544 WORDS ON-CHIP
                DATA RAM
                256 WORDS ON-CHIP
                RAM RECONFIGURABLE
                AS DATA/PROGRAM
                MEMORY
                BLOCK TRANSFERS IN
                MEMORY
                DIRECT, INDIRECT, AND
                IMMEDIATE
                ADDRESSING MODES
BLOCK DIAGRAM OF A TMS320C5X DSP
    General-Purpose Microprocessor
        circa 1984 : Intel 8088
     ~100,000 transistors
     Clock speed : ~ 5 MHz
     Address space : 20 bits
     Bus width : 8 bits
     100+ instructions
     2-35 cycles per instruction
     Micro-coded architecture
            DSP TMS 32010 1984
 Clock 20 MHz
 16 bits
 8, 12 bits addressing space
 ~ 50 k transistors
 ~ 35 instructions
 Harvard architecture
 Hardware multiplier
 Double length accumulator with saturation
 A few special DSP instructions
 Relatively inexpensive
General Purpose Microprocessor 2000
   GHz clock speed
   32-bit address or more
   32-bit bus, 128-bit instructions
   Complex MMU
   Super scalar CPU
   MMX instructions
   On chip cache
   Single cycle execution
   32-bit floating point ALU on board
   Very expensive
   10s of watts of power
                DSP in 2000
   Clock 100 ~ 200 MHz
   16-bit floating point or 32-bit floating point
   16-24 bits address space
   Large on-chip and off-chip memories
   Single cycle execution of most instructions
   Harvard architecture
   Lots of special DSP instructions
   50 mw to 2w power
   Cheap
       Future of DSP Microprocessor
    Sufficiently unique for an independent
     class of applications (HDD, cell phone)
    Low power consumption, low cost
    High performance within power, cost
     constraints (MIPS/mw, MIPS/$)
    Fixed point & floating point
    Better compilers - but users must be
informed
    Hybrid DSP/ GP systems