Class Presentation of
Custom DSP Implementation Course
ECE Department, University of Tehran
on: TMS320C54x DSP Processor
Presented by:
Shahab adin Rahmanian
May 2005
This is a class presentation. All material is copyrighted by its
respective authors, as listed in the references, and has been
used here for educational purposes only.
Outline
Introduction
Architecture
Applications
Features
Instruction Set and addressing
FIR Filtering
Accelerating Polynomial Evaluation
Numerical Issues
Write code in C
Conclusion
Introduction
[2]
TMS320C54x: a fixed-point digital signal processor (DSP) in the TMS320 family.
Low-power DSP: 0.54 mW/MIP
Acceleration for FIR and LMS filtering, code book search,
polynomial evaluation, Viterbi decoding, fast Fourier transform
[4]
Some Typical Applications
General-Purpose
Adaptive filtering
Digital filtering
Fast Fourier transforms
Control
Disk drive control
Laser printer control
Robotics control
Military
Missile guidance
Radar processing
Secure communication
Telecommunications
1200- to 19200-bps modems
Adaptive equalizers
Cellular telephones
Echo cancellation
Video conferencing
Software Applications
Circular Buffers
Single-Instruction Repeat (RPT) Loops
Extended-Precision Arithmetic
Addition and Subtraction
Multiplication
Division
Square Root
Floating-Point Arithmetic
Application-Oriented Operations
Symmetric FIR Filters
Adaptive Filtering
Viterbi Algorithm for Channel Decoding
Fast Fourier Transforms
Some key features
CPU
Advanced multibus architecture with three separate
16-bit data buses and one program bus
40-bit arithmetic logic unit (ALU), including a 40-bit
barrel shifter and two independent 40-bit
accumulators
17-bit × 17-bit parallel multiplier coupled to a 40-bit
dedicated adder for non-pipelined single-cycle
multiply/accumulate (MAC) operation
Memory
192K words × 16-bit maximum addressable memory
space (64K words program, 64K words data, and 64K
words I/O)
28K words × 16-bit single-access on-chip ROM with
8K words configurable as program or data memory
(C541 only)
Some key features
On-chip peripherals
On-chip phase-locked loop (PLL) clock
generator with internal oscillator or external
clock source
Two full-duplex serial ports to support 8- and 16-bit transfers (C541 only)
Time-division multiplexed (TDM) serial port
(C542/C543 only)
One 16-bit timer
Speed: 25/20-ns execution time for a single-cycle fixed-point instruction (40 MIPS/50 MIPS)
with 5-V power supply
C54x Addressing Modes
Immediate: operand is part of the instruction
    ADD #0FFh
Absolute: address of operand is part of the instruction
    LD *(LABEL), A
Register: operand is specified in a register
    READA DATA    ; data read from address in accumulator A
C54x Addressing Modes
Direct: address of operand is part of the instruction (added to implied memory page)
    ADD 010h,A
Indirect: address of operand is stored in a register
    Offset addressing
    Register offset (ar1+ar0)
    Autoincrement/decrement
    Bit-reversed addressing
    Circular addressing
Examples of indirect addressing:
    ADD *AR1
    ADD *AR1(10)
    ADD *AR1+0
    ADD *AR1+
    ADD *AR1+B
    ADD *AR1+0B
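The circular and bit-reversed index updates listed above can be modeled in plain C. This is only a sketch of the index arithmetic: the function names `circ_advance` and `bit_reverse` are hypothetical, and hardware details such as the BK register and buffer alignment rules are not modeled.

```c
/* Advance an index by `step` inside a circular buffer of `size` entries,
   wrapping around the way a circular address update does.
   Assumes 0 <= index < size and 0 <= step <= size. */
unsigned circ_advance(unsigned index, unsigned step, unsigned size)
{
    unsigned next = index + step;
    return (next >= size) ? next - size : next;
}

/* Reverse the low `bits` bits of `index`. This is the access order an
   in-place radix-2 FFT needs, which bit-reversed addressing provides. */
unsigned bit_reverse(unsigned index, unsigned bits)
{
    unsigned result = 0;
    for (unsigned i = 0; i < bits; i++) {
        result = (result << 1) | (index & 1u);  /* move lowest bit across */
        index >>= 1;
    }
    return result;
}
```

For an 8-entry buffer, stepping from index 7 by 1 wraps to index 0, and with 3 address bits index 1 maps to index 4.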
C54x Instruction Set by Category
Arithmetic
ADD
MAC
MAS
MPY
NEG
SUB
ZERO
Logical
AND
BIT
BITF
CMPL
CMPM
OR
ROL
ROR
SFTA
SFTC
SFTL
XOR
Program Control
B
BC
CALL
CC
IDLE
INTR
NOP
RC
RET
RPT
RPTB
RPTZ
TRAP
XC
Application Specific
ABS
ABDST
DELAY
EXP
FIRS
LMS
MAX
MIN
NORM
POLY
RND
SAT
SQDST
SQUR
SQURA
SQURS
Data Management
LD
MAR
MV(D,K,M,P)
ST
Notes
CMPL complement
CMPM compare memory
MAR modify address reg.
MAS multiply and subtract
Block FIR Filtering
y[n] = h_0 x[n] + h_1 x[n-1] + ... + h_{N-1} x[n-(N-1)]
; Addresses: a4 h, a5 N samples of x, a6 input buffer, a7 output buffer
; h stored as linear array of N elements (in prog. mem.)
; x stored as circular array of N elements (in data mem.)
; Modulo addressing prevents need to reinitialize regs each sample
; Moving filter coefficients from program to data memory is not shown
firtask: ld    #firDP,dp            ; initialize data page pointer
         stm   #frameSize-1,brc     ; compute 256 outputs
         rptbd firloop-1
         stm   #N,bk                ; FIR circular buffer size
         ld    *ar6+,a              ; load input value to accumulator a
         stl   a,*ar4+%             ; replace oldest sample with newest
         rptz  a,#(N-1)             ; zero accumulator a, do N taps
         mac   *ar4+0%,*ar5+0%,a    ; one tap, accumulate in a
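As a reference model for the listing above, the same N-tap filter with a circular delay line can be sketched in C. The type `fir_state` and the function `fir_step` are hypothetical names, samples and coefficients are assumed to be Q15, and a 64-bit integer stands in for the 40-bit accumulator. The final right shift assumes the compiler shifts negative values arithmetically.

```c
#include <stddef.h>
#include <stdint.h>

/* State for an N-tap FIR filter with a circular delay line. */
typedef struct {
    const int16_t *h;   /* N coefficients, Q15 */
    int16_t *x;         /* N-sample delay line (circular buffer) */
    size_t n;           /* number of taps */
    size_t newest;      /* index of the most recent sample */
} fir_state;

/* Insert one input sample and return one Q15 output sample. */
int16_t fir_step(fir_state *s, int16_t in)
{
    /* The oldest sample sits just after the newest one: overwrite it. */
    s->newest = (s->newest + 1) % s->n;
    s->x[s->newest] = in;

    int64_t acc = 0;                 /* stand-in for the 40-bit accumulator */
    size_t idx = s->newest;
    for (size_t k = 0; k < s->n; k++) {
        acc += (int32_t)s->h[k] * s->x[idx];     /* h[k] * x[n-k] */
        idx = (idx == 0) ? s->n - 1 : idx - 1;   /* step back with wrap */
    }
    return (int16_t)(acc >> 15);     /* Q30 sum back to Q15, truncating */
}
```

With coefficients 0.5 and 0.25 (16384 and 8192 in Q15) and an input of 0.5 followed by zeros, the outputs are 0.25, 0.125, and then 0.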
Accelerating Symmetric FIR Filtering
Coefficients in linear phase filters are either
symmetric or anti-symmetric
Symmetric coefficients using 2 mults 3 adds
y[n] = h0 x[n] + h1 x[n-1] + h1 x[n-2] + h0 x[n-3]
y[n] = h0 (x[n] + x[n-3]) + h1 (x[n-1] + x[n-2])
Accelerated by FIRS (FIR Symmetric) instruction
x in two circular buffers
h in program memory
Accelerating Symmetric FIR Filtering
; Addresses: a6 input buffer, a7 output buffer
; a4 array with x[n-4], x[n-3], x[n-2], x[n-1] for N = 8
; a5 array with x[n-5], x[n-6], x[n-7], x[n-8] for N = 8
; Modulo addressing prevents need to reinitialize regs each sample
firtask: ld    #firDP,dp              ; initialize data page pointer
         stm   #frameSize-1,brc       ; compute 256 outputs
         rptbd firloop-1
         stm   #N/2,bk                ; FIR circular buffer size
         ld    *ar6+,b                ; load input value to accumulator b
         mvdd  *ar4,*ar5+0%           ; move old x[n-N/2] to new x[n-N/2-1]
         stl   b,*ar4%                ; replace oldest sample with newest
         add   *ar4+0%,*ar5+0%,a      ; a = x[n] + x[n-N/2-1]
         rptz  b,#(N/2-1)             ; zero accumulator b, do N/2-1 taps
         firs  *ar4+0%,*ar5+0%,coeffs ; b += a * h[i], do next a
         mar   *+ar4(2)%              ; to load the next newest sample
         mar   *ar5+%                 ; position for x[n-N/2] sample
         sth   b,*ar7+
firloop: ret
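The saving from folding a symmetric filter can be checked with a small C comparison. This is an illustrative sketch rather than a model of the FIRS instruction itself: `fir_direct` and `fir_symmetric` are hypothetical names, `x[k]` is assumed to hold x[n-k], and the tap count is assumed even.

```c
#include <stddef.h>
#include <stdint.h>

/* Direct form: one multiply per tap. x[k] holds x[n-k]. */
int64_t fir_direct(const int16_t *h, const int16_t *x, size_t n)
{
    int64_t acc = 0;
    for (size_t k = 0; k < n; k++)
        acc += (int32_t)h[k] * x[k];
    return acc;
}

/* Folded form for symmetric h (h[k] == h[n-1-k], n even):
   add the two samples that share a coefficient, then multiply once. */
int64_t fir_symmetric(const int16_t *h, const int16_t *x, size_t n)
{
    int64_t acc = 0;
    for (size_t k = 0; k < n / 2; k++) {
        int32_t pair = (int32_t)x[k] + x[n - 1 - k];  /* pre-add */
        acc += (int64_t)h[k] * pair;                   /* one multiply */
    }
    return acc;
}
```

For h = {1, 2, 2, 1} and x = {10, -20, 30, 40}, both forms give 70, with the folded form using two multiplies instead of four.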
Architecture - FIRS
Accelerating Polynomial Evaluation
Function approximation and spline interpolation
Fast polynomial evaluation (N coefficients)
y(x) = c0 + c1 x + c2 x^2 + c3 x^3 (expanded form)
y(x) = c0 + x (c1 + x (c2 + x (c3))) (Horner's form)
POLY reduces 2N cycles using MAC+ADD to N cycles
; ar2 contains address of array [c3 c2 c1 c0]
; poly uses temporary register t for multiplicand x
; first two times poly instruction executes gives
; 1. a = c(3) + x * 0 = c(3); b = c2
; 2. a = c(2) + x * c(3);     b = c1
         ld    *ar2+,16,b   ; b = c3 << 16
         ld    *ar3,t       ; t = x (ar3 contains addr of x)
         rptz  a,#3         ; a = 0, repeat next inst. 4 times
         poly  *ar2+        ; a = b + x*a || b = c(i-1) << 16
         sth   a,*ar4       ; store result (ar4 is addr of y)
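Horner's form maps onto a loop with one multiply-add per coefficient, which is the pattern the POLY listing above repeats. The sketch below uses `double` for readability, whereas the C54x code works on 16-bit fractions; the function name `horner` is hypothetical.

```c
#include <stddef.h>

/* Evaluate c[0] + c[1]*x + ... + c[n-1]*x^(n-1) with Horner's rule.
   Coefficients are consumed from the highest order down. */
double horner(const double *c, size_t n, double x)
{
    double acc = 0.0;
    for (size_t i = n; i-- > 0; )
        acc = acc * x + c[i];   /* one multiply-add per coefficient */
    return acc;
}
```

With c = {1, 2, 3, 4} and x = 2, the loop produces 4, 11, 24, 49, which matches 1 + 2·2 + 3·4 + 4·8 = 49.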
Integer Multiplication
Integer multiplication yields products larger than the inputs, as
can be seen in the example below, using single digit decimal
values as inputs:
Does the user store the lower (1) or upper (8) result?
Both must be kept, resulting in additional resources (two
cycles, words of code, and RAM locations) to complete the
store.
Worse, how can the double-sized result be used recursively as
an input in later calculations, given that the multiplier inputs
are single-width?
Fractional Multiplication
Multiplication of fractions yields products that never exceed
the range of a fraction, as can be seen in the example below,
using single digit decimal fractions as inputs:
Don't we still have a double-sized result to store?
In this case, we can store just the upper result (.8)
This allows storage of result with fewer resources
Results may be used recursively
Has accuracy been lost by dropping the lower accumulator value?
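The contrast between the integer and fractional views can be made concrete for 16-bit operands. This is a minimal sketch assuming the Q15 convention (value = integer / 32768); `mul_int` and `mul_q15` are hypothetical names. It does not handle the one overflowing fractional case, -1 × -1, and it assumes right shifts of negative values are arithmetic.

```c
#include <stdint.h>

/* Integer view: the 16 x 16 product needs all 32 bits to be exact. */
int32_t mul_int(int16_t a, int16_t b)
{
    return (int32_t)a * b;
}

/* Fractional (Q15) view: Q15 x Q15 gives Q30, so shifting right by 15
   returns a single-width Q15 result and discards the low-order bits. */
int16_t mul_q15(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * b;
    return (int16_t)(p >> 15);
}
```

Here 300 × 300 = 90000 no longer fits in 16 bits, while 0.5 × 0.5 in Q15 (16384 × 16384) returns 8192, which is 0.25 and can be fed straight back into the multiplier.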
Accuracy vs. Precision
Often the programmer wants to retain the fullest
accuracy of a calculation, thus dropping the 16
LSBs of the result in the previous example seems a
bad choice.
Note though, the inputs: how much accuracy do
they offer?
The product offers double precision but its
accuracy is based on the single-width inputs.
Thus, storing a single precision result is not only an
efficient solution, but represents the limit of the
accuracy of the result.
The accumulator is double-sized for two reasons:
To allow for integer operations, which would
possibly require the LSBs for the result.
So that sum-of-product operations will generate
accumulative noise at the 32nd vs. the 16th bit.
Redundant Sign Bit
Multiplication of two signed
numbers yields product with
two sign bits
Extra sign bit causes
problems if stored to memory
as result:
Wastes space
Creates off-size Q
Solution: Fractional mode bit!
When FRCT (mode bit in ST1) is set, the multiplier
output is left-shifted by one
For 16-bit C54x: Q15 × Q15 = Q15
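The effect of the one-bit left shift can be seen by comparing the upper 16 bits of the product with and without it. The sketch below is a software model under stated assumptions, not a description of the hardware datapath: `mpy` and `high_word` are hypothetical names, and the `frct` argument simply stands in for the FRCT mode bit.

```c
#include <stdint.h>

/* Multiply two Q15 values into a 32-bit product. When frct is nonzero
   the product is shifted left by one, removing the redundant sign bit. */
int32_t mpy(int16_t a, int16_t b, int frct)
{
    int32_t p = (int32_t)a * b;   /* Q30: bits 31 and 30 are both sign */
    return frct ? (int32_t)((uint32_t)p << 1) : p;   /* Q31 if shifted */
}

/* Upper 16 bits of a 32-bit product, i.e. what a high-word store keeps. */
int16_t high_word(int32_t p)
{
    return (int16_t)(p >> 16);
}
```

For 0.5 × 0.5, the high word is 4096 without the shift (0.25 in an off-size Q14 format) and 8192 with it (0.25 in Q15).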
Accumulation
With fractions, we were able to guarantee that
no multiplicative overflow could occur, i.e.,
F × F ≤ F.
For addition, this rule does not apply, i.e., F + F > F.
Therefore, we need additional measures to
manage the possibility of overflow for
accumulation. Two general methods apply:
Guard Bits: the C54x offers an 8-bit
extension above the high accumulator to
allow valid representation of the result of up
to 256 summations.
Non-gain Systems: offer additional criteria
that allow a simple solution for unlimited
length summations.
Guard Bits and saturation
Guard Bits: the C54x offers an 8-bit extension above
the high accumulator to allow valid representation of
the result of up to 256 summations.
Saturation (SAT)
SAT instruction saturates a value exceeding the
32-bit range in the selected accumulator:
    SAT A
    SAT B
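Guard bits and saturation can be modeled together with a 64-bit integer wrapped to 40 bits. This is a sketch under assumptions: `wrap40`, `accumulate`, and `sat32` are hypothetical names, and the conversion of large unsigned values back to signed relies on the usual two's complement behavior.

```c
#include <stdint.h>

/* Wrap a value to 40 bits, two's complement, like a 40-bit accumulator. */
int64_t wrap40(int64_t v)
{
    uint64_t u = (uint64_t)v & 0xFFFFFFFFFFull;   /* keep 40 bits */
    if (u & 0x8000000000ull)                      /* bit 39 is the sign */
        u |= ~0xFFFFFFFFFFull;                    /* sign-extend to 64 */
    return (int64_t)u;
}

/* Add `count` copies of `term` into the 40-bit accumulator model. */
int64_t accumulate(int32_t term, int count)
{
    int64_t acc = 0;
    for (int i = 0; i < count; i++)
        acc = wrap40(acc + term);
    return acc;
}

/* Clamp an accumulator value to the 32-bit range, as SAT does. */
int64_t sat32(int64_t acc)
{
    if (acc > INT32_MAX) return INT32_MAX;
    if (acc < INT32_MIN) return INT32_MIN;
    return acc;
}
```

Summing the largest positive 32-bit value 256 times still fits in 40 bits; the 257th addition wraps the model negative. Saturating the valid 256-term sum returns the largest positive 32-bit value.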
Non-gain Systems
Many systems can be modeled to have no DC gain:
Filters with low Q.
Any system scaled by its maximum gain value.
Input values from A/D converters are automatically
fractions, if the limits of the A/D are presumed to be +/-1
Coefficient values can similarly be bounded by making the
largest value the scaling factor for all other values.
For these systems, it is known that the final value of the
process is less than or equal to the input values.
The accumulator therefore can be allowed to temporarily
overflow, since the final result is known to be bounded by +/-1.
Allows maximum usage of selected A/D and D/A
converters
D/A bits for gain are more expensive than using analog
components
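Temporary overflow is harmless because two's complement addition is arithmetic modulo a power of two: if the true final sum is in range, the wrapped sum equals it. The sketch below shows this with a 16-bit wraparound accumulator; `wrap_sum` is a hypothetical name, and unsigned arithmetic is used because its wraparound is well defined in C.

```c
#include <stdint.h>

/* Sum Q15 terms in a 16-bit wraparound accumulator (modulo 2^16). */
int16_t wrap_sum(const int16_t *terms, int count)
{
    uint16_t acc = 0;
    for (int i = 0; i < count; i++)
        acc = (uint16_t)(acc + (uint16_t)terms[i]);  /* wraps silently */
    return (int16_t)acc;   /* reinterpret as two's complement */
}
```

Adding 0.75 + 0.5 - 0.75 in Q15 passes through 1.25, which overflows 16 bits, yet the final result is exactly 0.5 (16384).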
Division
The C54x does not have a single cycle 16-bit divide
instruction
Divide is a rare function in DSP
Division hardware is expensive
The C54x does have a single cycle 1-bit divide
instruction: conditional subtract or SUBC
Preceded by RPT #15, a 16-bit divide is performed
Is much faster than without SUBC
The SUBC process operates only on unsigned operands,
thus software must:
Compare the signs of the input operands
If they are alike, plan a positive quotient
If they differ, plan to negate (NEG) the quotient
Strip the signs of the inputs
Perform the unsigned division
Attach the proper sign based on the comparison of the
inputs
Division Routine
B = num*den (tells sign)
Strip sign of numerator
Strip sign of denominator
16 iterations
1-bit divide
If result needs to be
negative
Invert sign
Store negative result
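The steps above can be sketched in C, with a helper that models one conditional subtract. This is a software model under assumptions, not TI's routine: `subc_step` and `div16` are hypothetical names, the denominator is assumed nonzero, and the quotient is returned as 32 bits so that -32768 / -1 does not overflow.

```c
#include <stdint.h>

/* One conditional-subtract step: try to subtract the divisor aligned at
   bit 15. On success keep the difference, shift left and set the new
   quotient bit; otherwise only shift left. */
uint32_t subc_step(uint32_t acc, uint16_t den)
{
    int64_t diff = (int64_t)acc - ((int64_t)den << 15);
    return (diff >= 0) ? (uint32_t)((diff << 1) + 1) : (acc << 1);
}

/* Signed 16-bit division from 16 conditional subtracts. den != 0. */
int32_t div16(int16_t num, int16_t den)
{
    int negative = (num < 0) != (den < 0);   /* signs differ? */

    /* Strip the signs of both operands. */
    uint32_t acc = (uint32_t)(num < 0 ? -(int32_t)num : num);
    uint16_t d   = (uint16_t)(den < 0 ? -(int32_t)den : den);

    for (int i = 0; i < 16; i++)             /* 16 one-bit divide steps */
        acc = subc_step(acc, d);

    /* Quotient ends in the low 16 bits, remainder in the high 16 bits. */
    int32_t q = (int32_t)(acc & 0xFFFFu);
    return negative ? -q : q;                /* attach the proper sign */
}
```

For example, 100 / 7 leaves 14 in the low half and the remainder 2 in the high half, and -7 / 2 returns -3.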
Rounding
Result of multiplication can be rounded for MPY, MAC,
and MAS operations. This is specified by appending an R suffix
to the instruction.
Example: MAC with rounding is MACR. Rounding consists of
adding 2^15 to the result and then clearing the low accumulator.
In a long sum-of-products, only the last MAC operation should
specify rounding:
Rounding can also be achieved with a load
operation:
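The rounding rule (add 2^15, then clear the low 16 bits) can be written directly in C. This is a minimal sketch on a 32-bit value; the real accumulator is 40 bits wide, and the name `round_acc` is hypothetical.

```c
#include <stdint.h>

/* Round a 32-bit accumulator value to its upper 16 bits:
   add 2^15, then clear the low 16 bits. */
int32_t round_acc(int32_t acc)
{
    uint32_t u = (uint32_t)acc + 0x8000u;   /* add 2^15 */
    return (int32_t)(u & 0xFFFF0000u);      /* clear the low half */
}
```

A low half of 0x8000 or more rounds the high half up, so 0x00018000 becomes 0x00020000, while 0x00017FFF becomes 0x00010000.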
Sign Extension (SXM)
Write code in C
Inline Assembly
Allows direct access to assembly language from C
Useful for operating on components not used by
C, ex:
Note: first column after leading quote is label field
Long operations should be written in ASM and called
from C
main C file retains portability
yields more easily maintained structures
eliminates risk of interfering with registers in use by C
Accessing MMRs from C
Using pointers to access Memory-Mapped
Registers:
Create a pointer and set its value to the assigned memory address:
volatile unsigned int *SPC_REG = (volatile unsigned int *) 0x0
Read and write to the register as any other pointer:
*SPC_REG = 0xC8;
Accessing I/O Ports from C
1. Create the port:
   ioport unsigned port8000;
2. Access the port:
   x = port8000;
   port8000 = y;
Summary and Conclusion
C54x is a conventional digital signal processor
Separate data/program busses (3 reads & 1
write/cycle)
Extended precision accumulators
Single-cycle multiply-accumulate
Saturation and wraparound arithmetic
Bit-reversed and circular addressing modes
C54x has instructions to accelerate algorithms
Communications: FIR & LMS filtering, Viterbi decoding
Speech coding: vector distances for code book search
Interpolation: polynomial evaluation
References
[1] Texas Instruments, TMS320C54x DSP Design Workshop, May 1997
[2] TMS320C54x User's Guide
[3] www.ti.com
[4] Prof. Brian L. Evans, Signal and Image Processing on the TMS320C54x DSP
[5] TMS320C54x Assembly Language Tools