Archi Modelling
DSP CPU Architecture Modelling with Matlab
  - WHY?
  - DESIGN FLOW
  - INTRODUCTION: FXP issues, MATLAB issues, modeling
  - TERMS & DEFINITIONS
  - TUTORIAL PROGRESS
  - DSPMA1
  - VNEG
  - ADD2 MODULO
  - FUSED ADDERS
  - ACCUMARRAY
  - DIFF
  - VPU DESIGN
  - FILTER DESIGN (IIR1, PMAVF256, LMS, NLMS)
  - EQUALIZER
TUTORIAL PROGRESS
  - DSPMA1: overview of typical issues
  - VNEG: any type (FXP uses intXX), any shape (vector), introduction to the VU; exercise: NEG modulo
  - ADD2 MODULO: we cover how to do modulo
  - FUSED ADDERS: more on parallelism
  - ACCUMARRAY: some architecture issues, interesting problems
  - DIFF: to do
  - VPU: internal sequencer
  - IIR1: introduction to all filters, states
  - PMAVF256: massive parallelism
WHY? Typical DSP architecture issues
  - How much parallelism? Multi-lane (2, 4, 8, 16) study.
  - How wide? Bit-exact hardware, near-bit-exact hardware.
  - MAC operator: multiple units; mixing rounding and saturation; refer to the ITU library (all single, all serial).
  - Compound operators: how to deal with multiple overflows? Where to do saturation?
  - Complex (as in A+iB) operators.
  - DSP functions: where? Coprocessing issues.
  - MATLAB-specific functions: how? when?
MATHWORKS design flow
  - Design hierarchy: functional model, bit accurate model, time accurate model, pipeline model.
  - Tool (language) name: MATLAB, SIMULINK, VERILOG.
  - Alternative names: functional = behavioral; bit accurate = bit exact; time accurate = phase accurate; pipeline = microarchitecture.
TWO DESIGN FLOWS
  - Design flows: behavioral model, bit accurate model, phase accurate model, pipeline model.
  - MATLAB: powerful environment; more functions than any other language; integrated from top to bottom (pipe dream?).
  - SIMULINK: implicit timing and concurrency; access tools built directly into models; easy to change parameters to model the FXP effects (rounding, overflow, scaling).
INTRODUCTION
  - Arithmetic issues and FXP problem: 3 modal arithmetic types, propagation mode
  - MATLAB issues
  - Modeling concepts
INTRO - Arithmetic issues and FXP problem
  This is the FXP perspective: 3 types of modal arithmetic.
  - Modulo: on overflow, wrap around (max+1 wraps to the other end of the range, which is 0 for unsigned types). 32 + 32 -> 32. Ex: C integer type, hardware.
  - Saturated: on overflow, saturate (max+1 becomes max). 32 + 32 -> 32. Ex: Matlab integer type (intXX).
  - Promoted: overflow is impossible (*). 32 + 32 -> 33 (width increase). Ex: floating point (*), 8-bit data in 256-bit, free hardware.
    (*) within the limits of the precision.
  - Propagation: in Matlab, integer supersedes FP; in C, FP supersedes integer.
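
As a rough illustration of the three modes and of the propagation rule, the script below applies each of them to the same pair of 16-bit operands. The operand values and variable names are arbitrary choices for this sketch, and the modulo result is obtained here with `mod` on a promoted sum, which is only one possible way to wrap.

```matlab
% Three arithmetic modes on the same 16-bit operands (illustrative values).
x = int16(30000);
y = int16(10000);

% Saturated: MATLAB's native intXX behavior, the result clips at intmax.
z_sat = x + y;                                      % 32767

% Promoted: widen before adding, so the sum cannot overflow.
z_pro = int32(x) + int32(y);                        % 40000

% Modulo: wrap the promoted sum back into the 16-bit range.
z_mod = int16(mod(z_pro + 32768, 65536) - 32768);   % -25536

% Propagation: an integer combined with a double scalar stays an integer.
w = x + 2.5;
class(w)                                            % 'int16'
```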
INTRO - MODULO vs SATURATED ARITHMETIC
  - Matlab uses saturated arithmetic, embedded in the (U)INTXX type. Note that it is attached to the type, not to the operator.
  - Saturation is easily added to an operator working in modulo mode. Example, 32-bit add:
      z = x + y
      overflow = xor(z.32, z.31)
      if (overflow) z = sat(z)
    This works as long as we have access to bit 32 (which means a larger width).
  - Matlab poses the unusual problem of having to add modulo mode to an operator working in saturated mode.
  - Solution: use promotion.
      Example, 16-bit add:
        first promote: z = int32(x) + int32(y)
        if overflow, correct the sum: subtract power(2,16) if positive (resp. add power(2,16) if negative)
        final operation, demote: z = int16(z)
      Example, 32-bit add:
        first promote: z = double(x) + double(y)
        (note that for 32-bit, z = int64(x) + int64(y) does not work in the Matlab version used here)
        if overflow, correct the sum: subtract power(2,32) if positive (resp. add power(2,32) if negative)
        final operation, demote: z = int32(z)
  - Issues: the operation must not need more than double the number of bits. Bit operations (such as xor) only work on unsigned types.
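
The "saturation added on top of a modulo sum" idea can be sketched as follows, scaled down to 16 bits so that the 17-bit sum fits comfortably in a double. The function name `add16_sat` is hypothetical, and the two sign bits are extracted with `floor` and `mod` rather than bit operations, to stay clear of the unsigned-only restriction mentioned above.

```matlab
function z = add16_sat(x, y)
% ADD16_SAT  Saturating 16-bit add derived from a wider modulo sum (scalar).
% Hypothetical helper; mirrors "overflow = xor(bit 16, bit 15)" on a 17-bit sum.
    u   = mod(double(x) + double(y), 2^17);   % 17-bit two's-complement pattern
    b16 = floor(u / 2^16);                    % sign bit of the 17-bit sum
    b15 = mod(floor(u / 2^15), 2);            % sign bit of the 16-bit result
    if xor(b16, b15)                          % bits differ: 16-bit overflow
        if b16, z = int16(-32768); else, z = int16(32767); end
    else
        z = int16(u - 2^17 * b16);            % in range: reinterpret as signed
    end
end
```

For instance, `add16_sat(int16(32767), int16(1))` clips to 32767, while `add16_sat(int16(-5), int16(3))` returns -2 unchanged.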
INTRO - MODULO vs SATURATED ARITHMETIC - TestBench
  - Modulo and saturated differ only in case of overflow.
  - When can overflow happen?
      Direct demotion: long to short, short to byte.
      Hidden demotion (promotion followed by demotion):
        ADD: 32 + 32 -> 33 -> 32
        MUL: 16 x 16 -> 32 -> 16
        SHIFT: (32-bit) << 8 -> (40-bit) -> 32-bit
  - The saturation/modulo operation happens at demotion, but the overflow detection can be done in many places.
  - The test cases can all be simplified to a single case of demotion (no need to test add, shift, etc.).
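
A small sketch of why a single demotion test is enough: the add, multiply and shift below all produce an exact wide intermediate, and the two modes differ only in the final demotion. Widths are scaled down (16-bit operands, 32-bit intermediates) so everything fits in int32; `demote_sat` and `demote_mod` are hypothetical helper names, and the shift is written as a multiply by 4.

```matlab
% One wide intermediate, two ways to demote it to 16 bits.
demote_sat = @(v) int16(v);                                      % MATLAB casts saturate
demote_mod = @(v) int16(mod(double(v) + 32768, 65536) - 32768);  % wrap-around

a = int16(20000);
b = int16(3);

s_add = int32(a) + int32(a);      % ADD:   16 + 16 -> 17 bits
s_mul = int32(a) * int32(b);      % MUL:   16 x 16 -> 32 bits
s_shl = int32(a) * 4;             % SHIFT: 16 bits << 2 -> 18 bits

r = [s_add, s_mul, s_shl];        % 40000  60000  80000
[demote_sat(r); demote_mod(r)]    % row 1 saturated, row 2 modulo
```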
INTRO - MATLAB issues
  - Index problem: the index is not a problem for "for" loops, but it is a problem for vectors. Use the C convention:
      for n = 0:N-1
        z(n+MLABIDX) = -x(n+MLABIDX)
    with MLABIDX = 1 defined locally in each unit. For N-dim arrays: I don't know yet.
  - The int data type is saturated. How to have integers using modulo arithmetic?
  - Vector organization: column or row? For instance, accumarray requires a vector organised as Nx1, not 1xN.
  - Common src/dst or predicated functions: if (sign) x = abs(x); else x = x. Predicated functions will touch (or not) the destination; in Matlab dst is always modified, so this is impossible to implement.
  - Matlab does not have pointers; arguments are passed by value.
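
The index convention and the column-vector requirement might look like this in practice. The data values are arbitrary, and forcing a column with `(:)` is just one common idiom, not necessarily what the original units do.

```matlab
MLABIDX = 1;                 % offset from C-style (0-based) to MATLAB (1-based) indices
N = 8;
x = int16(1:N);
z = zeros(1, N, 'int16');
for n = 0:N-1                % loop written with C-convention indices
    z(n + MLABIDX) = -x(n + MLABIDX);
end

% accumarray wants column vectors: force Nx1 with (:) whatever the input shape.
idx  = [1 2 1 3];            % 1x4 row vector
vals = [10 20 30 40];
acc  = accumarray(idx(:), vals(:));   % [40; 20; 40]
```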
INTRO - STRUCTURE OF A DSP UNIT
  [Figure: a DSP unit written as MATLAB code, containing a behavioral model, a bit accurate model, a phase accurate model and debugging hooks]
INTRO to MODELING: Differences between Bit and Phase accurate
  [Figure: the bit accurate model shown as functions f1, f2, f3, f4; the phase accurate model shown as stage 1, stage 2, stage 3 holding f1 to f4, with a decoder (switch/case) driven by the opcode]
TERMS & DEFINITIONS
  - Type
  - Shape
  - Flow processing
  - Building block
  - Top level unit naming conventions
  - Arithmetic unit naming conventions
  - Variable naming conventions: internal to units, global to testbench
TERMS & DEFINITIONS - types and shapes
  - Type: FXP is the same as Q format.
      Integer: all integer (no fractional part); generally software types (short, long), limited to 8, 16, 32 bits. In Q format all integers are signed (8q0, 16q0, 32q0); Q format could be extended to 25q0, 17q0, etc.
      Fractional: the number is all fractional (1q15, 1q31).
  - Shape:
      scalar
      vector (1xN = row vector, Nx1 = column vector); a 1x1 vector can be different from a scalar
      array (NxM)
      matrix == array with special properties (linear algebra)
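
To make the fractional 1q15 type concrete, here is a minimal conversion and multiply sketch stored in int16. The helper names `to_q15` and `from_q15` are hypothetical, and truncation with `floor` is an assumed rounding choice for the product.

```matlab
% 1q15: 1 sign bit + 15 fractional bits, stored in an int16.
to_q15   = @(v) int16(round(v * 2^15));     % int16() saturates +1.0 to 32767
from_q15 = @(q) double(q) / 2^15;

q = to_q15([0.5, -0.25, -1.0, 1.0]);        % 16384  -8192  -32768  32767
v = from_q15(q);                            % back to floating point

% Q15 multiply: 16 x 16 -> 32-bit product, then scale by 2^-15 and demote.
p = int16(floor(double(q(1)) * double(q(2)) / 2^15));   % 0.5 * -0.25 -> -4096
```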
TERMS & DEFINITIONS - Flow processing
  - Flow processing: alternative name is scheduling.
  - Types of flow processing:
      sample (single sample): alternative name is sample by sample; sample in, sample out
      streaming (several samples): alternative name is block-by-block processing; data block: a block? a chunk?
      block processing (complete block): alternative names are full block processing, vector processing, array processing
TERMS & DEFINITIONS - Building Block
  - Building block: alternative names are unit, function.
  - Complexity level: top level, MSI structural unit; application level; structural unit (arithmetic, vector); top level, LSI (alu, bmu); functional unit; DSP function; Matlab function.
  - Processing unit: alternative name is execution unit.
  - Vector processing unit: alternative names are VPU, Vector Unit.
TERMS & DEFINITIONS - Top level BB naming conventions
  - unitnameNN where NN = complexity level:
      alu0, alu1, alu2, alu3, alu4
      mac0, mac1, mac2, mac3
      bmu0, bmu1, bmu2
      ffter1, ffter2, ffter4
      cafir1, cafir2
  - unitnameNN where NN = brand name:
      alu2901, mac1616, macSPARTAN-DSP48A
TERMS & DEFINITIONS - arithmetic unit naming conventions
  - Type agnostic, multi i/o: functionnameNNOO
      NN = number of inputs (an input can be any width or type)
      OO = number of outputs
      Example: 15-input adder is add151. Note that if the signal is 32-bit wide, the total number of pins is big.
  - Type relevant, 1 vector to 1 vector: functionnameLLTT
      LL = number of lanes
      TT = type (F, L, S)
      Example: accumarray, 16-lane, 32-bit data type is accumar16_L.
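TERMS & DEFINITIONS - variable naming conventions
  - 2+1 unit: inputs x y; output z
  - 3+2 unit: inputs x y v; outputs z w
  - 3+2 unit: inputs a b c; outputs d e
  - M+N unit: inputs x y v a b c d ...; outputs z w u t s ...
  - 2+1 unit: inputs a b; output c
  - 2+1 unit: inputs b c; output a; internal vars = d e f g h ...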
TERMS & DEFINITIONS - Testbench global variable naming conventions
  - PRINT_MANAGER: x y z w zgolden wgolden
  - CHECKER_MANAGER: gold z maxerr differ bounds
  - BUFFER_MANAGER: input buffers x y a b; output buffers z w u t
  - GENE_MANAGER (8 genes): geneA, B, C, D, E, F, G, H
OVERVIEW - DSPMA1
  - What is a CPU? A DSP?
  - 4 major blocks: DPU, LSU, PCU, COP, plus register files, MEM and I/O.
  - Not all blocks need modeling: a DSP is 90% datapath.
  - Eventually it is better to model more: memory hierarchy (at least level 1), program control.
WORKSHOP1 - VNEG
  - Why? General introduction to the power of Matlab: typeless, shapeless.
  - Dealing with parallelism: multi-lane is straightforward, but only for "pure parallel" vector processing, where the sequence of input data is independent (same for the results).
  - Workshop target: introduction to the testbench style; working with vectors; introduction to vector/scalar issues.
WK1 - VNEG (description)
  - What? Negate building block: same code for all types and all shapes.
  - Examples:
      1. z = neg(x)
         z depends on the type and shape of x; the default is a matrix of FP complex numbers.
      2. x = int16(bufA); z = neg(x)
         z is a vector of shorts with the length of buffer A.
      3. bufA = gene(1:40); x = int16(bufA); z = neg(x)
         z is a vector of 40 shorts.
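
A typeless, shapeless `neg` can be as small as the sketch below; the actual neg.m is not reproduced in the slides, so this body is an assumption.

```matlab
function z = neg(x)
% NEG  Negate any numeric input; the output inherits the type and shape of x.
    z = -x;
end
```

With this definition, `neg(int16(1:40))` returns a 1x40 int16 vector and `neg(3+4i)` returns a complex double, without any type-specific code.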
WK1 - VNEG (design & issues)
  - Initially this is a straightforward design with only minor cosmetic issues: the test bench style might not be the best for everybody.
  - Vector design is more involved.
      Shape: the input vector is cut into blocks; the right equation is needed to compute the length of each block.
      Type: shall we keep it typeless? The modulo type might need more work than expected.
  - The Matlab intXX type is saturated. Ex. for int16: neg(-32768) -> 32767 in Matlab; in C code (or modulo), neg(-32768) -> -32768.
  - To create a modulo type, take care to keep the code vectorized (do not use an "if" construct to detect overflow).
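
One way to get a modulo negate without an "if" is to promote, negate, and wrap the overflowed lanes with a logical mask. The name `neg16_mod` is hypothetical and the sketch is limited to int16 input.

```matlab
function z = neg16_mod(x)
% NEG16_MOD  Modulo (wrap-around) negate for int16 arrays of any shape.
% Hypothetical name; no "if", so the code stays vectorized.
    w   = -int32(x);                 % promote, then negate: no overflow
    ovf = w > 32767;                 % true only where x == -32768
    w   = w - 65536 * int32(ovf);    % wrap the overflowed lanes by 2^16
    z   = int16(w);                  % demote, all values now in range
end
```

The only input that behaves differently from the native `-x` is -32768, which maps to itself here instead of saturating to 32767.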
WK1 - VNEG (list of files)
  - Functions: neg.m, neg40.m, neg40s.m, vneg40.m, vneg40s.m
  - Testbench files: run.m, mmain.m, test_manager.m, gene_manager.m
WORKSHOP2 - ADD2 MODULO
  - Why? Some processing elements require modulo arithmetic: checksum, coding.
  - Some DSP algorithms have better accumulator behavior in modulo arithmetic than in saturated arithmetic.
  - Many DSP system designers prefer to use modulo instead of saturated arithmetic (boys vs men?).
  - Workshop target: introduction to vector/scalar issues.
WK2 - ADD2 MODULO (description)
  - What?
      32-bit 2-input adder (add2ml), l for long; identically 16-bit and 8-bit, s for short, b for byte.
      32-bit 2-input adder (add2mul), ul for unsigned long; identically 16-bit and 8-bit, us for unsigned short, ub for unsigned byte.
  - Example: z = add2ms(x,y)
WK2 - ADD2 MODULO (design & issues)
  - Input and output variables:
      type: strongly typed, unsigned/signed (long, short, byte). Detection of overflow needs the width; the alternative is a multi-functional adder, but this requires parameters, which is silly (or this becomes an alu).
      shape: scalar.
  - Control parameters? None; this is the simplest of adders.
  - Issues with Matlab: why not both scalar and vector? The "if" condition is easier to do in scalar mode; vector mode needs the use of find.
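
In vector mode, the overflow test turns into a pair of `find` calls, roughly as follows. This is only a sketch of what a vector add2ms could look like; the original add2ms.m is not shown in the slides.

```matlab
function z = add2ms(x, y)
% ADD2MS  2-input modulo adder for signed shorts (int16), vector version.
% A sketch only: the original add2ms.m is not shown in the slides.
    z  = int32(x) + int32(y);        % promote so the raw sum is exact
    hi = find(z > 32767);            % lanes with positive overflow
    lo = find(z < -32768);           % lanes with negative overflow
    z(hi) = z(hi) - 65536;           % wrap down by 2^16
    z(lo) = z(lo) + 65536;           % wrap up by 2^16
    z  = int16(z);                   % demote
end
```

Logical indexing (`z(z > 32767)`) would work equally well; `find` is used here only because the slide mentions it.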
WK2 - ADD2 MODULO (list of files)
  - Functions: add2ml.m, add2ms.m, add2mb.m, add2mul.m, add2mus.m, add2mub.m
  - Testbench files: run.m, mmain.m, test_manager.m, gene_manager.m, print_manager.m
WK2 - ADD2 MODULO (reviewing dfm work)
  - Functions:
      SHAPE: all units are vector based.
      TYPE: all units expect and return a type.
      add2ml.m: promotes to double; it should be int64 but Matlab does not allow it. Uses fix to go from double to int.
      add2mul.m: same.
      add2ms.m, add2mus.m: nope.
      add2mb.m: promotes to int32; could be int16.
      add2mub.m: same.
  - Testbench files:
      run.m, mmain.m
      test_manager.m: devectorization -> find a better buffer manager??
      gene_manager.m: sequences have a larger width than the operators, i.e. short sequences are used to test add2 on bytes. Normally only byte sequences should be used; that is enough to create overflow.
      dp_manager.m: none.
      print_manager: printf in decimal is hardly readable for long; ulong uses hex, but then hex is not exactly the answer for signed long.
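
For the 32-bit case, promotion to double is enough because a 33-bit sum is represented exactly. The sketch below follows that idea but is not the reviewed add2ml.m; it wraps with logical masks and needs no fix() call since the values stay integral.

```matlab
function z = add2ml(x, y)
% ADD2ML  2-input modulo adder for signed longs (int32), vector version.
% Sketch assuming promotion to double, as the review notes describe.
    s = double(x) + double(y);       % 33-bit sum is exact in a double
    s = s - 2^32 * (s > 2^31 - 1);   % positive overflow: subtract 2^32
    s = s + 2^32 * (s < -2^31);      % negative overflow: add 2^32
    z = int32(s);                    % demote; s is integral and in range
end
```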
WORKSHOP3 - FUSED ADDERS
  - Why? To build multi-lane processing elements you need fused adders, e.g. any function with sum(..):
      accumulator = sum(x*k)
      sumabsdif = sum(abs(x-y))
  - Example: 16-lane sumabsdif.
  - Workshop target: improved test bench; FP vs FXP. How to test? Use of checkers with max errors; quasi bit exact.
  [Figure: fused adder (16 to 1) fed by lane 0, lane 1, ..., lane 14, lane 15]
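
A 16-lane sum of absolute differences could be modeled as below: the lane operations run in parallel and a single fused 16-to-1 add collapses them. The function name and the choice of an int32 result are assumptions for this sketch.

```matlab
function z = sumabsdif16(x, y)
% SUMABSDIF16  16-lane sum of absolute differences (hypothetical name).
% x, y: int16 vectors of 16 elements; the result is promoted to int32.
    d = abs(int32(x) - int32(y));    % 16 lanes in parallel, no overflow after promotion
    z = sum(d, 'native');            % 16-to-1 fused add, kept in int32
end
```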
WK3 - FUSED ADDERS (description)
  - What? 16 to 1 (add161), 8 to 1 (add81), 4 to 1 (add41); also a 3-input adder (add31).
  - Example: z = add161(x)
  [Figure: fused adder (16 to 1) with input X, from X(1) to X(16), and output Z; indices are Matlab indices]
WK3 - FUSED ADDERS (design & issues)
  - Input variables: a vector or separate variables?
      add161, add81: vector
      add3, add41: separate
  - Control parameters? Not needed in basic units.
  - Issues with Matlab: int16, uint16, int32, etc. use saturated arithmetic, which means less flexibility in design:
      z = (x0+x1) + (x2+x3) is different from z = (x0+x2) + (x1+x3)
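
The order dependence is easy to reproduce with values chosen so that one pairing saturates and the other does not. The operand values and the anonymous `add41` below are illustrative only.

```matlab
% Saturating int16 adds are not associative: the pairing changes the result.
x0 = int16(30000);   x1 = int16(10000);
x2 = int16(-30000);  x3 = int16(-10000);

z_a = (x0 + x1) + (x2 + x3);     % 32767 + (-32768) = -1
z_b = (x0 + x2) + (x1 + x3);     % 0 + 0 = 0

% Promoting every input first makes the fused sum order-independent.
add41 = @(a, b, c, d) (int32(a) + int32(b)) + (int32(c) + int32(d));
z_p = add41(x0, x1, x2, x3);     % 0
```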
WK3 - FUSED ADDERS (list of files)
  - Functions: add3.m, add161.m, add81.m, add41.m
  - Testbench files: run.m, mmain.m, test_manager.m, buffer_manager.m, gene_manager.m, dp_manager.m
WORKSHOP 4 - ACCUMARRAY
  - Why? Taken as an example of an extensive Matlab function.
  - Example: 16-lane accumarray.
  - Workshop target: exploring several examples of fine-grain parallel implementations, from 2 to 16 lanes.
  [Figure: fused adder (16 to 1) fed by lane 0, lane 1, ..., lane 14, lane 15]
WK4 - ACCUMARRAY (description)
  - What? Simple 1-lane structure (accumar1), 2-lane structure (accumar2), 4-lane structure (accumar4), 16-lane structure (accumar16).
  - Example:
      z = accumarray(idx,x);               MATLAB function
      z = accumar16(idx,x,xlength,NACC)
  - X and Z = vectors of length L; L is a multiple of 16.
  - Idx = index vector of length L.
  - NACC = Number of ACCumulators, typically 10.
  - Xlength = length of x.
  [Figure: accumar16 block with inputs X, idx, xlength, NACC and output Z]
WK4 - ACCUMARRAY (design & issues)
  - Input variables: always a vector; the length must be a multiple of the number of lanes.
  - Control parameters? Not always necessary.
      NACC: a static parameter.
      xlength: simplifies the caller's job, but it means that the unit must have a sequencer to do the for loop.
  - Issues with Matlab: the intXX adder uses saturated arithmetic.
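
A possible multi-lane structure is sketched below: each lane keeps a private bank of NACC accumulators, an internal sequencer walks through the input in blocks of 16, and the banks are merged at the end. This is an assumption about the architecture, not the original accumar16.m; in particular the output is taken to be an NACC x 1 vector of sums, as `accumarray` would return, and double data is used to sidestep saturation.

```matlab
function z = accumar16(idx, x, xlength, NACC)
% ACCUMAR16  16-lane accumarray sketch (the original accumar16.m is not shown).
% idx, x: vectors of length xlength (a multiple of 16); z: NACC x 1 sums.
    NLANES   = 16;
    lane_acc = zeros(NACC, NLANES);          % one private accumulator bank per lane
    for n = 0:NLANES:xlength-1               % sequencer: one 16-sample block per step
        for lane = 1:NLANES                  % lanes are independent (parallel in hardware)
            k = n + lane;
            lane_acc(idx(k), lane) = lane_acc(idx(k), lane) + x(k);
        end
    end
    z = sum(lane_acc, 2);                    % merge the banks: 16-to-1 add per accumulator
end
```

Private banks avoid two lanes updating the same accumulator in the same step, at the cost of 16 times the accumulator storage and a final merge.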
WK4 - ACCUMARRAY (list of files)
  - Unit files: accumar1.m, accumar2.m (2 implementations), accumar4.m, accumar16.m
  - Testbench files: run.m, mmain.m, test_manager (evolution: test_model1), gene_manager.m, dp_manager.m (evolution: dp_manager4, dp_manager3, dp_manager2)
WK4 - ACCUMARRAY (test bench evolution)
  - Model 0: run the example given by Matlab on the command line.
  - Model 1 = Model 0 + several examples + test_manager + some functional optimization + keep the design simple: 2-lane, 4-lane, 16-lane (m-lane).
  - Model 2 = Model 1 + structured test bench (based on script, several separate files) + gene_manager + golden model and check_manager + dp_manager (some datapath optimization).
  - Model 3 = Model 2 + FXP data + check done with = plus a range value, not bit exact.
  - Model X = Model 2 +++ mapping accumar to a vector processing unit.
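
A range-based check, as opposed to a bit-exact one, might be sketched as follows. The function name `check_range` and its signature are hypothetical; the variable names `zgolden`, `maxerr` and the notion of a bound are borrowed from the testbench naming conventions.

```matlab
function [ok, maxerr] = check_range(z, zgolden, bound)
% CHECK_RANGE  Compare a fixed-point result with a floating-point golden model.
% Hypothetical checker: passes when the worst-case error stays within bound.
    err    = abs(double(z(:)) - double(zgolden(:)));
    maxerr = max(err);
    ok     = maxerr <= bound;
end
```

Setting `bound` to 0 turns the same checker back into a bit-exact comparison.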
WK7 - VPU
  - What? 40-lane structure (vneg40).
  - X and Z = vectors of length L; L is a multiple of 40.
  - Xlength = length of x.
  [Figure: vneg40 block with inputs X, xlength, xstride and output Z; buffers bufX (src@) and bufZ (dst@), whose addresses are hidden in Matlab]
WK7 - VPU (design & issues)
  - Shape: the input vector is cut into blocks; the right equation is needed to compute the length of each block.
  - How many parameters?
  - Shall we keep it typeless?
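
The block-length computation can be sketched with an internal sequencer that also tolerates a partial last block. This is a guess at the structure of vneg40, with `xstride` left out for brevity; the original file is not shown.

```matlab
function z = vneg40(x, xlength)
% VNEG40  40-lane negate with an internal sequencer (sketch; xstride is omitted).
    NLANES  = 40;
    z       = x;                                    % same type and shape as x
    nblocks = ceil(xlength / NLANES);               % the last block may be partial
    for b = 0:nblocks-1
        first = b * NLANES + 1;                     % MATLAB (1-based) start index
        len   = min(NLANES, xlength - b * NLANES);  % length of this block
        k     = first : first + len - 1;
        z(k)  = -x(k);                              % one 40-lane operation
    end
end
```

For `xlength = 100` the sequencer issues three operations of 40, 40 and 20 lanes.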
WORKSHOP 10 - FILTER functions
  - Why? Filters are the most common basic building blocks in DSP; benchmarks are based on filters.
  - Workshop targets: a standard methodology for all filters; architecture issues; a working simple block; a mega block (pmavf256).
WK10 - FILTER functions (description)
  - What? Filter building blocks.
  - Examples: iir1, pmavf, pmavf256, lms, nlms
WK10 - FILTER functions (design & issues)
  - Defining and classifying filter building blocks: fir, cplxfir, lms, iir, biquad, lattice, acf.
  - Defining and classifying filter characteristics: block, stream; number of states (or taps); all parameters? flexibility.
  - Defining variable names: input = bufin; states = x, w; output = z, bufout.
  - Is memory inside or outside the datapath? Outside is the better idea.
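
Keeping memory outside the datapath means the state goes in and comes back out as an argument, so the caller owns it between blocks. The first-order structure, the coefficient names `a` and `b`, and the signature below are assumptions; the original iir1.m is not shown.

```matlab
function [bufout, w] = iir1(bufin, w, a, b)
% IIR1  First-order IIR sketch: z(n) = b*x(n) + a*z(n-1).
% Assumed structure and coefficient names; the state w is passed in and
% returned so that memory stays outside the datapath.
    bufout = zeros(size(bufin));
    for n = 1:numel(bufin)
        w = b * bufin(n) + a * w;    % update the single state
        bufout(n) = w;               % sample in, sample out
    end
end
```

Because the state is external, processing a stream in two blocks gives the same output as processing it in one, provided the returned `w` is fed to the next call.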
WK10 - FILTER functions (list of files)
  - IIR1 project
      Unit files: iir1.m, iir1_q15.m
      Testbench files: run.m, mmain.m, test_manager.m, gene_manager.m, dp_manager.m
  - PMAVF project
      Unit files: pmavf.m, pmavfsec_q15.m, pmavf256_q15.m
      Testbench files: run.m, mmain.m, test_manager.m, gene_manager.m, dp_manager.m
  - LMS project
      Unit files: lms1.m, lms1_q13.m, lmsVVl.m
      Testbench files: run.m, mmain.m, test_manager, gene_manager.m, dp_manager.m, print_manager
  - NLMS project
      Unit files: nlms.m; lms units to be used as comparison
      Testbench files: run.m, mmain.m, test_manager, gene_manager.m, dp_manager.m, print_manager, dp_manager1.m, dp_manager4.m ???
WK10 - FILTER functions (further work)
  - IIR1 project
      Unit files: from 3-port to 2-port to 1-port units (see single page doc)
      Testbench files: n/a
  - PMAVF project
      Unit files: optimised pmavf256_q15.m using spuds graph (see single page doc)
      Testbench files: n/a
  - LMS project
      Unit files: ??
      Testbench files: n/a
  - NLMS project
      Unit files: get rid of the lms units used as comparison
      Testbench files: dp_manager1.m, dp_manager4.m, what the heck???
WK11 - MATLAB functions
  - What? Matlab functions.
  - Issues:
      Black box: understanding the usage of functions; run examples. How much compatibility? Types of i/o (matrix, complex); all parameters? flexibility.
      Internal box: how to get the M-code? Write your own code; look for libraries (C, Octave, Scilab).
WK11 - EQUALIZER
  - Why? A good example of applying a DSP function. Seen in: NLMS tips and tricks (sigMag proc.), optical receiver.
  - What?
      NLMS: from equations to Matlab code.
      Matlab functions: architecting; defining an equalizer (normlms, lineareq); running an equalizer.
      My former work (equalTDMA, equal): from Matlab code to FXP code.
      Others.
  [Figure: equalizer block with inputs X and Y, output Z, parameters nbr of taps and adaptfir type; buffers bufX, bufY (src@) and bufZ (dst@)]
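
Going "from equations to Matlab code" for NLMS could start from the textbook update, one sample at a time. The function below is a generic real-valued sketch with a hypothetical name; it is not the normlms function mentioned above, nor the author's equalizer code.

```matlab
function [z, e, w] = nlms_step(x, d, w, mu)
% NLMS_STEP  One normalized-LMS update (textbook form, hypothetical name).
% x: column vector of the latest tap inputs (real), d: desired sample,
% w: column vector of tap weights, mu: step size.
    z = w' * x;                               % filter output
    e = d - z;                                % error against the desired sample
    w = w + (mu / (eps + x' * x)) * e * x;    % step normalized by input energy
end
```

The `eps` term only guards against division by zero when the tap inputs are all zero; a fixed-point version would need an explicit choice for that regularization and for the division.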
