22
CHAPTER 2
         HARDWARE/SOFTWARE CODESIGN USING
                        BINARY PARTITIONING
          Hardware-Software partitioning is the key issue in the codesign of
embedded systems. The behavioral modeling is transformed into software
binaries which need to be partitioned for hardware and software. The
partitioned software binary for hardware should be decompiled to hardware
description language. A partitioner and decompiler are needed to be designed
in binary level approach. However, to obtain good result in binary level
approach the decompiler design should be an optimized one. The buffer size
has to be estimated in binary level and source level approach to obtain the
communication cost between hardware and software. An estimator is needed
to calculate the number of hardware required for the control construct of the
partitioned hardware.
2.1       PARTITIONING AT BINARY LEVEL
          The hardware replacing the software portion must produce the same
transformation, as the original pure software implementation, to ensure that
the functionality does not change. Partitioning at the binary level makes the
method suitable for dynamic on-the-fly partitioning of software into
hardware. The basic blocks of software binaries are transformed into dataflow
descriptions for implementation of the partitioned software in hardware. As a
first step, the portion of the software binary to be transformed into hardware is
identified using instruction level profiling. The partitioned software (to be
                                                                               23
transformed into hardware) is represented in terms of initial and final state
which is described by set of register value pairs. A CDFG representing the
initial state to final state transformation is derived by equating the final state
attained in terms of algebraic place holders which is then synthesized in
hardware.
2.2         DESIGN FLOW FOR CODESIGN SYSTEM
            The design process would involve the following:
                 system specification
                 optimization
                 system partitioning
                 decompilation
                 hardware estimation of control construct
                 system on chip
2.2.1       System Specification
            The system design process starts with a specification of the system.
The requirements of the system are typically specified in a non-formal
language. However, there have been several attempts to formalize system
specification to help in automation of the various steps that follow.
ESTEREL, for example, is a Finite State Machine based specification
language and is supported by codesign tools like POLIS. System specification
also incorporates Timing, Area and Power constraints.
2.2.2       Optimization
            The approach incorporates both data flow and control optimizations
performed on a suitable intermediate design task representation. The aim is
                                                                             24
not only to enhance productivity of the designer and system developer, but,
also to improve the quality of the final synthesis outcome. The present need in
codesign to move from synthesis-based technology to compiler-based
technology is pointed out. The data flow and control optimizations at a higher
abstraction level can lead to significant size and performance improvements
in both the synthesized hardware and software. It is recognized that guided
optimization can be applied on the internal design representation, independent
of the abstraction level, and need not be restricted to the final stages of
software assembly code generation or hardware synthesis. Function/
Architecture Optimization and Codesign of Embedded Systems will be of
primary interest to researchers, developers and professionals in the field of
embedded systems design.
2.2.3      System Partitioning
           Partitioning refers to the technique of identifying which part of the
system will be implemented in hardware (as an ASIC) and which part in
software. It involves identifying the behavioral description for the different
parts of the system. The hardware part, for example, might be described using
VHDL or Verilog and the software model using the C language. In addition,
the interfacing logic, including any handshaking or bus logic, is also decided
at this time.
2.2.3.1    Source-level Partitioning
           In Source-level partitioning as outlined in Figure 2.1, the
partitioning is done by taking the source code as the input. Hence, the
decision as to which part of software is to be transformed into hardware is
taken at the source-code level. The Berkeley BRASS project, the NAPA C
compiler, the Nimble compiler, Proceler’s product for compiling C code to a
processor and an FPGA coprocessor using profiling information to detect and
                                                                          25
move critical regions of software to the coprocessor platform are some of the
efforts that have gone into reconfigurable computing with partitioning at the
source-code level.
                                          S/W SOURCE
                                            COMPILER
                                           FRONT-END
                                           H/W S/W
                                         PARTITIONING
                               COMPILER                     HDL
                               BACKEND                    SOURCE
      ASSEMBLY                ASSEMBLER
         AND                       &                    SYNTHESIS
       OBJECT                   LINKER
        FILES
                                BINARIES                 NETLISTS
             Figure 2.1 Source-based Partitioning approach
2.2.3.2   Binary-level Partitioning
          Partitioning   an   embedded     system   application   among    a
microprocessor and custom hardware has been shown to improve the
performance. The advent of single-chip microprocessor/FPGA platforms
                                                                          26
makes such partitioning even more beneficial. Although source code
partitioning is preferable from a purely technical viewpoint, binary-level
partitioning provides several very practical benefits for commercial use. The
methodology for binary partitioning is represented in Figure 2.2.
                                         SW SOURCE
                                       COMPILATION
               ASSEMBLY
                  AND                  ASSEMBLER &
                OBJECT                    LINKER
                 FILES
                                          BINARIES
                                          H/W S/W
                                        PARTITIONING
                                                 HW SOURCE
                                                  SYNTHESIS
                                                   NETLISTS
                               BINARIES
              Figure 2.2 Binary based partitioning approach
                                                                             27
2.2.3.3   Dynamic Partitioning
          Another trend is towards dynamic reconfigurability of the
Microprocessor/Coprocessor platform to achieve on-the-fly partitioning of the
software into critical regions that need to be placed on the Coprocessor.
System architecture includes on-chip tools for profiling, decompilation,
synthesis, and placement and routing, for a simplified configurable logic
fabric, to be able to perform dynamic partitioning of real benchmarks.
Bingfeng Mei et al (2000) described a hardware-software partitioning and
scheduling approach for Dynamically Reconfigurable Embedded Systems
using partial reconfiguration approaches based on genetic algorithm and list
scheduling.
2.2.4     Decompilation
          In software oriented partitioning, the section of the code partitioned
for hardware, has to be transformed into a synthesizable RTL model and
subsequently realize as hardware. In hardware oriented partitioning the non-
critical regions are moved to software and critical regions are moved to
hardware which needs to be decompiled to software. In binary partitioning the
critical kernel is moved to the hardware which needs to be decompiled to
synthesizable model and other instructions are moved to the software.
2.2.5     Hardware Estimation for Control Construct
          The control logic for the partitioned hardware is estimated for the
hardware that is partitioned in source level and binary level languages using
behavioral network graph.
2.2.6     System on a Chip (SoC)
          The functional level model is an un-timed model and composed of
function calls. The transaction level model is a timed model, and interactions
                                                                          28
between models are through signals and events. At this level of modeling, the
architecture and algorithms are verified. Any performance issues and
bottlenecks are studied and simulated. Once the architecture and algorithms
are verified, the next step is to determine which part of the system is to be
implemented in hardware and which part goes into software. This process is
called hardware/software partitioning. The software portion runs as embedded
software on the general-purpose microprocessor and the hardware portion is
implemented as an embedded ASIC core as shown in Figure 2.3.
                            FUNCTIONAL LEVEL
                           TRANSACTION LEVEL
                           H/W S/W PARTITIONING
              EMBEDDED SOFTWARE                BEHAVIORAL MODEL
                                        BEHAVIORAL SYNTHESIS
                                                  RTL   MODEL
                                           RTL AND LOGIC SYNTHESIS
                                                  GATE NETLIST
                    Figure 2.3 Typical SoC design flow
          The hardware-software partitioning approach obtains a significant
improvement in performance of the embedded application, compared to the
                                                                               29
pure software implementation approach while using less hardware resources
compared to pure hardware implementation.
2.3       MODELING TOOLS FOR SOC DESIGN PARADIGM
          Small Device C Compiler (SDCC) is a retargettable, optimizing
ANSI - C compiler. Some of the features include:
                a host of standard optimizations such as global sub expression
                elimination, loop optimizations (loop invariant, strength
                reduction of induction variables and loop reversing ), constant
                folding and propagation, copy propagation, dead code
                elimination and jump tables for 'switch' statements.
                MCU specific optimizations, including a global register
                allocator.
                a full range of data types: char (8 bits, 1 byte), short (16 bits,
                2 bytes), int (16 bits, 2 bytes), long (32 bit, 4 bytes) and float
                (4 byte IEEE).
                the ability to add inline assembler code anywhere in a
                function.
          GPSIM is a full-featured software simulator for Microchip PIC
microcontrollers. The standard simulation paradigm including breakpoints,
single stepping, disassembling, memory inspect and change. In addition,
GPSIM supports many debugging features that are generally available only
with in-circuit emulators. For example, a continuous trace buffer tracks every
action of the simulator. Also, it's possible to set read and write break points
even when a specific value is read from or written to a register.
                                                                                  30
          ModelSim provides a comprehensive simulation and debug
environment for complex ASIC and FPGA designs. Support is provided for
multiple languages including Verilog, SystemVerilog, VHDL and SystemC.
The Virtex-II Pro, Virtex-4, and Virtex-5 FPGA families from Xilinx are of
particular interest to system-on-a-chip (SoC) designers. Since, they can
include upto two embedded IBM PowerPC cores. These designs are capable
of running a regular embedded OS (such as Linux or vxWorks) and
interfacing with custom logic implemented in the surrounding fabric.
2.4       DESIGN METHODOLOGY FOR CODESIGN SYSTEM
          The high level language is transformed into binary form by the
assembler and linker of SDCC compiler. The generated binary is profiled
using GPSIM compiler by passing suitable test vectors. The critical kernel of
the profiled binary is identified and partitioned. The control construct of the
partitioned software module that has to be transformed into hardware is
estimated and decompiled into HDL . These steps are shown in Figure 2.4.
                                      SW SOURCE
                                     COMPILATION
       ASSEMBLY
          AND
        OBJECT                    ASSEMBLER AND LINKER
         FILES
                                       BINARIES
                                  H/W S/W PARTITIONING
                       BINARIES                            CDFG
                                                          BNG     DECOMPILATION
                  IP core
                                                     ESTIMATION     HW
                                               Partitioned HW
           Figure 2.4 Design methodology for Codesign System
                                                                      31
2.5   ISSUES IN HARDWARE/SOFTWARE CODESIGN
         The POLIS system (Balarin et al 1997) is intended only for
         control dominated systems whose implementation is based on
         a microcontroller for tasks to be implemented in software and
         ASIC for tasks to be implemented in hardware. The proposed
         work is applicable for both control dominated and data
         dominated systems which is converted into binary. The binary
         is partitioned into software and hardware module. The
         partitioned binary for software implementation is based on
         microcontroller and partitioned binary for hardware is
         optimized and also decompiled into HDL.
         Hardware/ Software Partitioning is manual in POLIS
         approach. The proposed work concentrates Binary level
         partitioning,   Source    level    partitioning   and   Flexible
         partitioning. A partitioning algorithm is proposed to partition
         the critical kernel of the software binaries.
         POLIS supports languages like ESTEREL, graphical FSMs,
         subsets of Verilog or VHDL. In the proposed work binary
         level partitioning is performed in binaries so it is independent
         of high level language. In source level partitioning and
         flexible partitioning it supports C language.
         HW-SW partitioning is based heavily on design experience
         and is very difficult to automate in POLIS. In the proposed
         work the structure of partitioning algorithm for binary level
         partitioning is included in Appendix 1.
         For software synthesis the CFSMs implementation and
         optimization of the desired behavior in a high-level,
         processor-independent representation of the decision process
                                                                              32
                similar to a control/data flow graph in POLIS. In the proposed
                work the intermediate representation is control data flow
                graph.
                The final implementation of the system should be made using
                automatic synthesis as much as possible from this high level
                abstraction, to ensure implementation that is “correct by
                construction”. Validation should be done at the highest
                possible levels of abstraction is made as in POLIS.
2.6       PROFILING AND PARTITIONING
          Hardware/software partitioning is the process of dividing an
application into software running on a microprocessor and hardware
co-processors. On the pessimistic side, Hardware/software partitioning has
limited commercial success partly due to the tool flow problems. These are
primarily due to the following reasons: First, a designer must use an
appropriate profiler to detect regions that contribute to a large percentage of
program execution. Second, a designer must use a compiler with partitioning
capabilities to partition the software source. Such compilers are rare and often
resisted. Third, the designer must apply a synthesis tool to convert the
partitioning   compiler’s       hardware   description   output   to   an   FPGA
configuration. A tool flow requiring integration of profilers, special compilers
and synthesis tools is far more complicated than that of typical software
design, requiring extra designer’s effort that most designers and companies
are not willing to carry out.
          However, on the optimistic side, partitioning is a well-known
technique that can achieve results superior to software-only solutions. The
appearance of single-chip platforms incorporating a microprocessor and
FPGA (field-programmable gate array) on a single chip has recently made
                                                                                 33
hardware/software partitioning even more attractive. Such platforms yield
more efficient communication between the microprocessor and FPGA than
two chip designs, resulting in improved performance and reduced power. In
fact, such single chip platforms encourage partitioning by designers who
might have otherwise created a software-only design. By treating the FPGA
as an extension of the microprocessor, a designer can move critical software
regions from the microprocessor onto FPGA hardware, resulting in improved
performance and reduced energy consumption.
2.6.1      SDCC Compiler
           The RISC architecture has been chosen as the platform for
software. The software is profiled at the instruction level (to identify
bottlenecks in the binary) and run for a fixed number of cycles, from which
the average clock cycles consumed by an instruction is calculated. To
generate the binary a retargettable ANSI-C compiler called SDCC compiler is
used. The compiler does standard optimization such as global sub expression
elimination, loop optimizations (loop invariant, strength reduction of
induction variables and loop reversing), constant folding and propagation,
dead code elimination, jump table for ‘switch’ statements and also specific
optimization like global register allocation. The generated output to be
profiled by passing test vectors, is performed in GPSIM based on the
microchip PIC 18F452 device.
           The compiler accepts the behavioral description with a full range of
data types: char (8 bits, 1 byte), short (16 bits, 2 bytes), int (16 bits, 2 bytes),
long (32 bit, 4 bytes) and float (4 byte IEEE) and has the ability to add inline
assembler code anywhere in a function. The compiler accepts C-language as
input and generates the “cod” file as output. A binary that is partitioned for
implementation in hardware is decompiled using data flow extraction by
creating equivalent HDL source for the basic binary. The presented technique
                                                                            34
is generic for all processors. If the number of clock pulses consumed by an
instruction set exceeds the average by a certain value then that instruction
indicates the presence of a potential candidate block for pure hardware
implementation.
2.6.2     Partitioning Algorithm
          An instruction block is selected for hardware partitioning when the
average clock cycle consumption per instruction in the block exceeds the
average clock cycle consumption per instruction in the whole program. The
scattered instructions that may logically belong to the block, both in terms of
control flow as well as cycle consumption but are separated from the block by
instructions having lower cycle consumption: These scattered instructions can
occur in case of jumps. The source code for partitioning algorithm is
represented in Appendix 1 which partitions the set of instructions that
consume maximum cycles.
2.6.3     Decompilation
          The behavioral model of the high level language is converted into
software binaries and optimization is performed for the partitioned software
binary. The behavioral model is transformed into assembly instruction for
target architecture PIC 18F452 and profiled using GPSIM. The critical kernel
is identified from software binaries using partitioning algorithm of
section 2.3.2. The decompilation is performed using MATLAB based
decompiler. The assembly instruction is converted into control data flow
graph and optimized. An important step in achieving some of these
optimizations involves the recognition of high-level language constructs such
as loops and arrays. Most profilers specifically determine the execution time
of loops by monitoring branch instructions that have a small negative target.
If a loop is completely unrolled by decompiler, then these branch instructions
do not appear in the software binary.
                                                                              35
          Stitt and Frank Vahid (2003) presented a decompiler based on
register transfers for each assembly instruction in software binary which is
then converted into control flow graph to determine basic blocks. A dynamic
jump instruction is excluded from the region of code from hardware
implementation. A data flow graph is generated for each of the basic block
using “definition-use” and “use definition” analysis on the register transfer
lists to connect the tree for each register transfer into a full data flow graph.
A majority of the optimization is performed using behavioral synthesis tool.
Few other optimizations are performed on decompiler.
          The hardware partition along with the software glue code, has to
realize precisely the effect of the software instructions that are going to be
replaced by them, in a partitioned system. Stitt and Vahid (2005) discusses a
decompilation technique for software binaries partitioned for hardware. This
hardware partition must maximize performance and minimize delay and
buffer requirements as the cost function.
          Mittal (2004) discusses translation of binaries to FPGA hardware.
For each instruction processed in the partitioning algorithm, a jump
instruction is checked and lists of all such jump instructions are created. The
reported additional time complexity is O(j*p) where ‘j’ represents the number
of jumps encountered and ‘p’ the number of partitions gathered. Similarly,
space complexity is in the order of O(j). To decompile the binary a dataflow
extraction technique is proposed. The system state representation is shown in
Figure 2.5. The variable ‘c’ and ‘d’ are literals, whereas ‘a’ and ‘b’ are
algebraic placeholders for the registers 0 and 1.
                                                                             36
                    MOV R(0), R(1)
                    ADD C, R(1)
                    MUL R(1),d
                    Initial state S1 : { (0,a), (1,b) }
                    Final state S2 : {(0,a), (1, (a+c)*d) }
                         Sequence of software
                         instruction
               S1                                                   S2
              Initial state S1 : { (i, value(i))| For every register I
              in the system }
              Final state Sn : {(i, value(i)) | for every register I in
              the system}
                             Sequence of software
                             instruction
                 S1                                                   S2
                    Figure 2.5 System State representation
2.7       COMPLEXITY ANALYSIS OF PARTITIONING
          The algorithm makes a single pass of the whole program and
compares the cutoff value with the execution cycles consumed by an
instruction. If ‘n’ is the total number of instructions in the program, the
algorithm’s time complexity is O(n). Similarly, space complexity of the
algorithm can be calculated. The algorithm at a time processes only one
instruction and holds only one instruction in memory if the list of instructions
is maintained in a file. The number of partitions as a worst case is ‘n/2’ since
there are n/2 sections of time consuming code intertwined with n/2 sections of
less time consuming code. Hence, the space complexity, can be O(1) as a best
case, when partitions selected are few, and O(n) as a worst case, when
partitions selected are numerous.
                                                                                37
2.8       DATA FLOW EXTRACTION
          The partitioned instruction is transformed into control data flow
graph which is then converted into HDL source. By decoding the instruction
sequence the initial state and the final state are separated which is needed for
the realization. The algebraic variables are assigned to all the registers in
instruction sequence. For a non-control-flow instruction, there would be one
or more source operands, one or more destination operands and an operation
that maps source operands to destination operands. The destination operands
have been changed by the operation in terms of the source operands resulting
in algebraic expression. The original value of the destination operands in the
initial-state set varies with the final-state set values according to the algebraic
expression. Then the next instruction’s initial-state set is the final-state set of
the previous instruction.
          Initial-state set can suffice to contain only the register-value pairs
whose values are used as source operands in the instruction. Final-state set
can have only those register value pairs whose values are modified from the
original-state set. The registers that are neither used as source operand or
destination operand in register value pairs from the initial-state and final-state
sets are omitted. With an already populated initial-state and final-state set, the
processing of an instruction requires scanning the final-state set for the value
of the source operand. If the lookup is a hit, the value is propagated to the
expression for the destination operand. If the lookup is a miss, a scan for the
same is done in the initial-state set and the value is propagated. If this lookup
also fails, it means that this register is neither a source nor destination. For
this case, a register is added to the initial-state set and assigned a new
algebraic value. This value is propagated to the destination register’s value
expression.
                                                                                38
            At the start of processing of a partition block, no instruction would
have yet been processed, which means the initial-state and final-state sets
would be empty to start with, and get populated as and when more
instructions are processed. In effect, the initial-state and final-state sets are
incrementally updated to reflect the effect of execution upto the instruction in
question from the beginning of the block. This kind of mapping from initial
state to final state is straightforward for a basic block. When loops are
involved, this mapping method quickly degenerates leading to complex,
computationally expensive expressions. So, loops are unrolled using array
function.
2.9         CDFG GENERATION
            Recurrence relations can be used to represent repetitive
modification of registers. Instead, the proposed algorithm envisions splitting
the partition block into basic blocks alternating with control nodes. Within
each basic block the initial state and final state can be defined. In turn, a
CDFG with basic blocks representing data flow alternating with control nodes
is obtained. The algorithm is based on exploring different distinct paths of
execution. The algorithm works by scanning the instructions in sequence, and
creating control nodes and stacking the different possible destination
instruction locations for future exploration of the associated control paths,
whenever a branch instruction (except goto, since it doesn’t determine control
flow at runtime as in a multiple choice scenario) is encountered. A stacked
destination location is not processed if it has been already encountered in
another branch instruction. This effectively avoids duplication of a basic
block. When the boundary of the partition is hit or a branching instruction is
hit, the scan stops signifying the end of a basic block. Control nodes are then
created and the stack is popped for scan of the next basic block. For each
scan, an initial state set and a final state set are created. When creating control
                                                                           39
nodes, the algorithm checks against already created control nodes to avoid
duplication of nodes. The algorithm works by tracing of execution paths. The
structure of CF generator with register allocation is given in Appendix 2 of
the thesis. Similarly Copy propagation and Dead code elimination is in
Appendix 3 of the thesis . Copy propagation is used to create the dead code(
i.e. when (regw,a) is preceded by (reg0,w+a), it is modified to (reg0,a+a). So
(regw,a) can be removed inorder to reduce the space complexity. Both CF
Generator and Decompiler are designed by using growing array. Hence, it can
be scaled up to any number of instructions. As an illustration, a code snippet
is shown in Figure 2.6 in pseudocode form.
           add reg-b to reg-a                   // basic block 1
           // some more processing
           if (reg-a<> 0) goto block (3) // control node1
           add reg-c to reg-b                   // basic block 2
           // some more processing
           goto block(4)
           add reg-c to reg-a                   // basic block 3
           // some more processing
           add reg-d to reg-b                   // basic block 4
           // some more processing
                     Figure 2.6 Example code snippet
         The control flow can be visualized as shown in Figure 2.7.
                                                                                                                  40
                                         Basic Block 1
                                                               True
                                              a<>0
                                                                               Sequence of
                                    False                                      processing
                                                                               instructions
                                         Basic Block 2
                                         Basic Block 3
                                         Basic Block 4
                                             node
        Figure 2.7 Control flow and Block visualization in memory
        The application of the dataflow extraction algorithm results in the sets
of basic blocks and control nodes as shown in Figure 2.8.
                        Basic Block 1                      Basic block 1 for
                                                            derived CDFG
                                                    True
                                 a<>0
                         False
                         Basic Block 2                      Basic Block 3
    Basic block 2 for                                                                         Basic block 3 for
    derived CDFG                                                                               derived CDFG
                         Basic Block 4                      Basic Block 4
                                                 node
                  Figure 2.8 CDFG generated for the code snippet
                                                                               41
          The basic block 4 in the original code snippet has been duplicated.
The reason is that “goto” instruction is not considered as a branch instruction
in the proposed paradigm, since the destination location of the branch is fixed
and only the distinct paths of execution are traced. An alternative would be to
consider “goto” as any other branch instruction and create another block in
CDFG for block 4.
          However, separation of block 3 from block 4 can be identified only
if we check for every instruction that is processed as to whether it is a starting
location of basic block 4. The rule of thumb for creating the initial and final
state sets can be stated as follows:
          1.    Initialize the sets to null set.
          2.    For every instruction scanned, if source operand is listed in the
                final-state set, use the corresponding value listed in the final-
                state set for the destination operand’ value expression.
          3.    If step 2 fails and source operand is listed in the initial state
                set, use the corresponding value for the destination operand’s
                expression.
          4.    If step 3 also fails, add source operand to the initial-state set
                with a distinct algebraic value, and use this value for
                destination operand’s express
          5.    Add or update the destination operand with the expression to
                the final-state set as required.
          A minor constraint for this algorithm to work is that, there should
be no jumps from other part of the code into the middle of the partition block.
This check can easily be incorporated in the partitioning algorithm itself with
a minor cost in terms of time complexity. For each instruction processed in
the partitioning algorithm, a jump is checked and a list of all such jump
                                                                                  42
instructions is made. The algorithm for dataflow extraction is given in
pseudocode format in Appendix 4. For a dereferences address in an
instruction, the address register is placed in the initial-state set. They represent
the destination/source register in the initial/final state sets by a dereferencing
operator combined with the address register’s value.
2.10       OPTIMIZATION OF CDFG
           The optimizations steps to be applied for the CDFG once it is
ready, are listed as follows:
           1.   For a basic block, if the registers in the initial-state set are also
                listed in the final-state set and if the value is the same in both
                (can happen if registers are shadowed and used as scratchpads
                temporarily), the entry can be removed from the final-state
                list.
           2.   After processing by step 1, if the values in the initial-state set
                are not used in the final-state set, these can be removed from
                the initial-state set as well.
           3.   Identify the section of the graph that has only forward jumps
                and no jumps from the remaining part of the graph into the
                middle of the section. Then, the initial-state sets and final-state
                sets of all the basic blocks in that section can be combined
                together, and the intermediate control nodes can be removed.
                This transformation is termed as the Merge operation. The
                destination operand element in the combined final-state set
                can have multiple values, each corresponding to a basic block,
                that are tagged with a conditional expression corresponding to
                the control nodes that lead to the execution of the basic block.
                The conditional expressions for these values of an operand are
                                                                                             43
                  mutually exclusive, meaning that, only one of the values is
                  selected at any instant of execution in the CDFG, depending
                  on the evaluation of the expressions.
                  As an illustration for demonstrating the last step stated,
                  Figure 2.9 shows the result of merging a basic block (B1), a
                  control node (C1) following the basic block, and the two basic
                  blocks (B2,B3) that represent the two paths that a control node
                  can lead to. This is the basic structure for a merge operation.
                                     Basic Block1
                                     Initial =
                                     {(reg1,x),(reg2,y),
                                     Final =
                                     ((reg3,x+y),(reg9,z))
                        True                                 False
                                           reg7   0
                                                             Basic Block 3
Basic Block 2
                                                             Initial ={(reg3,a), (reg1,b),
Initial = {(reg8,e)}                                         (reg6,c)}
Final =                                                      Final = {(reg4,a),
{(reg4,e),(reg9,2e)),
                                                             (reg5,a+b+c)}
         (reg2,e}
                    Basic Block 1
                    Initial ={(reg1,x), (reg2,y), (reg6,c), (reg7,d),
                    (reg8,e), (reg5,f)}
                    Final = {(reg3,x+y),(reg4,{e|d 0, x+y| if d =0}),
                    (reg5 {2x+y+c | if d=0, f | if d 0}),
                    (reg9, {z| if d=0, 2e| if d 0))
                    (reg2, {e| if d 0, y | if d=0})}
                Figure 2.9 Merge Operations for Basic Blocks
                                                                                 44
          All constructs having forward jumps only and no jumps from other
part of the graph into them would be contained by recursively applying this
model structure.
          4.   Loops are unrolled using array function.
          5.   Linear scan algorithm is used for register allocation
               optimization.
          6.   Loops rerolling is performed to calculate the buffer size.
2.10.1    Merge Operation for Basic Blocks
          The detailed steps for executing the Merge operation of the Basic
Blocks as given in e Figure 2.9 are:
          1.   Propagate values for registers in the set B1.final         B2.initial
               from B1.final to B2.initial to B2.final.
          2.   Set B2.initial = B2.initial - B1.final     B2.initial.
          3.   Propagate values for registers in the set B1.initial       B2.initial
               from B1.initial to B2.initial to B2.final.
          4.   Set B2.initial = B2.initial - B1.initial     B2.initial.
          5.   Set B1.initial = B1.initial U B2.initial. Discard B2.initial (as it
               is not needed further).
          6.   Apply the same steps for B3 as had been done for B2.
          7.   Propagate value from B1.initial to control node C1’s
               conditional expression. Add an element to B1.initial if
               necessary.
          8.   For registers in B3.final     B2.final, assign both values (listed
               in B2.final, B3.final) to the register, each tagged with the
                                                                         45
     appropriate condition. e.g., (B2.final.reg-a | if C1 is true,
     B3.final.reg-a | if C1 is false).
9.   For registers in (B1.final      B2.final) – (B3.final     B2.final),
     assign both values (listed in B1.final, B2.final) to the register,
     each tagged with the appropriate condition. e.g., (B2.final.reg-
     b | if C1 is true, B1.final.reg-b | if C1 is false).
10. For registers in (B2.final - (B1.final       B2.final) – (B3.final
     B2.final))    B1.initial, assign both values (listed in B1.initial,
     B2.final) to the register, each tagged with the appropriate
     condition. (e.g., (B2.final.reg-c | if C1 is true, B1.initial.reg-c |
     if C1 is false).
11. For registers remaining in B2.final, i.e., B2.final – ((B2.final -
     (B1.final     B2.final) – (B3.final        B2.final))    B1.initial),
     add the registers to B1.initial with a distinct identifier.
12. Repeat steps 9 to 11 for B3 as had been done for B2. The
     mutually exclusive multiple value sets can be synthesized in
     hardware using multiplexers with the conditional expressions
     functioning as the select lines. As a further optimization step,
     Sub expression elimination/deduction can be used for
     expression evaluations. Expressions can be tagged with
     identifiers while building/merging the initial-state and final-
     state sets and registers can be mapped to these identifiers
     instead of the expressions. This avoids duplication of the same
     expression.
                                                                              46
              In this proposed work, the optimizations performed during
decompilation include:
         i.       Optimization using Register allocation for partitioned binary
                  is performed.
        ii.       Copy propagation is used to create the dead code( i.e. when
                  (regw,a) is preceded by (reg0,w+a), it is modified to
                  (reg0,a+a) which is not reported earlier. So (regw,a) can be
                  removed inorder to reduce the space complexity.
       iii.       Dead code elimination is performed
       iv.        Loops are unrolled using array functions and this is performed
                  in the decompiler itself - Gaurav Mittal et al 2007 does only
                  manual unrolling which is the input code to PACT compiler
        v.        Loop rerolling is done manually
       vi.        Merging is performed for each block separately which is not
                  reported earlier- For an ellip benchmark it takes 81 states
                  without merging and 3 states with merging. For DCT
                  benchmark it takes up 24 states without merging and 10 states
                  with merging.
      vii.        Merging for forward branching is performed to reduce the
                  number of states which is not reported earlier- For an ellip
                  benchmark there is no forward branching (so further reduction
                  is not possible). For DCT benchmark with merging it takes up
                  10 states and merging for forward branching takes up 3 states.
                                                                              47
2.11      COMPLEXITY ANALYSIS OF DATA FLOW
          EXTRACTION
          Traditionally, consecutive sets of blocks are merged into one large
block, thereby allowing the compiler to possibly schedule more instructions in
parallel (Gaurav Mittal 2007). Alternately, in this work, merging is performed
for forward branching and blocks as represented in Figure 2.7.
          The time complexity of the algorithm can be computed assuming
each instruction has a single source operand and a single destination operand.
If the number of instructions in a partition is n, and the number of instructions
determining control-flow (equivalent to number of control nodes that would
be created) is c, the average length of a block can be stated as n/c. The
number of blocks that could be created accounting for possible duplication of
blocks is 2c. If each operand in each instruction is assumed distinct as a worst
case, the number of comparisons to create initial and final state sets for a
block of length n/c would be O((n/c)2). The total number of comparisons for
all 2c blocks would be O(n2/c). Comparisons to determine the uniqueness of
control nodes is of O(c2). Similarly the determination of the uniqueness of
basic block nodes is O((2c)2) = O(c2). Hence, in summary, the time
complexity of the algorithm is O(n2/c) + O(c2) + O(c2) = O(n2/c) + O(c2).
Similarly, the space complexity can be calculated. For a basic block, the
initial and final state sets can have elements of O(2n/c), and for 2c basic
blocks, the space complexity is O(2n/c*c) = O(n). Similarly space complexity
for control nodes is O(c). Similarly for the stack, the complexity is determined
by the number of basic blocks 2c, hence O(c). Hence the total space
complexity is O(n) + O(c).
                                                                                 48
2.12      COPROCESSOR SYNTHESIS
          The condition for successful synthesis using the CDFG derived
from dataflow extraction is that there should be no jumps into the middle of
the software block to be replaced by hardware, from the rest of the software.
This check can be incorporated in the partitioning algorithm itself without
much effect on the algorithm complexity as was stated previously. The steps
for hardware synthesis of a basic block can be stated as follows:
          1.   For the registers in the final-state set that have values which
               are straight moves from those in initial-state set, software code
               to move the values are inserted into the software glue code
               that calls the hardware. This glue code replaces the original
               software block. These registers can be removed from the final-
               state set. Also, registers in initial-state set whose values are
               not used in the final state set after this step can be removed.
          2.   For each register in the final-state set, the blocks that can
               follow the partition block in execution are scanned for a read
               or write on the corresponding register. If the first operation
               encountered is a write, these can be removed from the final-
               state set.
          3.   The elements in the initial-state set now constitute the input
               parameters for the coprocessor. These can be fed through
               ports from the microcontroller. The expressions in the
               final-state set constitute the output parameters, which can be
               read from the coprocessor and moved to the appropriate
               registers by software code.
                                                                                           49
2.13        EXPERIMENTAL RESULT
            The results obtained by applying the proposed algorithm to the
benchmarks are shown in Table 2.1.
                         Table 2.1 Runtime Comparison
                                           S/W        Partition
                      Basic     Basic    runtime     runtime in    Pure
                     block 1   block 2                             H/W      Speed up by
Benchmark     Loop                       in cycles      cycles
                      (mul      (mult                             (in 100   partitioning
                       int)     long )   (100 ns)     (100 ns)      ns)
                                          approx       approx
Dct                       -       -        26667        23222           -          1.14
Diffeq          -                 -          645          357        4.5            1.8
Ellip           -                 -          952          538      0.375           1.76
FIR             -                 -          769          402      0.192            1.9
IIR             -                 -          714          366       0.35           1.95
Lattice                   -                13333         8855           -           1.5
Nc                        -       -        16000        15161           -          1.05
Volterra                  -       -        16000        15588           -          1.02
Wavelet                   -       -        13333        12922           -          1.03
Wdf7            -         -                20000        13479           -           1.5
            Louis-Noel Pouchet et al (2008) reported the performance
distribution of dct with maximum of 1.4 speedup and in the proposed work
the speedup achieved is 1.14. Hardware utilization for the partitioned module
is shown in Table 2.2.
                                                                               50
           Table 2.2 Hardware Utilization of Partitioned Binary
                                                          H/W utilization for
                              Basic
                                       Basic block 2          partition
Benchmark        Loop         block
                                       (mult long )         (M-multiplier)
                            1(mul int)
                                                              (S-Slice)
Dct                              -             -            31/1408 S = 2%
                                                            3/12 M= 25%
Diffeq              -                          -
                                                            8/1408 S= 0%
                                                            3/12 M= 25%
Ellip               -                          -
                                                            8/1408 S= 0%
                                                            3/12 M= 25%
Fir                 -                          -
                                                            8/1408 S= 0%
                                                            3/12 M= 25%
Iir                 -                          -
                                                            8/1408 S= 0%
                                                            97/1408 S=6%
Lattice                          -
                                                            10/12 M= 83%
Nc                               -             -            31/14008 S=2%
Volterra                         -             -            31/14008 S=2%
Wavelet                          -             -            31/14008 S=2%
                                                            66/1408 S =4%
Wdf7                -            -
                                                            10/12 M = 83%
           Out of the three sections of code that frequented most of the
benchmarks, two are procedure calls representing integer multiplication and
long integer multiplication and the third is a conditional loop. These were
selected for hardware synthesis out of the candidate partitions selected by
applying the partitioning algorithm with a cutoff ratio of 1.5.
           The percentage of partitioned instruction              is tabulated in
Table 2.3 and circuits were synthesized for the target device Xilinx Virtex
xc2v8000-5-ff1152.
                                                                                      51
Table 2.3      Percentage of instructions partitioned and corresponding
               speedup
                                                                    %
              Total    Instructions Instruction        % of      runtime    Speedup in
Benchmark of number partitioned Partitioned        Instructions   of the       % by
          instructions   For SW       for HW       partitioned instructions partitioning
                                                                partitioned
 Dct           3504          3481        23            0.7          12          114
 Diffeq         410          330         80            17          59.6         180
 Ellip          574          494         80           12.45        58.2         176
 Fir            404          324         80           17.75        63.9         190
 Iir            382          302         80           18.8         65.3         195
 Lattice       2613          2589        24            9.2         38.1         150
 Nc            3886          3862        24            0.6          5.3         105
 Volterra      2664          2640        24            0.9          2.6         102
 Wavelet       3617          3593        24            0.7          3.1         103
 Wdf7          3958          3692       266            5.5         37.4         150
            The total elapsed time to generate the hardware module from the
binary is given in Table 2.4 and is obtained using MATLAB.
               Table 2.4 Total elapsed time for decompilation
                      Total         Instructions     Instruction     Elapsed time for
 Benchmark         of number        partitioned      Partitioned      decompilation
                  instructions        For SW           for HW             (sec)
 Dct                  3504             3481                  23            60.187
 Diffeq               410               330                  80            64.969
 Ellip                574               494                  80           100.687
 Fir                  404               324                  80            92.954
 Iir                  382               302                  80            88.516
 Lattice              2613             2589                  24            76.547
 Nc                   3886             3862                  24            74.641
 Volterra             2664             2640                  24            72.422
 Wavelet              3617             3593                  24            73.906
 Wdf7                 3958             3692                  266          206.578
                                                                             52
               Gaurav Mittal et al 2007 has first decompile the software
binaries to a high-level language, such as C, and then use a behavioral
synthesis tool to generate the hardware from the high-level design. The
speedup achieved for an ellip benchmark is 3.19, for an IIR it is 2.06 and for a
diffeq it is 10.30 using DSP processor and Xilinx. The speedup achieved in
the proposed work (Table 2.1) is less when compared because only few
instructions of software binaries are partitioned for hardware but it results in
very good reduction of hardware cost. Many of the benchmarks showed
significant speedup on partitioning while using very less resources compared
to pure hardware implementation. Particularly notable are the rows for the
Ellip and IIR benchmarks. They show a speedup of 1.76 and 1.95
respectively. There is also a high savings in hardware resources compared to
pure hardware implementation (slices of 0%, 0% in partitioned approach
compared to 55%, 18% in pure hardware, and Multipliers of 25%, 25% in
partitioned approach compared to 116%, 116% respectively in pure hardware
which exceeds the multiplier resources available). These results are plotted in
Figure 2.10.
Figure 2.10 Comparative Bar chart of Percentage of Hardware
               Utilization
                                                                             53
          A comparative bar chart showing the speedup obtained is plotted in
Figure 2.11.
         Figure 2.11 Comparative Bar chart of Speedup gained
          The Figure 2.11 shows to what extent each program’s size in terms
of number of cycles consumed per run gets reduced. The improvement in
results for certain benchmarks (Dct, Nc, Volterra, Wavelet) may not be much,
but from Table 2.3 and Figure 2.12, it can be seen that the temporal size,
(temporal size is the percentage of runtime of the partition as defined by
Jantsch et al (1994), for the partition selected manually out of the candidate
partitions, with small size as a desired criteria to make easy the processing of
dataflow extraction, is very low compared to other benchmarks.
                                                                            54
Figure 2.12 Bar chart of Percentage of Instructions partitioned and
             corresponding speedup
2.14      PROFILING RESULTS
          The profiling results of the Diffeq benchmark is shown in
Figure 2.13. First three columns show instruction word location, the binary
opcode and opcode in assembly format respectively. Last column shows the
number of times the instruction location was traversed. The part shown is the
procedure for integer multiplication. The program was run repeatedly to
accumulate data for a 10,000 cycle period which averaged 645 clock
cycles per run for software runtime and 357 cycles per run for partitioned
runtime. The speedup achieved is 1.8. The partitioned instruction which is
rerolled for Diffeq is shown in figure 2.14 is synthesized to obtain the
hardware clock period. The ‘I’, ‘J’, ‘H’ and ‘G’ are the variables that need to
be communicated between hardware and software which determines the
buffer size when loop rerolled. The profiling result showed that this procedure
call with 71 instruction cycles had each instruction in it consuming 84 cycles.
The Go instruction establishes communication from software to hardware.
                                                                          55
                  Figure 2.13 Profiling results for Diffeq
          Using the technique to extract dataflow description for this block,
the following set of input and output parameters for the coprocessor shown in
Figure 2.14 was obtained.
                Figure 2.14 Dataflow extraction for Diffeq
                                                                                  56
           Similarly, Figure 2.15 shows profiling results for the Dct
benchmark.
                     Figure 2.15 Profiling Results for Dct
           The instruction rlcf, which means rotate left with carry, takes as
input the carry bit of status register, and generates again as output the carry
bit. This means the carry bit in the initial-state as well as the final-state is need
to be hold. This emphasizes that the bits of status registers also need to be
held in the final-state if they are going to be affected by the present
instruction, and in the initial-state if they are going to be used in the
instruction execution. The structure CF generator is represented in the figure
2.16. The optimization such as dead code elimination and copy propagation is
performed in CFG as shown in Figure 2.17. The generated CDFG is shown in
Figure 2.18.
                                  58
Figure 2.17 Optimization of CFG
                                                                              59
                          Figure 2.18 CDFG for Dct
          The status bits have been omitted from consideration in the
proposed algorithm for the sake of brevity, since they are also a part of a
register. The CDFG derived after merge operation is shown in Figure 2.19.
The hardware realization of the CDFG represents the body of the loop, with
registers to hold the result of computation for each iteration of the loop.
                                                                             60
           Figure 2.19 CDFG for DCT after Merging operation
          The coprocessor communicates with the processor using a
handshake protocol. The processor issues a “go” signal after placing all the
required input parameters for computation, on the data lines. The coprocessor
starts computation at the edge of the clock pulse after receiving the go signal.
Each clock pulse signifies the start of a new iteration. After computation for
all iterations are complete, the coprocessor issues the done signal, placing the
output parameters on the data lines. Hence there is no need for clock
synchronization. The coprocessor can run independently from the processor,
and both can run tasks simultaneously. Communication buffers can be used
instead of the data lines, in which case, the maximum of the number of input
parameters and the number of output parameters is the buffer size needed for
                                                                            61
communication. The estimates of buffer size requirements for each partition
block are shown in Table 2.5.
             Table 2.5 Buffer size / Data bus width Estimates
                                         Buffer size/ Data bus
               Partition Block
                                             size mbytes
                     Loopl                            7
            Basic block 1 (mul-int)                   4
           Basic block 2 (mul-long)                   8
          The Table 2.6 shows the number of adders/subtractors and registers
required for different practical implementations with and without forward
branching. The present work has performed merging with forward branching
and hence, there is a good improvement in the utilization of resources. For ex:
Merging the (new) consecutive sets of blocks into one large block, allows the
compiler to possibly schedule more instructions in parallel (Gaurav Mittal
et al 2007). Merging for large blocks requires two adders/subtractors and
thirty nine registers. However, in this work by the novel use of merging with
forward branching the number of registers reduces to 22 and forward
branching does not occur in Diffeq, Ellip, Fir, Iir and Wdf7 .
                                                                                     62
                 Table 2.6 Merging for binary level partitioning
                                                                Merging forward
              Without Merging                Merging
                                                                  branching
             Adders/                  Adders/                   Adders/
                         Registers                 Registers                 Registers
           subtractors               subtractors               subtractors
DCT             14          125          2             39          2            22
Diffeq          24          123          1             8            -            -
Ellip           24          123          1             8            -            -
Fir             24          123          1             8            -            -
Iir             24          123          1             8            -            -
Lattice         14          124          2             39          2            22
NC              14          124          2             39          2            22
Volterra        14          124          2             39          2            22
Wavelet         14          124          2             39          2            22
Wdf7            36          125          2             47           -            -
      The register allocation is performed for critical kernel of software binaries
(integer multiplier, long integer multiplier and loop is identified) using
Maximal Clique which is represented in Appendix 5. Initial and final state is
used to represent each block. Merging basic blocks, Merging blocks with
forward branching is performed which is not reported earlier and also loop
rerolling is made manually. Since optimization is made in software binaries
this methodology supports control dominated and data dominated systems.
             Gaurav Mittal et al 2007 has first decompile the software binaries
to a high-level language, such as C, and then use a behavioral synthesis tool to
generate the hardware from the high-level design. The speedup achieved for
an ellip benchmark is 3.19, for an IIR it is 2.06 and for a diffeq it is 10.30
using DSP processor and Xilinx. The speedup achieved in the proposed work
(Table 2.1)is less when compared because only few instructions of software
                                                                             63
binaries are partitioned for hardware but it results in very good reduction of
hardware cost.
2.15      CHAPTER CONCLUSION
       In this Chapter, a partition algorithm is designed and the decompiler is
designed with known optimization techniques like dead code elimination,
loop unrolling, loop rerolling, register allocation and merging forward
branching with set of blocks. The novelty of the method proposed in the
chapter is forward branching using merging, Register allocation for software
binaries has not been reported earlier, Loops are unrolled using array
functions which is not represented earlier and this is performed in the
decompiler itself. Gaurav Mittal et al 2007 does only manual unrolling which
is the input code to PACT compiler. Loop rerolling is done manually.
Merging is performed for each block separately which is not reported earlier-
For an ellip benchmark it takes 81 states without merging and 3 states with
merging. For DCT benchmark it takes up 24 states without merging and 10
states with merging. Merging for forward branching is performed to reduce
the number of states which is not reported earlier. For an ellip benchmark
there is no forward branching (so further reduction is not possible). For DCT
benchmark with merging it takes up 10 states and merging for forward
branching takes up 3 states. Optimization is often non-trivial since there are
multiple conflicting constraints involved (e.g. performance vs cost vs power
consumption in digital VLSI design), and optimization problems are generally
belong to class of NP-complete problems. Hence, heuristics are used instead
of exact algorithms to find optimal solutions, and reduce the complexity. For
ex: DCT implementation is non-trivial and the implementation of this non-
trivial system is presented in the thesis. Buffer size estimation is carried out
for binary level approach and source level approach and observed to be same
for both when applied for an elliptic wave filter. The critical kernel is
partitioned to hardware in binary level partitioning while multipliers are
partitioned to hardware in source level partitioning.
Figure 2.16 Structure of CF Generator
                                        57