Unit 1
9 Hours
ARM Architecture: The Acorn RISC Machine, Architectural inheritance,
Architecture of ARM7TDMI, ARM programmer's model, ARM development tools, 3-stage
pipeline ARM organization, ARM instruction execution. The Advanced
Microcontroller Bus Architecture (AMBA). Introduction, structure of assembly
language modules, Predefined register names, frequently used directives,
Macros, Miscellaneous assembler features.
Text 1 (2.1, 2.2, 2.3, 2.4, 2.5, 4.1, 4.3), Text 2 (4.1 to 4.6)
The unit of data size
Bit : a binary digit that can have the value 0 or 1
Byte : 8 bits
Nibble : half of a byte, or 4 bits
Word : two bytes, or 16 bits, on many processors; on the ARM a word is 32 bits
and a 16-bit item is called a half-word
Memory
- RAM (Random Access Memory) – temporary storage of
programs that the computer is running
The data is lost when the computer is off
- ROM (Read Only Memory) – contains programs and
information essential to the operation of the computer
The information cannot be changed by the user, and is not
lost when power is off.
It is called nonvolatile memory
Registers
The CPU uses registers to store information
temporarily
- Data to be processed
- Address of data/code to be fetched from memory
In general, the more and bigger the registers, the
better the CPU
Registers can be 8-, 16-, 32-, or 64-bit
The disadvantage of more and bigger registers is
the increased cost of such a CPU
ALU (arithmetic/logic unit)
Performs arithmetic functions such as add,
subtract, multiply, and divide, and logic functions
such as AND, OR, and NOT
Program counter
Points to the address of the next instruction to be
executed.
As each instruction is executed, the program
counter is incremented to point to the address of
the next instruction to be executed.
Instruction decoder
- Interprets the instruction fetched into the CPU
- A CPU capable of understanding more
instructions requires more transistors to design
General-purpose microprocessors:
Must add RAM, ROM, I/O ports, and timers externally to
make them functional.
Make the system bulkier and much more expensive.
Have the advantage of versatility on the amount of RAM,
ROM, and I/O ports
Microcontroller:
The fixed amount of on-chip ROM, RAM, and number of
I/O ports makes them ideal for many applications in which
cost and space are critical
In many applications, the space it takes, the power it
consumes, and the price per unit are much more critical
considerations than the computing power
Von Neumann architecture:
A computer whose memory holds both data and
instructions.
The CPU has several internal registers
Program counter, general-purpose registers, …
The CPU fetches instructions from memory at the address held in the program counter
The separation of the instruction memory from the CPU distinguishes a
stored-program computer from a general finite-state machine
Harvard architecture:
Separate memories for data and program
The program counter points to the program memory
Hard to write self-modifying programs
Used for one very simple reason:
it provides higher performance for digital signal processing
Most DSPs are Harvard architectures
Most phone calls go through at least 2 DSPs, one at each end
of the call
von Neumann vs. Harvard:
Harvard makes self-modifying code hard to write.
Harvard allows two simultaneous memory fetches.
Most DSPs use Harvard architecture for streaming data:
greater memory bandwidth
more predictable bandwidth
Streaming data
Data sets that arrive continuously and periodically
What is RISC?
RISC, or Reduced Instruction Set Computer is a type of
microprocessor architecture that utilizes a small,
highly-optimized set of instructions, rather than a
more specialized set of instructions often found in
other types of architectures.
One cycle execution time: RISC processors aim for a CPI
(clock cycles per instruction) of one. This is due to the
optimization of each instruction on the CPU and a
technique called pipelining.
Pipelining: a technique that allows the simultaneous
execution of parts, or stages, of instructions to process
instructions more efficiently.
Large number of registers: the RISC design philosophy
generally incorporates a larger number of registers to
prevent large amounts of interaction with memory.
The main characteristics of CISC microprocessors
are:
Extensive instructions.
Complex and efficient machine instructions.
Extensive addressing capabilities for memory operations.
Relatively few registers.
In comparison, RISC processors are more or less the
opposite of the above:
Reduced instruction set.
Less complex, simple instructions.
Hardwired control unit and machine instructions.
Few addressing schemes for memory operands with only
two basic instructions, LOAD and STORE.
Many symmetric registers which are organised into a
register file.
CISC: small code sizes, high cycles per second
RISC: low cycles per second, large code sizes
RISC drawbacks
• RISCs generally have poor code density compared with CISCs.
• RISCs don’t execute x86 code.
First ARM was developed at Acorn Computers
Limited, of Cambridge, England between October
1983 and April 1985.
Before 1990, ARM stood for Acorn RISC Machine
Later, ARM came to stand for Advanced RISC Machine
The RISC concept was introduced in 1980 at Stanford
and Berkeley.
Advanced RISC Machines Ltd. (ARM) was founded in 1990
ARM cores
-Licensed partners to develop and fabricate new
microcontrollers
-Soft core
ARM was established in November 1990 as Advanced RISC
Machines Ltd.,
UK-based joint venture between Apple Computer, Acorn
Computer Group and VLSI Technology.
Apple and VLSI both provided funding, while Acorn
supplied the technology.
Acorn, developer of the world’s first commercial single-
chip RISC processor, and Apple, intent on advancing the use
of RISC technology in its own systems, chartered ARM with
creating a new microprocessor standard.
ARM immediately differentiated itself in the market by
creating the first low-cost RISC architecture.
Conversely, competing architectures, which were more
commonly focused on maximizing performance, were
first used in high-end workstations.
1985 Acorn Computer Group develops the world's first commercial RISC
processor
1987 Acorn's ARM processor debuts as the first RISC processor for low-cost
PCs
1990 Advanced RISC Machines (ARM) spins out of Acorn and Apple
Computer's collaboration efforts with a charter to create a new
microprocessor standard. VLSI Technology becomes an investor and the first
licensee
1991 ARM introduces its first embeddable RISC core, the ARM6™ solution
1995 Atmel/ES2, Digital, LG Semicon, NEC and Symbios Logic license ARM
technology.
- ARM's Thumb® architecture extension gives 32-bit RISC performance at
16-bit system cost and offers industry-leading code density
- ARM opens office in Munich, Germany
- ARM launches Software Development Toolkit
- TI samples first ARM Thumb core
- First StrongARM™ core from Digital Semiconductor and ARM
- ARM extends family with ARM8™ high-performance solution
- ARM launches the ARM7100™ "PDA-on-a-chip"
The 16-bit CISC microprocessors available in 1983 had certain
disadvantages
-They were slower than standard memory
parts
-Their instructions took many clock cycles to
complete
-Long interrupt latency
Control over ALU and shifter for every data
processing operation to maximize their usage
Auto-Increment and auto-Decrement addressing
modes to optimize program loops
Load and Store Multiple instructions to maximize
data throughput
Conditional Execution of instruction to maximize
execution throughput
[Figure: ARM core datapath, with labels A[31:0], PC incrementer, PC, register bank, instruction decode and control, multiply register, ALU bus, A bus, B bus, barrel shifter, ALU, D[31:0]]
Data items are placed in register file
-No data processing instructions directly
manipulate data in memory
Instructions typically use two source registers and
a single result (destination) register
A Barrel shifter on the data path can preprocess
data before it enters ALU
Increment/Decrement logic can update register
content for sequential access independent of ALU
General Purpose Registers hold either data or address
All registers are of 32 bits
Total 37 registers
Up to 16 data registers and 2 status registers are visible at any
one time (in user mode there is no SPSR, so only the CPSR is visible)
Data registers: r0 to r15
-Three registers r13, r14, r15 perform special functions
-r13: stack pointer
-r14: link register (where return address is put whenever
a subroutine is called)
-r15: program counter
Depending upon context, registers r13 and r14 can
also be used as GPR
Any instruction which uses r0 can equally be used
with any other GPR (r1-r13)
In addition, there are two status registers
-CPSR: Current Program Status Register
-SPSR: Saved Program Status Register
The ARM chip was designed based on Berkeley
RISC I and II and the Stanford MIPS
(Microprocessor without Interlocking Pipeline
Stages)
Features Used from Berkeley RISC design
-a load-store architecture
-fixed length 32-bit instructions
-3-address instruction formats
Features Rejected
-Register windows
-Delayed Branches
- Single Cycle execution of all instructions
Based upon RISC Architecture with enhancements to meet
requirements of embedded applications
• Fifteen general-purpose registers are visible at any one
time, depending on the current processor mode, as r0,
r1, ... ,r13, r14.
• By convention, r13 is used as a stack pointer (sp) in
ARM assembly language. The C and C++ compilers
always use r13 as the stack pointer.
• In User mode, r14 is used as a link register (lr) to store
the return address when a subroutine call is made.
• It can also be used as a general-purpose register if the
return address is stored on the stack.
• In the exception handling modes, r14 holds the return
address for the exception, or a subroutine return
address if subroutine calls are executed within an
exception.
• r14 can be used as a general-purpose register if the
return address is stored on the stack.
• The program counter is accessed as r15 (or pc). It is
incremented by one word (four bytes) for each
instruction in ARM state, or by two bytes in Thumb
state.
• Branch instructions load the destination address into
the program counter. You can also load the program
counter directly using data operation instructions.
For example, to return from a subroutine, you can
copy the link register into the program counter
using:
MOV pc,lr or
MOV r15,r14
• During execution, r15 does not contain the address
of the currently executing instruction. The address of
the currently executing instruction is typically pc - 8
for ARM, or pc - 4 for Thumb.
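The relationship between an instruction's address and the value it sees in r15 can be sketched in a few lines of Python. This is only an illustrative model: the function name `visible_pc` and the example address 0x8000 are assumptions, and the offsets of 8 and 4 bytes are the typical values quoted above.

```python
# Illustrative model: the value an instruction observes when it reads r15 (pc).
# The pc typically runs two instructions ahead of the one being executed.
ARM_INSTR_BYTES = 4    # ARM state: 32-bit instructions
THUMB_INSTR_BYTES = 2  # Thumb state: 16-bit instructions

def visible_pc(instr_address, thumb=False):
    """Return the pc value seen by the instruction stored at instr_address."""
    size = THUMB_INSTR_BYTES if thumb else ARM_INSTR_BYTES
    return instr_address + 2 * size

print(hex(visible_pc(0x8000)))              # ARM state
print(hex(visible_pc(0x8000, thumb=True)))  # Thumb state
```

Under this model, an ARM-state instruction at 0x8000 reads pc as 0x8008, and a Thumb-state instruction at the same address reads 0x8004.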
Instruction set will only process values which are in
registers
The only operations which apply to memory state are
ones which copy memory values into registers(load
instructions) or copy register values into memory(store
instruction)
ARM does not support such ‘memory-to-memory’
operations
Therefore all ARM instructions fall into three categories;
1. Data processing instructions.
- These use and change only register values.
2. Data transfer instructions.
- Loads or stores.
3. Control flow instructions.
- E.g., branch, call/return, tripping into system code
(supervisor calls).
The ARM handles I/O peripherals as memory-mapped devices with
interrupt support. The internal registers in these devices appear as
addressable locations within the ARM’s memory map and may be
read and written using the same load-store instructions as any other
memory locations.
Normally most interrupt sources share the IRQ input. Some systems may
include DMA hardware external to the processor to handle high-bandwidth
I/O traffic.
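Register organization (r0 to r15 and the CPSR are usable in user mode; the banked registers listed for the other modes are available in system modes only):
user: r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11, r12, r13, r14, r15 (PC), CPSR
fiq: r8_fiq, r9_fiq, r10_fiq, r11_fiq, r12_fiq, r13_fiq, r14_fiq, SPSR_fiq
svc: r13_svc, r14_svc, SPSR_svc
abort: r13_abt, r14_abt, SPSR_abt
irq: r13_irq, r14_irq, SPSR_irq
undefined: r13_und, r14_und, SPSR_und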
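As a rough check on the register organization, the following Python sketch models the banked registers and counts them. The dictionary layout and the helper name `physical_register` are assumptions made for illustration; the register names are the ones that appear in the register diagram.

```python
# Illustrative model of the ARM register bank.
USER_REGS = [f"r{n}" for n in range(16)]          # r0..r15, shared by default
BANKED = {
    "fiq": [f"r{n}_fiq" for n in range(8, 15)],   # r8_fiq..r14_fiq
    "svc": ["r13_svc", "r14_svc"],
    "abt": ["r13_abt", "r14_abt"],
    "irq": ["r13_irq", "r14_irq"],
    "und": ["r13_und", "r14_und"],
}
STATUS = ["CPSR"] + [f"SPSR_{mode}" for mode in BANKED]

def physical_register(mode, n):
    """Map register number n (0..15) in the given mode to its physical name."""
    banked_name = f"r{n}_{mode}"
    if banked_name in BANKED.get(mode, []):
        return banked_name
    return f"r{n}"  # not banked in this mode: the user-mode register is used

total = len(USER_REGS) + sum(len(regs) for regs in BANKED.values()) + len(STATUS)
print(total)
print(physical_register("fiq", 8))
print(physical_register("irq", 8))
```

The count comes to 16 + 15 banked + 6 status registers = 37, matching the total of 37 registers quoted earlier in these notes.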
The CPSR is used in user-level programs to store the condition code bits.
These bits are used, for example, to record the result of a comparison
operation and to control whether a conditional branch is taken or not.
Flag Logical Instruction Arithmetic Instruction
CPSR format:
Bits 31-28: condition code flags N, Z, C, V
Bits 27-8: unused
Bit 7: I, bit 6: F, bit 5: T
Bits 4-0: mode
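A minimal Python sketch of how the CPSR fields can be picked apart, assuming the usual ARM7TDMI bit positions (N, Z, C, V in bits 31 to 28; I, F, T in bits 7 to 5; mode in bits 4 to 0). The function name `decode_cpsr` and the sample value are hypothetical, and the mode encodings are the standard ARM values, which these notes do not list.

```python
# Illustrative CPSR decoder (bit positions and mode encodings are assumptions
# stated in the lead-in, not taken from the notes themselves).
MODES = {
    0b10000: "user", 0b10001: "fiq", 0b10010: "irq", 0b10011: "svc",
    0b10111: "abort", 0b11011: "undefined", 0b11111: "system",
}

def decode_cpsr(value):
    """Split a 32-bit CPSR value into its flag, control and mode fields."""
    return {
        "N": (value >> 31) & 1,   # negative
        "Z": (value >> 30) & 1,   # zero
        "C": (value >> 29) & 1,   # carry
        "V": (value >> 28) & 1,   # overflow
        "I": (value >> 7) & 1,    # IRQ disable
        "F": (value >> 6) & 1,    # FIQ disable
        "T": (value >> 5) & 1,    # Thumb state
        "mode": MODES.get(value & 0x1F, "unknown"),
    }

fields = decode_cpsr(0x600000D3)
print(fields)
```

For the sample value 0x600000D3 the sketch reports Z and C set, both interrupt disable bits set, ARM state, and supervisor mode.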
In addition to the processor register state, an ARM system has
memory state.
Memory may be viewed as a linear array of bytes numbered from 0
up to 2^32 - 1. Data items may be 8-bit bytes, 16-bit half-words or
32-bit words.
Words are always aligned on 4-byte boundaries (i.e., the two LSBs are
zero) and half-words are aligned on even byte boundaries.
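The alignment rules can be expressed as simple bit tests. The following Python sketch is illustrative only; the function names and sample addresses are assumptions.

```python
# Illustrative alignment checks for a 32-bit byte-addressed memory.
TOP_ADDRESS = 2**32 - 1  # highest byte address

def is_word_aligned(address):
    """A word address has its two least significant bits equal to zero."""
    return (address & 0b11) == 0

def is_halfword_aligned(address):
    """A half-word address has its least significant bit equal to zero."""
    return (address & 0b1) == 0

print(hex(TOP_ADDRESS))
print(is_word_aligned(0x1004), is_word_aligned(0x1006))
print(is_halfword_aligned(0x1006), is_halfword_aligned(0x1007))
```

Here 0x1004 is a valid word address, 0x1006 is valid only for a half-word (or byte), and 0x1007 is valid only for a byte.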
All ARM instructions are 32 bits wide (except the
compressed 16-bit Thumb instructions which are
described later).
The most notable features of the ARM instruction set are:
The Load-Store Architecture
3-address data processing instructions
Conditional execution of every instruction
The inclusion of very powerful load and store multiple
register instructions
The ability to perform a general shift operation and a
general ALU operation in a single instruction that
executes in a single clock cycle
Open instruction set extension through the coprocessor
instruction set
A very dense 16-bit compressed instruction set in Thumb
mode
C or Assembler source files are compiled or assembled
into ARM object format (.aof) files
Then linked into ARM image format (.aif) files
The image format files can be built to include the debug
tables required by the ARM symbolic debugger (ARMsd)
which can load, run and debug programs either on
hardware such as the ARM Development Board or using a
software emulation of the ARM (the ARMulator)
The ARMulator has been designed to allow easy
extension of the software model to include system
features such as caches, memory timing characteristics,
and so on
ARM C compiler - ANSI standard, fast, integrated
ARM Assembler - translates assembly instructions into
machine code instructions (object files)
Linker- Takes one or more object files (from C compiler
or ARM assembler) and combines them into one
executable program
Resolve symbolic references (i.e. names of variables or
routines are turned into actual memory addresses)
ARM symbolic debugger - full control on execution and
viewing of registers
ARMulator - emulates the ARM processor in software, with
Instruction-accurate modelling
Cycle-accurate modelling
Timing-accurate modelling
ARM C compiler is compliant with the ANSI standard for C
Uses ARM procedure Call Standard for all externally
available functions
Can produce assembly source output instead of ARM
object format, so code can be inspected, or even hand
optimized, and then assembled subsequently
Compiler can also produce Thumb code
The ARM assembler is a full macro assembler which produces
ARM object format output that can be linked with output
from the C compiler
Assembly source is near to machine level, with most assembly instructions
translating into single ARM (or Thumb) instructions.
The linker takes one or more object files and combines them into
an executable program
It resolves symbolic references between the object files
and extracts object modules from libraries as needed by
the program
It can assemble the various components of the program in
a number of different ways, depending on whether the
code is to run in RAM or ROM, whether overlays are
required, and so on
The linker includes debug tables in the output file
It can also produce object library modules that are not
executable but are ready for efficient linking with object
files in future.
The ARM symbolic debugger (ARMsd) is a front-end interface to assist in
debugging programs running either under emulation or remotely on a target
system such as the ARM development board
Allows an executable program to be loaded into the
ARMulator or a development board and run
Allows the setting of breakpoints, which are addresses in
the code that, if executed, cause execution to halt so that
the processor state can be examined
In the ARMulator, or when running on hardware with
appropriate support, it also allows the setting of
watchpoints
Supports full source level debugging, allowing the C
programmer to debug a program using source file to
specify breakpoints and using variable names from
original program
The ARMulator (ARM emulator) is a suite of programs that models the
behavior of various ARM processor cores in software on a
host system
Can operate at various levels of accuracy:
Instruction-accurate modeling gives the exact behavior
of the system state without regard to the precise timing
characteristics of the processor
Cycle-accurate modeling gives the exact behavior of the
processor on a cycle-by-cycle basis, allowing the exact
number of clock cycles that a program requires to be
established
Timing-accurate modeling presents signals at the correct
time within a cycle, allowing logic delays to be accounted
for
ARM Development Board is a circuit board incorporating a
range of components and interfaces to support the
development of ARM-based systems
Software Toolkit:
ARM Project Manager is a graphical front-end for the tools
It supports the building of a single library or executable image
from a list of files that make up a particular project
Source files (C, assembler, and so on)
- Object files
- Library files
The source files may be edited within the Project Manager
There are many options which may be chosen for the build
Whether the output should be optimized for code size or
execution time
Whether the output should be in debug or release form
Which ARM processor is the target and particularly
whether it supports the Thumb instruction set
JumpStart: JumpStart tools from VLSI Technology, Inc.,
include the same basic set of development tools but
present a full X-windows interface on a suitable
workstation rather than the command-line interface of
the standard ARM toolkit
There are many other suppliers of tools that support ARM
development
The principal components of an ARM organization with a 3-stage
pipeline are:
◦ Register bank
◦ ALU
◦ Address register and incrementer
◦ Data register: holds data passing to and from memory
[Figure: 3-stage pipeline ARM organization, with labels A[31:0], control, decode, multiply register, ALU bus, A bus, B bus, barrel shifter, D[31:0]]
ARM processors up to the ARM7 employ a simple 3-stage
pipeline with the following pipeline stages:
◦ Fetch
The instruction is fetched from memory and placed in the
instruction pipeline.
◦ Decode
The instruction is decoded and the datapath control signals
prepared for the next cycle. In this stage the instruction ‘owns’
the decode logic but not the datapath.
◦ Execute
The instruction ‘owns’ the datapath; the register bank is read,
an operand shifted, the ALU result generated and written back
into a destination register.
When the processor is executing simple data processing instructions
the pipeline enables one instruction to be completed every clock cycle.
An individual instruction takes three clock cycles to complete, so it has
a three-cycle latency, but the throughput is one instruction per cycle.
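The difference between latency and throughput can be seen in a small Python sketch of an ideal 3-stage pipeline. It assumes every instruction is a simple single-cycle data processing instruction and that there are no stalls; the function names are hypothetical.

```python
# Illustrative model of an ideal 3-stage pipeline (no stalls, no multi-cycle
# instructions). Cycle numbers are 1-based.
PIPELINE_STAGES = 3

def schedule(n_instructions):
    """Return (fetch, decode, execute) cycle numbers for each instruction."""
    return [(i + 1, i + 2, i + 3) for i in range(n_instructions)]

def total_cycles(n_instructions, stages=PIPELINE_STAGES):
    """Cycles needed to complete n instructions back to back."""
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1)

for index, (fetch, decode, execute) in enumerate(schedule(4), start=1):
    print(f"instruction {index}: fetch={fetch} decode={decode} execute={execute}")
print(total_cycles(1), total_cycles(100))
```

One instruction alone needs 3 cycles (the latency), while 100 instructions need 102 cycles, which approaches one instruction per cycle (the throughput).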
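The time, Tprog, required to execute a given program is given by:
Tprog = (Ninst × CPI) / fclk
where Ninst is the number of ARM instructions executed in the course of the
program, CPI is the average number of clock cycles per instruction, and fclk is
the processor's clock frequency.
Since Ninst is constant for a given program, there are only two
ways to increase performance:
◦ Increase the clock rate, fclk.
This requires the logic in each pipeline stage to be simplified and, therefore,
the number of pipeline stages to be increased.
◦ Reduce the average number of clock cycles per instruction, CPI.
This requires either that instructions which occupy more than one
pipeline slot in a 3-stage pipeline ARM are re-implemented to
occupy fewer slots, or that pipeline stalls caused by dependencies
between instructions are reduced, or a combination of both.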
Memory bottleneck
A 3-stage ARM core accesses memory on (almost) every clock
cycle, either to fetch an instruction or to transfer data. To get a
significantly better CPI the memory system must deliver more than
one value in each clock cycle, either by delivering more than 32
bits per cycle from a single memory or by having separate
memories for instruction and data accesses.
[Figure: datapath activity during instruction execution. Panels show data processing instructions (operands Rn and Rm, or Rn and an immediate from instruction bits [7:0], result to Rd), a data transfer using a 12-bit offset (bits [11:0], shifter lsl #0), and a branch using a 24-bit offset (bits [23:0], shifted lsl #2, with R14 written).]
For example, in the Keil tools, you could say something like
s0-s31 or S0-S31
Declaring an Entry Point: In the Keil tools, the ENTRY directive declares an entry
point to a program. The syntax is:
ENTRY
A program must have at least one ENTRY point; otherwise, a
warning is generated at link time. If you have a project with multiple source
files, not every source file will have an ENTRY directive, and any single source
file should have only one ENTRY directive. The assembler will generate an error
if more than one ENTRY exists in a single source file.
EXAMPLE
AREA ARMex, CODE, READONLY
ENTRY ; Entry point for the application
When writing programs that contain tables or data that must be configured
before the program begins, it is necessary to specify exactly what memory looks
like.
Strings, floating-point constants, and even addresses can be stored in
memory as data using various directives.
DCB: allocates one or more bytes of memory and defines their initial runtime contents. The syntax is
{label} DCB expr{,expr}…
where expr is either a numeric expression that evaluates to an integer in the
range −128 to 255, or a quoted string, where the characters of the string are
stored consecutively in memory.
Since the DCB directive affects memory at the byte level, you should use an
ALIGN directive afterward if any instructions follow to ensure that the
instruction is aligned correctly in memory.
EXAMPLE:
Unlike strings in C, ARM assembler strings are not null-terminated. You can
construct a null-terminated string using DCB as follows:
C_string DCB "C_string",0
If this string started at address 0x4000 in memory, its nine bytes (eight
characters followed by the terminating zero) would occupy addresses 0x4000 to 0x4008.
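The byte layout produced by this DCB line can be reproduced with a short Python sketch. The helper `dcb` is a hypothetical stand-in for the directive, written only to show how a quoted string and an integer expression map onto consecutive bytes; the start address 0x4000 is the one used in the text.

```python
# Illustrative stand-in for DCB: strings become their ASCII bytes, integers in
# the range -128..255 become a single byte each.
def dcb(*exprs):
    out = bytearray()
    for expr in exprs:
        if isinstance(expr, str):
            out += expr.encode("ascii")
        else:
            if not -128 <= expr <= 255:
                raise ValueError("DCB value must be in the range -128 to 255")
            out.append(expr & 0xFF)
    return bytes(out)

data = dcb("C_string", 0)
for offset, byte in enumerate(data):
    print(f"0x{0x4000 + offset:04X}: 0x{byte:02X}")
```

The listing runs from 0x4000 (0x43, the letter C) to 0x4008 (0x00, the terminator).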
ALIGN directive: It aligns the current location to a specified boundary by
padding with zeros. The syntax is
ALIGN {expr{,offset}}
where expr is a numeric expression evaluating to any power of two from 2^0 to
2^31, and offset can be any numeric expression. The current location is aligned
to the next address of the form
offset + n * expr
If expr is not specified, ALIGN sets the current location to the next word (four
byte) boundary.
EXAMPLE:
AREA OffsetExample, CODE
DCB 1 ; This example places the two
ALIGN 4,3 ; bytes in the first and fourth
DCB 1 ; bytes of the same word
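The address arithmetic behind ALIGN can be checked with a small Python sketch. The function `align` is a hypothetical model of the rule "next address of the form offset + n * expr", not the assembler's implementation; the default of 4 mirrors the word boundary used when expr is omitted.

```python
# Illustrative model of the ALIGN rule.
def align(location, expr=4, offset=0):
    """Return the lowest address >= location of the form offset + n * expr."""
    if expr <= 0 or expr & (expr - 1):
        raise ValueError("expr must be a power of two")
    return location + ((offset - location) % expr)

# In the example above the first DCB occupies offset 0, so the current
# location is 1 when ALIGN 4,3 is reached.
print(align(1, 4, 3))  # where the second DCB is placed
print(align(1))        # plain ALIGN: next word boundary
```

With a current location of 1, `ALIGN 4,3` moves to offset 3, which is why the two bytes land in the first and fourth bytes of the same word.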
SPACE: reserves a zeroed block of memory.
The syntax is: {label} SPACE expr
where expr evaluates to the number of zeroed bytes to reserve. You may also
want to use the ALIGN directive after using a SPACE directive, to align any code
that follows.
EXAMPLE:
AREA MyData, DATA, READWRITE
data1 SPACE 255 ; defines 255 bytes of zeroed storage
Ending a Source File:
This is the easiest of the directives—END simply tells the assembler you’re at
the end of a source file. The syntax for the Keil tools is,
END
When you terminate your source file, place the directive on a line by itself.
MACROS: Macro definitions allow a programmer to build definitions of functions
or operations once, and then call this operation by name throughout the code,
saving some writing time.
In fact, macros can be part of a process known as conditional assembly,
wherein parts of the source file may or may not be assembled based on certain
variables, such as the architecture version (or a variable that you specify
yourself).
Two directives are used to define a macro: MACRO and MEND. The syntax is:
MACRO
{$label} macroname{$cond} {$parameter{,$parameter}…}
; code
MEND
Where $label is a parameter that is substituted with a symbol given when the
macro is invoked. The symbol is usually a label. The macro name must not
begin with an instruction or directive name.
The parameter $cond is a special parameter designed to contain a condition
code; however, values other than valid condition codes are permitted.
The term $parameter is substituted when the macro is invoked.
Within the macro body, parameters such as $label, $parameter, or $cond can be
used in the same way as other variables. They are given new values each time
the macro is invoked. Parameters must begin with $ to distinguish them from
ordinary symbols. Any number of parameters can be used. The $label field is
optional, and the macro itself defines the locations of any labels.
EXAMPLE : Suppose you have a sequence of instructions that appears multiple
times in your code—in this case, two ADD instructions followed by a
multiplication. You could define a small macro as follows:
MACRO ; macro definition:
; vara = 8 * (varb + varc + 6)
$Label_1 AddMul $vara, $varb, $varc
$Label_1
ADD $vara, $varb, $varc ; add two terms
ADD $vara, $vara, #6 ; add 6 to the sum
LSL $vara, $vara, #3 ; multiply by 8
MEND
In your source code file, you can then instantiate the macro as many times as
you like. You might call the sequence as,
CSet1 AddMul r0, r1, r2 ; invoke the macro
; the rest of your code
and the assembler makes the necessary substitutions, so that the assembly
listing actually reads as,
CSet1 ; invoke the macro
ADD r0, r1, r2
ADD r0, r0, #6
LSL r0, r0, #3
; the rest of your code
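To confirm that the three expanded instructions compute 8 * (varb + varc + 6), the sequence can be mimicked in Python with 32-bit wrap-around. The function name `add_mul` is hypothetical; each line mirrors one instruction of the macro body.

```python
# Illustrative model of the AddMul macro body on 32-bit registers.
MASK32 = 0xFFFFFFFF

def add_mul(varb, varc):
    vara = (varb + varc) & MASK32   # ADD $vara, $varb, $varc
    vara = (vara + 6) & MASK32      # ADD $vara, $vara, #6
    vara = (vara << 3) & MASK32     # LSL $vara, $vara, #3 (multiply by 8)
    return vara

print(add_mul(1, 2))
print(add_mul(1, 2) == 8 * (1 + 2 + 6))
```

The shift left by three bits stands in for the multiplication by 8, so no multiply instruction is needed.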
ORR r1, r1, #1:SHL:3 ; set CCREG[3]
Here, a 1 is shifted left three bits.
Assuming you like to call register r1 CCREG,
you have now set bit 3.
The advantage in writing it this way is that you are more likely to understand
that you wanted a one in a particular bit location, rather than simply using a
logical operation with a value such as 0x8.
You can even use these operators in the creation of constants, for example,
DCD (0x8321:SHL:4):OR:2
MOV r0, #((1:SHL:14):OR:(1:SHL:12))
MOV r0, #((1 << 14) | (1 << 12))
MOV r0, #0x5000
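The constants in these examples can be verified with ordinary Python integer arithmetic, reading :SHL: as a left shift and :OR: as a bitwise OR. The variable names are assumptions chosen for illustration.

```python
# Illustrative check of the constant expressions used above.
ccreg_bit3 = 1 << 3                  # #1:SHL:3
dcd_value = (0x8321 << 4) | 2        # (0x8321:SHL:4):OR:2
mov_operand = (1 << 14) | (1 << 12)  # (1:SHL:14):OR:(1:SHL:12)

print(hex(ccreg_bit3))
print(hex(dcd_value))
print(hex(mov_operand))
```

The last value is 0x5000, which shows that the three MOV lines load the same constant.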