Embedded Intro
Examples of Embedded Systems
• A modern home
– has one general-purpose desktop PC
– but a dozen or so embedded systems.
• More prevalent in industrial sectors
– half a dozen embedded computers in modern
automobiles
– chemical and nuclear power plants
Embedded Applications
Simple Examples
a simple thermostat controller
• periodically reads the temperature of the
chamber
• switches on or off the cooling system.
a pacemaker
• constantly monitors the heart
• paces the heart when heart beats are
missed
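The thermostat controller above can be sketched as one control step that is called periodically. This is a minimal sketch, not a definitive design: the `Thermostat` type, the integer temperature units (tenths of a degree), and the hysteresis band that keeps the cooler from chattering around the setpoint are all assumptions added for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical controller state; temperatures are in tenths of a degree. */
typedef struct {
    int setpoint;     /* target chamber temperature */
    int band;         /* assumed hysteresis band around the setpoint */
    bool cooling_on;  /* current state of the cooling system */
} Thermostat;

/* One periodic step: take a temperature reading, switch cooling on or off. */
bool thermostat_step(Thermostat *t, int temperature)
{
    assert(t->band >= 0);
    if (temperature > t->setpoint + t->band) {
        t->cooling_on = true;   /* too warm: switch cooling on */
    } else if (temperature < t->setpoint - t->band) {
        t->cooling_on = false;  /* cool enough: switch cooling off */
    }
    /* inside the band the previous state is kept */
    return t->cooling_on;
}
```

In a real system the step would be driven by a timer interrupt or a periodic RTOS task, with the reading supplied by a sensor driver.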
Example: Elevator Controller
Functional Design & Mapping
[Figure: a functional design (functions F1 to F5) mapped onto an architectural design comprising threads, an RTOS/drivers layer, hardware blocks HW1 to HW4, and a hardware interface. Source: Ian Phillips, ARM, VSIA 2001]
1. Digital Camera: An Embedded System
Source: Embedded System Design, Frank Vahid / Tony Givargis (John Wiley & Sons, Inc., 2002)
1. Digital Camera: An Embedded System
• Design
– Four implementations
– Issues:
• General-purpose vs. single-purpose processors?
• Partitioning of functionality among different processor types?
Introduction to a simple digital camera
• Captures images
• Stores images in digital format
– No film
– Multiple images stored in camera
• Number depends on amount of memory
and bits used per image
• Downloads images to PC
Introduction to a simple digital camera…
• Only recently possible
– Systems-on-a-chip
• Multiple processors and memories on
one IC
– High-capacity flash memory
• Very simple description used for example
– Many more features with real digital camera
• Variable size images, image deletion,
digital stretching, zooming in and out, etc.
Designer’s perspective
• Two key tasks
1. Processing images and storing in memory
• When shutter pressed:
– Image captured
– Converted to digital form by charge-coupled
device (CCD)
– Compressed and archived in internal memory
2. Uploading images to PC
• Digital camera attached to PC
• Special software commands camera to transmit
archived images serially
Charge-coupled device (CCD)
• Special sensor that captures an image
• Light-sensitive silicon solid-state device composed of many cells
• Some of the columns are covered with a black strip of paint. The light intensity of these pixels is used for zero-bias adjustments of all the cells.
• Electronic circuitry discharges the cells, activates the electromechanical shutter, and then reads the 8-bit charge value of each cell. These values can be clocked out of the CCD by external logic through a standard parallel bus interface.
[Figure: CCD cell array with the pixel columns labeled]
Zero-bias error
• Manufacturing errors cause cells to measure slightly above or below actual light intensity
• Error is typically the same across columns, but different across rows
• Some of the leftmost columns are blocked by black paint to detect zero-bias error
– A reading other than 0 in blocked cells is the zero-bias error
– Each row is corrected by subtracting the average error found in blocked cells for that row
Zero-bias error…
Before zero-bias adjustment        Covered cells   Zero-bias adjustment   After zero-bias adjustment
136 170 155 140 144 115 112 248    12 14           -13                    123 157 142 127 131 102  99 235
145 146 168 123 120 117 119 147    12 10           -11                    134 135 157 112 109 106 108 136
144 153 168 117 121 127 118 135     9  9            -9                    135 144 159 108 112 118 109 126
176 183 161 111 186 130 132 133     0  0             0                    176 183 161 111 186 130 132 133
144 156 161 133 192 153 138 139     7  7            -7                    137 149 154 126 185 146 131 132
122 131 128 147 206 151 131 127     2  0            -1                    121 130 127 146 205 150 130 126
121 155 164 185 254 165 138 129     4  4            -4                    117 151 160 181 250 161 134 125
173 175 176 183 188 184 117 129     5  5            -5                    168 170 171 178 183 179 112 124
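The per-row correction in the table can be sketched as follows. This is a minimal illustration, assuming 8 pixels and 2 covered cells per row as in the example, and an integer (truncating) average, which reproduces every adjustment shown; the function and constant names are hypothetical.

```c
#include <assert.h>

enum { PIXELS_PER_ROW = 8, COVERED_PER_ROW = 2 };

/* Subtract the average reading of the covered cells from each pixel of a row.
 * Returns the bias that was subtracted. */
int zero_bias_adjust_row(short pixels[PIXELS_PER_ROW],
                         const short covered[COVERED_PER_ROW])
{
    int sum = 0;
    int bias;
    int c;

    assert(pixels != 0 && covered != 0);
    for (c = 0; c < COVERED_PER_ROW; c++) {
        sum += covered[c];
    }
    bias = sum / COVERED_PER_ROW;      /* integer average of the covered cells */
    for (c = 0; c < PIXELS_PER_ROW; c++) {
        pixels[c] = (short)(pixels[c] - bias);
    }
    return bias;
}
```

Applied to the first row of the table (covered cells 12 and 14), the bias is 13 and the first pixel goes from 136 to 123.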
Compression
• Store more images
• Transmit image to PC in less time
• JPEG (Joint Photographic Experts Group)
Compression…
DCT step
• Transforms original 8 x 8 block into a
cosine-frequency domain
– Upper-left corner values represent more of the
essence of the image
(Average for the image)
– Lower-right corner values represent finer details
• Can reduce precision of these values and
retain reasonable image quality
• Quantize – many may become 0
DCT step…
Quantization step…
After being encoded using DCT:
1150   39  -43  -10   26  -83   11   41
 -81   -3  115  -73   -6   -2   22   -5
  14  -11    1  -42   26   -3   17  -38
   2  -61  -13  -12   36  -23  -18    5
  44   13   37   -4   10  -21    7   -8
  36  -11   -9   -4   20  -28  -21   14
 -19   -7   21   -6    3    3   12  -21
  -5  -13  -11  -17   -4   -1    7   -4
Divide each cell’s value by 8
After quantization:
 144    5   -5   -1    3  -10    1    5
 -10    0   14   -9   -1    0    3   -1
   2   -1    0   -5    3    0    2   -5
   0   -8   -2   -2    5   -3   -2    1
   6    2    5   -1    1   -3    1   -1
   5   -1   -1   -1    3   -4   -3    2
  -2   -1    3   -1    0    0    2   -3
  -1   -2   -1   -2   -1    0    1   -1
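The divide-by-8 step can be sketched as below. The example values are consistent with rounding to the nearest integer, with halves rounded away from zero (44 becomes 6, -4 becomes -1). That rounding rule is inferred from the numbers rather than stated on the slide, and the function name is hypothetical.

```c
#include <assert.h>

/* Quantize one DCT coefficient: divide by 8, rounding to the nearest integer
 * with halves rounded away from zero (rule inferred from the example table). */
short quantize_div8(short coefficient)
{
    int v = coefficient;
    int q = (v >= 0) ? (v + 4) / 8 : -((-v + 4) / 8);
    return (short)q;
}

/* Quantize a whole 8 x 8 block in place. */
void quantize_block(short block[8][8])
{
    int r, c;
    assert(block != 0);
    for (r = 0; r < 8; r++) {
        for (c = 0; c < 8; c++) {
            block[r][c] = quantize_div8(block[r][c]);
        }
    }
}
```

Many small coefficients collapse to 0 here, which is what lets the later encoding step shrink the data.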
Huffman encoding step
• Serialize 8 x 8 block of pixels
– Values are converted into single list using
zigzag pattern
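The serialization can be sketched as a walk over the anti-diagonals of the block, alternating direction on each one. This assumes the conventional JPEG zigzag order, which starts at the upper-left corner and moves right first; the slide does not spell out the exact order, and the function name is hypothetical.

```c
#include <assert.h>

/* Serialize an 8 x 8 block into a 64-entry list in zigzag order.
 * Diagonal s holds the cells with row + column == s. */
void zigzag_serialize(short block[8][8], short out[64])
{
    int n = 0;
    int s, r;

    assert(block != 0 && out != 0);
    for (s = 0; s < 15; s++) {
        int rmin = (s < 8) ? 0 : s - 7;   /* first valid row on this diagonal */
        int rmax = (s < 8) ? s : 7;       /* last valid row on this diagonal  */
        if (s % 2 == 0) {
            /* even diagonals run from lower-left to upper-right */
            for (r = rmax; r >= rmin; r--) {
                out[n++] = block[r][s - r];
            }
        } else {
            /* odd diagonals run from upper-right to lower-left */
            for (r = rmin; r <= rmax; r++) {
                out[n++] = block[r][s - r];
            }
        }
    }
}
```

Because the quantized high-frequency values in the lower-right corner are mostly 0, this order tends to gather them into one long run at the end of the list.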
Archive step
• Record starting address and image size
– Can use linked list
• One possible way to archive images
– If max number of images archived is N:
• Set aside memory for N addresses and N image-size
variables
• Keep a counter for location of next available address
• Initialize addresses and image-size variables to 0
• Set global memory address to N x 4
– Assuming addresses, image-size variables occupy N x 4 bytes
• First image archived starting at address N x 4
• Global memory address updated to N x 4 + (compressed
image size)
• Memory requirement based on N, image size, and average
compression ratio
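The archive scheme described above can be sketched as a small bookkeeping structure. This is a sketch under stated assumptions: N is set to 50 (borrowed from the "at least 50 images" example), addresses and sizes are 2 bytes each so that the two tables together occupy N x 4 bytes, and the counter is kept outside that table area. All names are hypothetical.

```c
#include <assert.h>

#define MAX_IMAGES 50u   /* N, assumed value */

typedef struct {
    unsigned short address[MAX_IMAGES]; /* starting address of each image */
    unsigned short size[MAX_IMAGES];    /* compressed size of each image  */
    unsigned short count;               /* location of next available slot */
    unsigned short next_free;           /* global memory address */
} Archive;

/* Zero the tables and point the global memory address just past them. */
void archive_init(Archive *a)
{
    unsigned i;
    assert(a != 0);
    for (i = 0; i < MAX_IMAGES; i++) {
        a->address[i] = 0;
        a->size[i] = 0;
    }
    a->count = 0;
    a->next_free = (unsigned short)(MAX_IMAGES * 4u);
}

/* Record one compressed image. Returns 0 on success, -1 if N images are stored. */
int archive_add(Archive *a, unsigned short compressed_size)
{
    assert(a != 0);
    if (a->count >= MAX_IMAGES) {
        return -1;
    }
    a->address[a->count] = a->next_free;
    a->size[a->count] = compressed_size;
    a->next_free = (unsigned short)(a->next_free + compressed_size);
    a->count++;
    return 0;
}
```

With N = 50, the first image starts at address 200 and a 1,000-byte image moves the global memory address to 1,200. A real implementation would also check the remaining memory before accepting an image.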
Uploading to PC
Requirements Specification
• System’s requirements – what system
should do
– Nonfunctional requirements
• Constraints on design metrics (e.g.,
“should use 0.001 watt or less”)
– Functional requirements
• System’s behavior (e.g., “output X should
be input Y times 2”)
– ….
Requirements Specification…
Initial specification may be very general and come from the marketing dept.
• E.g., short document detailing market need for a low-end digital camera that:
– captures and stores at least 50 low-res images and uploads to PC,
– costs around $100 with single medium-size IC costing less than $25,
– has as long a battery life as possible,
– has expected sales volume of 200,000 if market entry < 6 months, 100,000 if between 6 and 12 months, and insignificant sales beyond 12 months
Nonfunctional requirements
Nonfunctional requirements…
• Constrained metrics
– Values must be below (sometimes above) a certain threshold
• Optimization metrics
– Improved as much as possible to improve the product
• A metric can be both constrained and optimization
Nonfunctional requirements…
• Power
– Must operate below certain temperature (cooling
fan not possible)
– Therefore, constrained metric
• Energy
– Reducing power or time reduces energy
– Optimized metric: want battery to last as long as
possible
FDCT (Forward DCT) formula
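For reference, the standard 8 x 8 forward DCT used by JPEG can be written as below. It is assumed to be the form intended here, and it is consistent with the cosine table used by the CODEC. The symbol D_{xy} (the pixel at row x, column y of the block) is a naming assumption.

```latex
C(h) =
\begin{cases}
  \dfrac{1}{\sqrt{2}} & \text{if } h = 0 \\
  1 & \text{otherwise}
\end{cases}
\qquad
F(u,v) = \frac{1}{4}\, C(u)\, C(v)
  \sum_{x=0}^{7} \sum_{y=0}^{7}
  D_{xy}\,
  \cos\!\left(\frac{\pi (2x+1)\, u}{16}\right)
  \cos\!\left(\frac{\pi (2y+1)\, v}{16}\right)
```

Since x and u each range over 0 to 7, the cosine term takes only 64 distinct argument values.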
CODEC…
• Implementing FDCT formula
• Only 64 possible inputs to COS, so a lookup table can be used to save computation time
– Floating-point values multiplied by 32,768 and rounded to nearest integer
– 32,768 chosen in order to store each value in 2 bytes of memory
– Fixed-point representation explained more later
• FDCT unrolls inner loop of summation, implements outer summation as two consecutive for loops
CODEC…
/* COS_TABLE[x][u] holds cos(pi * (2x + 1) * u / 16) scaled by 32,768.
   Note that 32768 itself exceeds the range of a 16-bit signed short. */
static const short COS_TABLE[8][8] = {
    { 32768,  32138,  30273,  27245,  23170,  18204,  12539,   6392 },
    { 32768,  27245,  12539,  -6392, -23170, -32138, -30273, -18204 },
    { 32768,  18204, -12539, -32138, -23170,   6392,  30273,  27245 },
    { 32768,   6392, -30273, -18204,  23170,  27245, -12539, -32138 },
    { 32768,  -6392, -30273,  18204,  23170, -27245, -12539,  32138 },
    { 32768, -18204, -12539,  32138, -23170,  -6392,  30273, -27245 },
    { 32768, -27245,  12539,   6392, -23170,  32138, -30273,  18204 },
    { 32768, -32138,  30273, -27245,  23170, -18204,  12539,  -6392 }
};
Executable model of digital camera
[Figure: executable model built from the modules CCD.C, CCDPP.C, CODEC.C, CNTRL.C, and UART.C, with an image file (bit stream) as input and an output file (bit stream) as result]
CNTRL (controller) module
• Heart of the system
• CntrlCaptureImage uses CCDPP module to input image
and place in buffer
• CntrlCompressImage breaks the 64 x 64 buffer into 8 x 8
blocks and performs FDCT on each block using the
CODEC module
– Also performs quantization on each block
• CntrlSendImage transmits encoded image serially using
UART module
CNTRL (controller) module…
• CntrlInitialize for consistency with other modules only

#define SZ_ROW 64
#define SZ_COL 64
#define NUM_ROW_BLOCKS (SZ_ROW / 8)
#define NUM_COL_BLOCKS (SZ_COL / 8)

static short buffer[SZ_ROW][SZ_COL], i, j, k, l, temp;

void CntrlInitialize(void) {}

void CntrlCaptureImage(void) {
    CcdppCapture();
    for(i=0; i<SZ_ROW; i++)
        for(j=0; j<SZ_COL; j++)
            buffer[i][j] = CcdppPopPixel();
}

void CntrlCompressImage(void) {
    for(i=0; i<NUM_ROW_BLOCKS; i++)
        for(j=0; j<NUM_COL_BLOCKS; j++) {
            for(k=0; k<8; k++)
                for(l=0; l<8; l++)
                    CodecPushPixel((char)buffer[i * 8 + k][j * 8 + l]);
            CodecDoFdct(); /* part 1 - FDCT */
            for(k=0; k<8; k++)
                for(l=0; l<8; l++) {
                    buffer[i * 8 + k][j * 8 + l] = CodecPopPixel();
                    /* part 2 - quantization */
                    buffer[i*8+k][j*8+l] >>= 6;
                }
        }
}

void CntrlSendImage(void) {
    for(i=0; i<SZ_ROW; i++)
        for(j=0; j<SZ_COL; j++) {
            temp = buffer[i][j];
            UartSend(((char*)&temp)[0]); /* send upper byte */
            UartSend(((char*)&temp)[1]); /* send lower byte */
        }
}
Design
Design..
• Implementation
– A particular architecture and mapping
– Solution space is set of all implementations
• Starting point
– Low-end general-purpose processor connected to flash memory
• All functionality mapped to software running on processor
• Usually satisfies power, size, time-to-market constraints
• If timing constraint not satisfied then try:
– use single-purpose processors for time-critical functions
– rewrite functional specification
Implementation 1: Microcontroller alone
• Low-end processor could be Intel 8051
microcontroller
• Total IC cost including NRE about $5
• Well below 200 mW power
• Time-to-market about 3 months
• However…
Implementation 1: Microcontroller alone…
• However, one image per second not possible
– 12 MHz, 12 cycles per instruction
• Executes one million instructions per second
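The instruction budget behind this observation can be worked out with a few lines of arithmetic. The sketch below takes the 12 MHz clock and 12 cycles per instruction from the slide and assumes the 64 x 64 (4,096 pixel) image used elsewhere in this example; the function names are hypothetical.

```c
#include <assert.h>

/* Instructions executed per second for a given clock and cycles per instruction. */
unsigned long instructions_per_second(unsigned long clock_hz,
                                      unsigned long cycles_per_instruction)
{
    assert(cycles_per_instruction > 0);
    return clock_hz / cycles_per_instruction;
}

/* Instructions available per pixel if a whole image must be handled
 * within one second (integer division, so the result is rounded down). */
unsigned long instruction_budget_per_pixel(unsigned long ips,
                                           unsigned long pixels_per_image)
{
    assert(pixels_per_image > 0);
    return ips / pixels_per_image;
}
```

At 12 MHz and 12 cycles per instruction the processor executes 1,000,000 instructions per second, which leaves a budget of 244 instructions per pixel for capture, zero-bias correction, DCT, quantization, and transmission combined.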
Implementation 2:
Microcontroller and CCDPP
[Figure: SOC block diagram containing the 8051, EEPROM, RAM, UART, and CCDPP]
Microcontroller
[Figure: block diagram of Intel 8051 processor core, showing the instruction decoder, controller, ALU, 4K ROM, and 128-byte RAM]
FSMD description of UART
[Figure: FSMD description of UART. Legible labels: Idle state (I=0), transition on "invoked" to Start state (transmit LOW), loop condition I < 8]
CCDPP
• Hardware implementation of zero-bias operations
• Interacts with external CCD chip
– CCD chip resides external to our SOC mainly because combining CCD with ordinary logic is not feasible
• Internal buffer, B, memory-mapped to 8051
• Variables R, C are buffer’s row, column indices
• GetRow state reads one row from the CCD into B, one pixel (Pxl) at a time, while C < 66
• The bias for that row is then computed and stored in variable Bias
• FixBias state iterates over same row subtracting Bias from each element
• NextRow transitions to GetRow for repeat of process on next row or to Idle state when all 64 rows completed
[Figure: FSMD description of CCDPP. Idle (R=0, C=0) is left on "invoked"; GetRow performs B[R][C]=Pxl and C=C+1 while C < 66 and exits at C = 66; FixBias performs B[R][C]=B[R][C]-Bias and exits at C = 64; the machine returns to Idle at R = 64]
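The state sequence above can be sketched in software as a switch-based state machine. This is an illustrative model rather than the hardware description: it assumes each row arrives as 66 bytes with the two covered cells last, uses a plain array `raw` in place of the CCD interface, and names the bias-computing state `COMPUTE_BIAS`. All identifiers are hypothetical.

```c
#include <assert.h>

enum { ROWS = 64, PIXELS = 64, ROW_BYTES = 66 };

typedef enum { IDLE, GET_ROW, COMPUTE_BIAS, FIX_BIAS, NEXT_ROW } CcdppState;

typedef struct {
    CcdppState state;
    int r, c;                          /* row and column indices R, C */
    int bias;                          /* variable Bias */
    unsigned char b[ROWS][ROW_BYTES];  /* internal buffer B */
} Ccdpp;

/* Run one capture: raw must hold ROWS * ROW_BYTES bytes in row order. */
void ccdpp_capture(Ccdpp *p, const unsigned char *raw)
{
    int n = 0;

    assert(p != 0 && raw != 0);
    p->r = 0;
    p->c = 0;
    p->bias = 0;
    p->state = GET_ROW;                /* Idle -> GetRow when invoked */
    while (p->state != IDLE) {
        switch (p->state) {
        case GET_ROW:
            if (p->c < ROW_BYTES) {
                p->b[p->r][p->c] = raw[n++];
                p->c++;
            } else {
                p->state = COMPUTE_BIAS;
            }
            break;
        case COMPUTE_BIAS:             /* average of the two covered cells */
            p->bias = (p->b[p->r][PIXELS] + p->b[p->r][PIXELS + 1]) / 2;
            p->c = 0;
            p->state = FIX_BIAS;
            break;
        case FIX_BIAS:                 /* wraps if a pixel is below the bias */
            if (p->c < PIXELS) {
                p->b[p->r][p->c] = (unsigned char)(p->b[p->r][p->c] - p->bias);
                p->c++;
            } else {
                p->state = NEXT_ROW;
            }
            break;
        case NEXT_ROW:
            p->r++;
            p->c = 0;
            p->state = (p->r < ROWS) ? GET_ROW : IDLE;
            break;
        case IDLE:
            break;
        }
    }
}
```

Each pass through the loop corresponds to one state transition or one datapath action, which is roughly how the FSMD would advance on each clock.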
Connecting SOC components
• Memory-mapped
– All single-purpose processors and RAM are connected to
8051’s memory bus
• Read
– Processor places address on 16-bit address bus
– Asserts read control signal for 1 cycle
– Reads data from 8-bit data bus 1 cycle later
– Device (RAM or SPP) detects asserted read control signal
– Checks address
– Places and holds requested data on data bus for 1 cycle
Connecting SOC components…
• Write
– Processor places address and data on address and
data bus
– Asserts write control signal for 1 clock cycle
– Device (RAM or SPP) detects asserted write control
signal
– Checks address bus
– Reads and stores data from data bus
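The read and write sequences can be modeled in software as address decoding over a list of devices. The sketch below ignores cycle timing and control-signal edges and only shows the "checks address, then supplies or stores data" part. The `BusDevice` type and function names are hypothetical, as are the addresses used in the usage note.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one device (RAM or single-purpose processor) on the bus. */
typedef struct {
    uint16_t base;     /* first address the device responds to */
    uint16_t size;     /* number of consecutive addresses it occupies */
    uint8_t *storage;  /* backing bytes for those addresses */
} BusDevice;

/* Does this device recognize the address on the 16-bit address bus? */
static int bus_selects(const BusDevice *d, uint16_t addr)
{
    return addr >= d->base && (uint32_t)(addr - d->base) < d->size;
}

/* Read: the selected device places its byte on the 8-bit data bus.
 * Returns 1 if some device answered, 0 otherwise. */
int bus_read(const BusDevice *devs, size_t n, uint16_t addr, uint8_t *data)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (bus_selects(&devs[i], addr)) {
            *data = devs[i].storage[addr - devs[i].base];
            return 1;
        }
    }
    return 0;
}

/* Write: the selected device reads the data bus and stores the byte. */
int bus_write(const BusDevice *devs, size_t n, uint16_t addr, uint8_t data)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (bus_selects(&devs[i], addr)) {
            devs[i].storage[addr - devs[i].base] = data;
            return 1;
        }
    }
    return 0;
}
```

For example, with a 16-byte RAM at address 0 and a one-byte register at 0xFFFF, a write to 0xFFFF reaches only the register, and a read from an unmapped address such as 0x8000 is answered by no device.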
Software
• System-level model provides majority of code
– Module hierarchy, procedure names, and main program unchanged
• Code for UART and CCDPP modules must be redesigned
– Simply replace with memory assignments
• xdata used to load/store variables over external memory bus
• _at_ specifies memory address to store these variables
• Byte sent to U_TX_REG by processor will invoke UART
• U_STAT_REG used by UART to indicate it's ready for next byte
– UART may be much slower than processor
– Similar modification for CCDPP code
• All other modules untouched
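The replacement UART code can be sketched portably as follows. On the 8051 the two registers would be declared with `xdata` and `_at_` at addresses fixed by the hardware design. Here they are ordinary `volatile` variables standing in for those registers, and the convention that a status value of 1 means "busy" is an assumption.

```c
/* Stand-ins for the memory-mapped UART registers (assumption: on the target
 * these would be xdata variables placed with _at_ at hardware-defined addresses). */
static volatile unsigned char U_TX_REG;
static volatile unsigned char U_STAT_REG;  /* assumed: 1 = busy, 0 = ready */

void UartInitialize(void) {}

/* Wait until the UART is ready, then hand it the next byte.
 * Writing U_TX_REG is what invokes the UART hardware. */
void UartSend(unsigned char d)
{
    while (U_STAT_REG == 1) {
        /* busy wait: the UART may be much slower than the processor */
    }
    U_TX_REG = d;
}
```

The rest of the software keeps calling `UartSend` exactly as in the system-level model; only the body changes from simulated output to a register write.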
Analysis
• Entire SOC tested on VHDL simulator
– Interprets VHDL descriptions and functionally simulates execution of system
[Figure: obtaining design metrics of interest; legible labels: VHDL descriptions, power equation]
Analysis…
• Gate-level description obtained through synthesis
– Synthesis tool like
[Figure: obtaining design metrics of interest; legible label: chip area]
Implementation 2:
Microcontroller and CCDPP
• Analysis of implementation 2
– Total execution time for processing one image:
• 9.1 seconds
– Power consumption:
• 0.033 watt
– Energy consumption:
• 0.30 joule (9.1 s x 0.033 watt)
– Total chip area:
• 98,000 gates
Implementation 3: Microcontroller and
CCDPP/Fixed-Point DCT
• 9.1 seconds still doesn’t meet performance
constraint of 1 second
• DCT operation prime candidate for improvement
– Execution of implementation 2 shows microprocessor
spends most cycles here
– Could design custom hardware like we did for CCDPP
• More complex so more design effort
– Instead, will speed up DCT functionality by modifying
behavior
DCT floating-point cost
• Floating-point cost
– DCT uses ~260 floating-point operations per pixel
transformation
– 4096 (64 x 64) pixels per image
– 1 million floating-point operations per image
– No floating-point support with Intel 8051
• Compiler must emulate
– Generates procedures for each floating-point operation
» mult, add
– Each procedure uses tens of integer operations
– Thus, > 10 million integer operations per image
– Procedures increase code size
• Fixed-point arithmetic can improve on this
Fixed-point arithmetic
Fixed-point arithmetic…
Fixed-point arithmetic operations
• Addition
– Simply add integer representations
– E.g., 3.14 + 2.71 = 5.85
• 3.14 → 50 = 00110010
• 2.71 → 43 = 00101011
• 50 + 43 = 93 = 01011101
• 5(0101) + 13(1101) x 0.0625 = 5.8125 ≈ 5.85
• Multiply
– Multiply integer representations
– Shift result right by # of bits in fractional part
– E.g., 3.14 * 2.71 = 8.5094
• 50 * 43 = 2150 = 100001100110
• >> 4 = 10000110
• 8(1000) + 6(0110) x 0.0625 = 8.375 ≈ 8.5094
• Range of real values used limited by bit widths of possible resulting
values
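The two worked examples above can be reproduced with a small fixed-point sketch using 4 fractional bits, as in the slide. The type and function names are hypothetical, and the conversion helper rounds to the nearest representable value, which is an assumption that matches 3.14 → 50 and 2.71 → 43.

```c
#include <stdint.h>

#define FRAC_BITS 4   /* 4 fractional bits: resolution 1/16 = 0.0625 */

typedef int16_t fixed_t;

/* Convert a real value to fixed point, rounding to the nearest step. */
fixed_t fx_from_double(double x)
{
    double scaled = x * (1 << FRAC_BITS);
    return (fixed_t)(scaled >= 0 ? scaled + 0.5 : scaled - 0.5);
}

/* Convert back to a real value for display or checking. */
double fx_to_double(fixed_t f)
{
    return (double)f / (1 << FRAC_BITS);
}

/* Addition: simply add the integer representations. */
fixed_t fx_add(fixed_t a, fixed_t b)
{
    return (fixed_t)(a + b);
}

/* Multiplication: multiply, then shift right by the number of fractional bits.
 * Note: right-shifting a negative product is implementation-defined in C. */
fixed_t fx_mul(fixed_t a, fixed_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    return (fixed_t)(product >> FRAC_BITS);
}
```

With these helpers, 50 + 43 gives 93 (5.8125) and 50 * 43 shifted right by 4 gives 134 (8.375), matching the slide. The gap between 8.375 and the exact 8.5094 shows how much precision 4 fractional bits cost.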
Fixed-point implementation of CODEC
static const char code COS_TABLE[8][8] = {
• COS_TABLE gives 8-bit fixed-point { 64, 62, 59, 53, 45, 35, 24, 12 },
representation of cosine values { 64, 53, 24, -12, -45, -62, -59, -35 },
{ 64, 35, -24, -62, -45, 12, 59, 53 },
{ 64, 12, -59, -35, 45, 53, -24, -62 },
• 6 bits used for fractional portion { 64, -12, -59, 35, 45, -53, -24, 62 },
{ 64, -35, -24, 62, -45, -12, 59, -53 },
{ 64, -53, 24, 12, -45, 62, -59, 35 },
• Result of multiplications shifted { 64, -62, 59, -53, 45, -35, 24, -12 }
right by 6 };
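A fixed-point FDCT built on this table might look like the sketch below. It is not the CODEC source: it computes the double summation directly instead of unrolling it, widens intermediates to 64 bits for clarity (the 8051 code would have to work in much narrower types), and represents 1/sqrt(2) as 45/64, which is an assumption of this sketch. The table is declared as `signed char` because plain `char` may be unsigned on some compilers.

```c
#include <stdint.h>

/* 8-bit cosine table from the slide: 6 fractional bits, so 64 represents 1.0. */
static const signed char COS_TABLE[8][8] = {
    { 64,  62,  59,  53,  45,  35,  24,  12 },
    { 64,  53,  24, -12, -45, -62, -59, -35 },
    { 64,  35, -24, -62, -45,  12,  59,  53 },
    { 64,  12, -59, -35,  45,  53, -24, -62 },
    { 64, -12, -59,  35,  45, -53, -24,  62 },
    { 64, -35, -24,  62, -45, -12,  59, -53 },
    { 64, -53,  24,  12, -45,  62, -59,  35 },
    { 64, -62,  59, -53,  45, -35,  24, -12 }
};

#define ONE_OVER_ROOT_TWO 45   /* about 0.7071 * 64 (assumed approximation) */

/* C(h) in 6-bit fixed point: 1/sqrt(2) for h == 0, otherwise 1.0. */
static int32_t fdct_scale(int h)
{
    return (h == 0) ? ONE_OVER_ROOT_TWO : 64;
}

/* Forward DCT of one 8 x 8 block using integer arithmetic only. */
void fdct_8x8(short in[8][8], short out[8][8])
{
    int u, v, x, y;
    for (u = 0; u < 8; u++) {
        for (v = 0; v < 8; v++) {
            int64_t sum = 0;
            for (x = 0; x < 8; x++) {
                for (y = 0; y < 8; y++) {
                    sum += (int64_t)in[x][y] * COS_TABLE[x][u] * COS_TABLE[y][v];
                }
            }
            /* 12 fractional bits from the two cosines, 12 more from
             * C(u) * C(v), and 2 bits for the factor 1/4: divide by 2^26. */
            out[u][v] = (short)(sum * fdct_scale(u) * fdct_scale(v)
                                / ((int64_t)1 << 26));
        }
    }
}
```

For a block in which every pixel is 100, the exact DC coefficient F(0,0) is 800 and all other coefficients are 0. This sketch yields 791 for F(0,0) because of the coarse 45/64 approximation, and exactly 0 elsewhere.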
Implementation 3: Microcontroller and
CCDPP/Fixed-Point DCT
• Analysis of implementation 3
– Use same analysis techniques as implementation 2
– Total execution time for processing one image:
• 1.5 seconds
– Power consumption:
• 0.033 watt (same as 2)
– Energy consumption:
• 0.050 joule (1.5 s x 0.033 watt)
• Battery life 6x longer!!
– Total chip area:
• 90,000 gates
• 8,000 fewer gates (less memory needed for code)
Implementation 4:
Microcontroller and CCDPP/DCT
[Figure: SOC block diagram; legible labels: EEPROM, 8051, RAM]
Rewritten CODEC software
CODEC design
• 4 memory mapped registers
– C_DATAI_REG/C_DATAO_REG

static unsigned char xdata C_STAT_REG _at_ 65527;
static unsigned char xdata C_CMND_REG _at_ 65528;
static unsigned char xdata C_DATAI_REG _at_ 65529;
static unsigned char xdata C_DATAO_REG _at_ 65530;

void CodecInitialize(void) {}

void CodecPushPixel(short p) { C_DATAO_REG = (char)p; }

short CodecPopPixel(void) {
    /* Two separate reads so that their order is well defined;
       assumes the upper byte is delivered first. */
    short upper = C_DATAI_REG;
    short lower = C_DATAI_REG;
    return ((upper << 8) | lower);
}

void CodecDoFdct(void) {
    C_CMND_REG = 1;
    while( C_STAT_REG == 1 ) { /* busy wait */ }
}
Summary of implementations
                       Implementation 2   Implementation 3   Implementation 4
Performance (second)   9.1                1.5                0.099
Power (watt)           0.033              0.033              0.040
Size (gate)            98,000             90,000             128,000
Energy (joule)         0.30               0.050              0.0040
• Implementation 3
– Close in performance
– Cheaper
– Less time to build
• Implementation 4
– Great performance and energy consumption
– More expensive and may miss time-to-market window
• If DCT designed ourselves then increased NRE cost and time-to-market
• If existing DCT purchased then increased IC cost
• Which is better?
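The energy row of the summary follows from the other two rows, since energy is power multiplied by execution time. A minimal sketch of that arithmetic, with hypothetical function names:

```c
/* Energy in joules consumed while processing one image. */
double energy_joules(double power_watts, double time_seconds)
{
    return power_watts * time_seconds;
}

/* How many times more images a battery can process with the lower-energy design. */
double battery_life_ratio(double energy_before, double energy_after)
{
    return energy_before / energy_after;
}
```

With the table's figures, implementation 2 uses about 0.30 J per image (0.033 W x 9.1 s), implementation 3 about 0.050 J, and implementation 4 about 0.0040 J. This is why implementation 4 wins on energy despite drawing more power.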
Digital Camera -- Summary
We looked at details of
Web Servers… get smaller
iPic : Tiny Web-Server
2mm*2mm,
PIC 12c508
512b ROM, 24b RAM,
6bits IO, 4MHz RC
Micro-Electromechanical Structures