Advanced Computer Architecture Guide

1. Computer architecture deals with the design of the instruction set, the functional units and their interconnection at the hardware level. Microarchitecture refers to how a particular instruction set architecture is implemented in hardware.
2. Out-of-order execution overcomes data hazards by using techniques like register renaming, reservation stations, reorder buffers and dynamic scheduling to allow instructions to execute in any order as long as dependencies are respected.
3. Caches exploit temporal and spatial locality to reduce the average memory access time by storing frequently used data closer to the processor in a hierarchical memory structure. Cache performance is optimized by reducing miss rates and miss penalties through techniques like larger blocks, higher associativity, write buffers and multilevel caching.

Uploaded by

Sridhar Gunnam

Computer Architecture:

https://www.youtube.com/watch?v=TMJj015C93A&list=PLAwxTw4SYaPkr-vo9gKBTid_BWpWEfuXe&index=76 - coherence
https://www.youtube.com/watch?v=fT4DJcNCisM&list=PLAwxTw4SYaPndXEsI4kAa6BDSTRbkCKJN - consistency
Architecture vs microarchitecture
    Architecture - ISA
    Microarchitecture - organisation

A load-store queue can be used to handle the dependency between a load and a store to the same address.

https://www.youtube.com/watch?v=bEB7sZTP8zc&list=PLAwxTw4SYaPkNw98-MFodLzKgi6bYGjZs&t=22
Note: WAW is a problem in multiple-issue designs or in OOO designs.
WAW and WAR are also called false dependencies. (If instruction I3 writes register R3 and I5 also writes R3, then the instructions after I5 (I6, I7, I8, ...) should see the value of R3 written by I5.)

To overcome control dependencies: branch prediction

To overcome false dependencies:
1) Register duplication: hold the older register values (say R3 in the example above) and give the latest one to the later instructions.
2) Register renaming: this divides the registers into architectural registers and physical registers.
        Architectural registers: what the programmer sees
        Physical registers: what actually exists in hardware
    A register allocation table (RAT, also called a register alias table) is used to map architectural registers to physical registers:
        R1-->P1
        R2-->P9
        R3-->P2 and so on

    Renaming only removes output dependencies (WAW) and anti-dependencies (WAR); it cannot remove RAW.
To overcome RAW in a narrow-issue design: forwarding + stalling
To overcome RAW in a wide-issue design: OOO
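
The renaming idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name `rename`, the three-instruction program and the "always take the next free physical register" policy are made up here, and freeing of physical registers is ignored.

```python
def rename(instructions, num_arch_regs=4):
    """Rename (dest, src1, src2) triples of architectural register numbers."""
    # Assumption: initially architectural register Ri maps to physical register Pi.
    rat = {r: r for r in range(num_arch_regs)}
    next_free = num_arch_regs  # next unused physical register (never recycled here)
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources read the current mapping, so true (RAW) dependencies are preserved.
        p1, p2 = rat[src1], rat[src2]
        # Every write gets a fresh physical register, which removes WAW and WAR.
        rat[dest] = next_free
        next_free += 1
        renamed.append((rat[dest], p1, p2))
    return renamed, rat


program = [
    (3, 1, 2),  # I1: R3 = R1 op R2
    (1, 3, 0),  # I2: R1 = R3 op R0  (RAW on R3; WAR on R1 against I1)
    (3, 2, 0),  # I3: R3 = R2 op R0  (WAW on R3 against I1)
]
renamed, rat = rename(program)
for dest, s1, s2 in renamed:
    print(f"P{dest} = P{s1} op P{s2}")
```

After renaming, I1 and I3 write different physical registers, so they can complete in either order, while I2 still reads the register that I1 writes.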

1) Tomasulo:
    Issue: the instruction is sent to a reservation station (RS) after going through the RAT (here the RAT maps registers --> reservation stations).
        If the reservation stations are full then we need to stall.
    Dispatch: once the operands are available, dispatch the instruction.
        If multiple instructions are ready to dispatch then we can do one of these three:
            a) send the oldest first
            b) send the one which has the most dependent instructions - power hungry and difficult
            c) random
    Broadcast: once the instruction is complete, broadcast the result to the reservation stations and free that particular RS.
        If multiple instructions are ready to broadcast then give preference to the slower execution unit.
    Tomasulo drawbacks:
        Exception handling is a problem in OOO execution (because many later instructions will already have executed instead of waiting).
        Branch mispredictions:
            div - takes 40 cycles
            beq - dependent on the div
            add - already done (say R3 already holds the sum)
            but if the branch was mispredicted, the add should not have executed, and that is a problem.
        Register values are written out of order. To overcome this we use a ROB.
2) ROB:
    a) Execute OOO
    b) Broadcast OOO
    c) Deposit results in order
    A reorder buffer is required to hold the instructions so that we can store the results in order.
    OOO has the following steps: https://www.youtube.com/watch?v=0w6lXz71eJ8&list=PLAwxTw4SYaPnhRXZ6wuHnnclMLfg_yjHs&index=6
    Issue: the instruction is sent to a reservation station and to the ROB in order (we need to remember the order in which instructions were sent).
    Dispatch: once the operands are available, dispatch the instruction.
        Once the instruction is dispatched, its RS is freed.
    Broadcast: once the instruction is complete, broadcast the result to the reservation stations and store the result in the ROB (it must be stored in that same order).
    Commit: store the computed results in order using the commit and issue pointers.
        To commit instruction I5, everything up to I4 must already have committed; otherwise I5 waits in the ROB.
        On commit, change the RAT to point back to the register.
    The commit pointer and issue pointer in the ROB tell us which entry was committed last and which was issued latest.

    A reservation station entry frees up as soon as its instruction dispatches.
    Register writes happen in order.
    Tags are ROB entries, not reservation station entries as in Tomasulo.
    No problem with branch mispredictions and exceptions, unlike Tomasulo.
    Recovery after a branch misprediction is done by:
        pointing the RAT back to the registers
        emptying the ROB
        deleting the data in the RS and EX units
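
A toy model of the in-order commit rule, assuming a ROB that is just a queue of entries with a "done" flag. The class and method names (`ReorderBuffer`, `issue`, `broadcast`, `commit`) are invented for this sketch; a real ROB also tracks destination registers, exceptions and branch state.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        # Left end = oldest entry (commit pointer), right end = newest (issue pointer).
        self.entries = deque()
        self.committed = []

    def issue(self, name):
        """Allocate a ROB entry in program order."""
        entry = {"name": name, "done": False, "value": None}
        self.entries.append(entry)
        return entry

    def broadcast(self, entry, value):
        """An execution unit finished; results may arrive out of order."""
        entry["done"] = True
        entry["value"] = value

    def commit(self):
        """Retire from the head only while the oldest entry is done."""
        while self.entries and self.entries[0]["done"]:
            e = self.entries.popleft()
            self.committed.append((e["name"], e["value"]))


rob = ReorderBuffer()
i1 = rob.issue("I1")  # e.g. a slow divide
i2 = rob.issue("I2")
i3 = rob.issue("I3")

rob.broadcast(i3, 30)  # I3 finishes first
rob.commit()
after_i3 = list(rob.committed)  # nothing retires: I1 is still pending

rob.broadcast(i1, 10)
rob.commit()
after_i1 = list(rob.committed)  # only I1 retires: I2 blocks I3

rob.broadcast(i2, 20)
rob.commit()
print(rob.committed)
```

Even though I3 completes first, its result only becomes architectural state after I1 and I2 have committed.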
Control dependencies - branch predictors
Data dependencies - RAW - OOO
                    WAR, WAW - register renaming
Memory dependencies - load-store queues
    Store-to-load forwarding (the store has not yet written its data to memory, but the load gets the data from the queue)
Superscalar: needs to fetch more than one instruction per cycle
    e.g. 2 OOO MIPS units --> 2 fetch, 2 decode, etc.
VLIW and superscalar:
    VLIW: one big instruction per cycle, no OOO; it depends completely on the compiler (static scheduling done by the compiler)
    Superscalar: N small instructions per cycle; does OOO and does NOT depend on the compiler (dynamic scheduling done by hardware)

Data hazard - ways to handle it:
    schedule
    stall
    bypass
    speculate
Cache
Why was a cache needed? DRAM is slow: a DRAM access costs hundreds of cycles (noted here as ~1000 cycles).
How to place a block in the cache: direct mapped, set associative, fully associative.
    (block address mod cache_lines) = place where the data has to go.
    Note: cache_lines = 1 for fully associative.
What does a cache exploit? Temporal locality and spatial locality.
How is a block found?
    Block offset: which byte within the cache line (its width is set by the number of bytes in each cache line)
    Index: which cache line (cache_line everywhere here means cache set)
    Tag: which memory address is stored there

More associativity requires more tag comparators and fewer index bits.
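
A small sketch of how an address splits into tag, index and block offset. The line size (64 bytes) and number of sets (128) are assumed values, and `split_address` is a name made up for this example; both sizes must be powers of two.

```python
def split_address(addr, line_bytes=64, num_sets=128):
    """Split an address into (tag, index, offset)."""
    offset_bits = line_bytes.bit_length() - 1  # log2 of a power of two
    index_bits = num_sets.bit_length() - 1
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (num_sets - 1)  # block address mod num_sets
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


tag, index, offset = split_address(0x12345678)
print(tag, index, offset)

# Fully associative: one set, so there are no index bits and the index is always 0.
fa_tag, fa_index, fa_offset = split_address(0x12345678, line_bytes=64, num_sets=1)
print(fa_tag, fa_index, fa_offset)
```

With fewer sets (higher associativity at the same capacity) the index shrinks and the tag grows, which matches the note above about needing more tag comparators and fewer index bits.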

Misses:
    Compulsory miss (first access to a block)
    Capacity miss (no space in the cache)
    Conflict miss (there is space elsewhere in the cache, but the block has to go into a line that is already occupied, e.g. in a direct-mapped cache)
    Coherence miss

Fewer cache blocks - more conflicts

Which block to replace:
    Not required in a direct-mapped cache.
    In set-associative or fully associative caches:
    - Random replacement (higher miss rate)
    - LRU (using aging bits): low miss rate, but the LRU state has to be updated on every access
        implemented using LRU counters
    - MRU
    - NMRU (not most recently used): just protect the MRU block from eviction; any of the remaining blocks can be evicted.
        Better than random in terms of miss rate and better than LRU in terms of hit time.
    - PLRU (pseudo-LRU): each block has a bit which is set when the block is accessed;
        replace a block only when its bit is not set (i.e. it has not been used recently).
        Better than NMRU in miss rate and better than LRU in hit time.
    https://www.youtube.com/watch?v=8CjifA2yw7s&list=PLAwxTw4SYaPn79fsplIuZG34KwbkYSedj&index=57
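
As an illustration of true LRU within one cache set, here is a sketch that keeps the tags in an `OrderedDict` ordered from least to most recently used. The class name `LRUSet` and the 2-way example are assumptions for this sketch; hardware uses counters or aging bits rather than a dictionary.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with true LRU replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()  # tag -> None, least recently used first

    def access(self, tag):
        """Return True on a hit, False on a miss (filling or evicting as needed)."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)  # now the most recently used
            return True
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)  # evict the least recently used
        self.blocks[tag] = None
        return False


s = LRUSet(ways=2)
hits = [s.access(t) for t in ["A", "B", "A", "C", "B", "A"]]
print(hits)
print(list(s.blocks))
```

The sequence A, B, A, C, B, A thrashes a 2-way set: after the hit on A, each new tag evicts the block that is needed next.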

What happens on a write
    Cache hit:
        write-through: write to both the cache and memory - simplifies coherence
        write-back: write to the cache only - less traffic, uses a dirty bit
    Cache miss:
        no-write-allocate: write only to main memory
        write-allocate: fetch the block into the cache

Cache optimizations:
    1. Larger block size - fewer compulsory misses
    2. Larger cache size - fewer capacity misses
    3. Higher associativity - fewer conflict misses
    4. Reducing miss penalty and hit time - multilevel caches
    5. Giving reads priority over writes - using write buffers
Cache performance:
    Average memory access time = hit time + miss rate * miss penalty
    We can improve cache performance by reducing the hit time, the miss rate or the miss penalty.
    Reducing cache size decreases hit time but increases miss rate.
    Reducing associativity decreases hit time but increases miss rate.
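
A worked example of the access-time formula. All the numbers below (1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit, 20% L2 miss rate, 100-cycle memory) are made-up values chosen only to show how a second cache level reduces the L1 miss penalty.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty


# Single-level cache: every L1 miss pays the full memory latency.
single_level = amat(hit_time=1, miss_rate=0.05, miss_penalty=100)

# Two-level cache: the L1 miss penalty is the average access time of L2.
l2_time = amat(hit_time=10, miss_rate=0.20, miss_penalty=100)
two_level = amat(hit_time=1, miss_rate=0.05, miss_penalty=l2_time)

print(single_level, l2_time, two_level)
```

With these assumed numbers the single-level cache averages about 6 cycles per access, and adding the L2 brings that down to about 2.5 cycles.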
Cache pipelining: an L1 cache access that takes more than 1 cycle can easily be pipelined
    indexing and reading the data - one stage
    comparing tags - one stage
    selecting the data using the block offset - one stage
    Pipelining reduces the wait time for consecutive cache hits
    [without it, even if I1 and I2 are both hits, I2 needs to wait until I1's cache operation is done]
Physically indexed, physically tagged cache (PIPT):
    VA-->TLB-->PA-->Cache
    Adv: when the process changes (context switch) the TLB also changes, but here that is not a problem
    Disadv: SLOW
Virtually indexed, virtually tagged cache (VIVT):
    VA-->Cache, VA-->TLB (only if there is a cache miss)
    Disadv: on a process change we need to flush the cache (because the VA translations are now different)
            Many virtual addresses may map to the same physical location (aliasing)
    Adv: SPEED
Virtually indexed, physically tagged cache (VIPT):
    VA(index)-->Cache, VA(tag)-->TLB; both happen in parallel
    Adv: no cache flush on a context switch (better than VIVT)
         fast --> better than PIPT
    Disadv: need to solve aliasing (2 VAs pointing to the same PA)
Mixture of set-associative and direct-mapped cache:
    In a set-associative cache (low miss rate), instead of reading and checking all the tags,
    speculate which block is going to hit and choose that one (lower hit time).
    This is called way prediction.
Prefetching:
    Speculate what the next memory access will be and bring that block into the cache.
    Right speculation - lower latency
    Wrong speculation - a useful block is evicted and something useless is brought into the cache - this is called cache pollution
    Prefetchers:
        Stream buffers - on an access to A, fetch the next consecutive blocks.
        Stride-based prefetcher - A...B...C...
        Correlation prefetcher - ABC....ABC - detects patterns of memory accesses
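
One way to picture the stride-based prefetcher is the sketch below. It assumes the simplest possible policy: predict the next address once the same non-zero stride has been seen twice in a row. The class name `StridePrefetcher` and the address trace are invented for illustration; real prefetchers usually track strides per load PC and use confidence counters.

```python
class StridePrefetcher:
    def __init__(self):
        self.last_addr = None
        self.last_stride = None

    def access(self, addr):
        """Observe one access; return the address to prefetch, or None."""
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            # Only prefetch when the stride repeats (A...B...C with equal gaps).
            if stride != 0 and stride == self.last_stride:
                prediction = addr + stride
            self.last_stride = stride
        self.last_addr = addr
        return prediction


pf = StridePrefetcher()
predictions = [pf.access(a) for a in [100, 164, 228, 292, 300]]
print(predictions)
```

The last access breaks the 64-byte stride, so the prefetcher stops predicting instead of polluting the cache with a wrong guess.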

Flynn's taxonomy: https://www.youtube.com/watch?v=WKXbvhkzBUo&list=PLAwxTw4SYaPkr-vo9gKBTid_BWpWEfuXe&t=4
    SISD - 1 instruction stream operates on 1 data stream
    SIMD - 1 instruction stream operates on many data streams - vector processor
    MISD - multiple instruction streams operate on one data stream - stream processor
    MIMD - multiple instruction streams and multiple data streams

Multiprocessors:
    NUMA - non-uniform memory access time
        A group of clusters, where each cluster is core + cache + memory (almost a uniprocessor system).
        Each cluster accesses another cluster's memory through message passing.

    N cores at frequency f give roughly the same throughput as 4N cores at f/4; at f/4 we can also reduce the voltage, so it is more power efficient
        - only if we can parallelize to a good extent
    ILP - pipelining, OOO execution
    DLP - SIMD
    Task-level parallelism - different tasks

Loosely coupled MP: cores have different address spaces, not shared memory
Tightly coupled MP: cores have the same address space, shared memory

Multithreading: https://www.youtube.com/watch?v=ZpqeeHFWxes&list=PLAwxTw4SYaPkr-vo9gKBTid_BWpWEfuXe&t=405
    Coarse grained - switch threads on an event such as a cache miss
    Fine grained - a different thread every cycle
    Simultaneous MT - instructions from multiple threads at the same time

Consistency and coherence:
    Ordering of memory operations to the same memory location - coherence
    Ordering of memory operations to different memory locations - consistency
    Consistency problems usually arise from OOO execution of instructions
        -- one processor performs operations a, b, and only then sees x from the other processor,
           while the other processor performs x, y and only then sees 'a'
        -- the goal is a global order for memory operations

    Consistency is also a memory model (a contract between the programmer and the machine).

    Ordering memory operations to the same location (coherence) is crucial for implementing memory models.
    Coherence - cache-to-cache transfers can be done
              - invalidation or update are the two options
We can have 4 options for cache coherence:
1. Snooping and update
    Optimization 1: reduce memory writes (memory traffic)
        Problem: when C1 writes to address A1, C1 also writes to main memory and to the cache of C2.
                 Every write accesses main memory.
        Solution: introduce a 'dirty bit'. C1 writes to address A1, and if A1 is present in the caches of C2, C3 then update those copies, BUT DON'T WRITE to memory.
                 Memory writes only happen when the cache line is evicted.
                 The core which performed the most recent write has dirty = 1, meaning it is responsible for writing back to memory.

    Optimization 2: reduce bus writes (bus traffic)
        Problem: C1 keeps writing to address B (assume no one else has B).
                 C2 keeps writing to address C (assume no one else has C).
                 Every write --> results in bus traffic --> C2 will search whether any core has C --> every time.
                 Every write --> results in bus traffic --> C1 will search whether any core has B --> every time.
        Solution: introduce a 'shared bit'. The shared bit is one if the block is present in more than one core (bus communication should happen).
                 If the shared bit is zero then there is no need to put the write on the bus. (So with the shared and dirty bits --> no bus traffic and no memory traffic.)
    Update is bad for burst writes.
    Update is good for producer-consumer sharing.

2. Snooping and invalidation
    Every write into the local cache invalidates the copies in all other caches.
    So the next write to the same block will not result in a bus access
    (because the first write ensured that the other copies are invalid and this is the only cache which has the data).
    Invalidation is good for frequent writes (burst writes) by one core, for the same reason.
    Invalidation is bad when one core writes and another core reads the data (producer and consumer).

Owned state (reduces memory traffic):
    e.g. C1 has some data; C2 reads it and then writes it; later C1 reads it and then writes it.
    Without Owned:
        C1 write    C1-->M
        C2 read     C1-->S  C2-->S, memory is updated (because Shared indicates memory is clean)
        C2 write    C1-->I  C2-->M
        C1 read     C1-->S  C2-->S, memory is updated (because Shared indicates memory is clean)
        C3 read     memory responds, since C1 and C2 are both in S
        None of these memory operations was actually required; the Owned state avoids them.
    Owner: responsible for sending the data to other caches
           responsible for writing back to memory

Exclusive state: the data is present only in this cache and in no other (reduces bus traffic)
    Take the example of Read A followed by Write A:
        in MOSI - I-->S-->M (2 bus transactions: the read miss, then the upgrade)
        in MESI - I-->E-->M (1 bus transaction)
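
The Read A / Write A comparison can be checked with a tiny state-machine sketch. It models a single cache and assumes no other cache ever holds the block, so only the states I, S, E and M appear; the function name `bus_transactions` and the `has_exclusive` flag are made up for this example.

```python
def bus_transactions(ops, has_exclusive):
    """Count bus transactions for one cache, assuming no other sharers exist."""
    state = "I"
    count = 0
    for op in ops:
        if op == "read":
            if state == "I":
                count += 1  # read miss goes on the bus
                # With an E state we learn that nobody else has the block.
                state = "E" if has_exclusive else "S"
        elif op == "write":
            if state == "I":
                count += 1  # write miss goes on the bus
            elif state == "S":
                count += 1  # upgrade: must announce the write on the bus
            # E --> M needs no bus transaction: we already know we are alone.
            state = "M"
    return count, state


without_e = bus_transactions(["read", "write"], has_exclusive=False)
with_e = bus_transactions(["read", "write"], has_exclusive=True)
print(without_e, with_e)
```

Without the Exclusive state the cache cannot tell a private block from a shared one, so the write after the read still costs a bus transaction.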

3. Directory and invalidation
    In snooping the bus is the bottleneck, so it cannot be used for more than about 8 to 12 cores.
    In directory-based protocols each memory block has a directory entry:
        Block A: |dirty bit|core 0|core 1|core 2|core 3|
        core 0 bit: tells whether block A is present in core 0 or not.

    The main advantage of directory-based protocols is that if core C1 accesses A and core C2 accesses B, these two accesses can happen almost independently because there is no shared bus.

4. Directory and update

Where the coherence protocol should be implemented:
    1. At the L2 cache, not L1, because of the inclusion property, i.e. L1 can't hold something which is not in L2.
    2. In a system with multiple multiprocessors: cache coherence should be done in the interconnect (AMBA, etc.).

Cache coherence implementation:
    Snoop based:
    MOSI: every cache line is governed by a state transition diagram
        Invalid:
        Shared:
        Modified:
        Owned: multiple caches may hold the most recent and correct value of a block, and the value in main memory may or may not be correct
        The Shared-to-Modified transition is called an upgrade.
    MESI:
        Invalid: the cached copy is not valid; do not use it
        Modified: I have changed it; no one else has it
        Exclusive: only I have it, and apart from me it is only in main memory
        Shared: I have it, others have it too, and main memory has it too

    The Exclusive (E) state in the MESI protocol says that the cache block is clean and valid (same value as in main memory) and cached in only one cache, whereas the Owned (O) state in the MOSI protocol says that the cached value is dirty and that exactly one cache, the owner, is responsible for it.

    Update instead of invalidation - takes more bus bandwidth

Coherence introduces a new type of cache miss

Coherence misses:
    True sharing: the normal coherence miss (C1 reads A; if C2 then writes A, C1 will miss on A at its next access)
    False sharing: the worst situation: thread 1 accesses 'a' and thread 2 accesses 'b', and the line ping-pongs between Invalid and Shared (note that 'a' and 'b' are in the same cache line)
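
Whether two variables can falsely share comes down to whether their addresses fall in the same cache line. The sketch below assumes 64-byte lines; the helper names `same_cache_line` and `pad_to_line` and the example addresses are invented for illustration.

```python
def same_cache_line(addr_a, addr_b, line_bytes=64):
    """True if both byte addresses fall in the same cache line."""
    return addr_a // line_bytes == addr_b // line_bytes


def pad_to_line(offset, line_bytes=64):
    """Round an offset up to the next cache-line boundary."""
    return (offset + line_bytes - 1) // line_bytes * line_bytes


base = 0x1000
addr_a = base          # 'a', used by thread 1
addr_b = base + 4      # 'b', used by thread 2, packed right after 'a'
packed = same_cache_line(addr_a, addr_b)

# Fix: pad so that 'b' starts on its own cache line.
addr_b_padded = base + pad_to_line(4)
padded = same_cache_line(addr_a, addr_b_padded)

print(packed, padded, hex(addr_b_padded))
```

Padding trades a little memory for the removal of the ping-pong, since the two threads no longer invalidate each other's line.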

Pipelining:
    - Structural hazard - lack of resources
    - RAW - true dependency - data hazard
    - WAR and WAW make OOO difficult
    Forwarding and stalling are used to avoid hazards
    - Branch has only a one-cycle delay slot
    - Branch is resolved in the ID stage itself
    - Load followed by a normal instruction:
        the normal instruction in the EX stage gets the loaded value from the MEM stage through the forwarding II unit.
    - Store followed by a normal instruction
        - no problem
    - Store followed by load
        - no problem
    - Load followed by store
        - no problem

Branch prediction:
    - Static branch prediction - we choose that a branch is either always taken or always not taken
    - Dynamic branch prediction
        1 bit - the prediction bit changes whenever the prediction is wrong
        2 bit - the prediction bits follow a state diagram;
                2 consecutive wrong predictions result in a change of prediction
        BHT - branch history table (holds the history of each branch, 1 bit or 2 bits as above)
        BTB - branch target buffer (holds the branch target address, looked up by the branch PC)
        Initial state: it is better to start from the weak states than from the strong states,
                but in some corner cases like (T,N,T,N) starting from a weak state will always give the wrong prediction
    - History-based predictors
        Used when there is a pattern: T,N,T,N,T,N or TT,NN,TT,NN
        Learns and fills the BHT during the initial iterations;
        from then onwards it will always be right (e.g. if the last outcome was not taken then the next prediction should be taken)

    - Tournament predictors
        An amalgamation of two predictors (P1 and P2), each of which is good for a different branch pattern.
        There is one more meta-predictor which selects between predictor 1 and predictor 2.
        P1 and P2 are trained as above, but the meta-predictor must also be trained.
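
The 2-bit predictor and its T,N,T,N corner case can be reproduced with a saturating counter. The state encoding (0 = strong not-taken, 1 = weak not-taken, 2 = weak taken, 3 = strong taken) and the function name `run_two_bit` are assumptions of this sketch.

```python
def run_two_bit(outcomes, state):
    """Run a 2-bit saturating counter; return the number of correct predictions.

    state: 0 strong not-taken, 1 weak not-taken, 2 weak taken, 3 strong taken.
    """
    correct = 0
    for taken in outcomes:
        predict_taken = state >= 2
        if predict_taken == taken:
            correct += 1
        # Move one step towards the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct


alternating = [True, False, True, False]
from_weak = run_two_bit(alternating, state=1)
from_strong = run_two_bit(alternating, state=0)

# A loop branch: taken 9 times, then falls through once.
loop = [True] * 9 + [False]
loop_correct = run_two_bit(loop, state=1)

print(from_weak, from_strong, loop_correct)
```

Starting from the weak not-taken state, the alternating pattern flips the prediction just in time to be wrong every time, while the loop branch is mispredicted only on its first and last iterations.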

Virtual memory:
    MIPS provides a 32-bit address space, i.e. load word and store word can take a 32-bit address,
    so the programmer has 4 GB to work with.
    But if your RAM is just 1 GB, then YOU NEED VIRTUAL MEMORY.

    Each process has its own virtual memory and corresponding page tables.
    If we have 4 GB of virtual memory (e.g. MIPS), then to map each virtual address individually we would need a page table with 4G entries.
    To overcome this, memory is divided into blocks called pages.
    A virtual memory page is mapped to a physical memory page through lookup tables called page tables.
    Virtual memory provides indirection: it maps the program address to the physical (RAM) address.

    Virtual address [virtual page number + page offset] --> page table --> [physical page number + page offset]
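
A minimal sketch of the translation step, assuming 4 KB pages and a one-level page table held in a Python dict. The names `PAGE_SIZE`, `translate` and the table contents are invented for this example.

```python
PAGE_SIZE = 4096  # assumed 4 KB pages


def translate(vaddr, page_table):
    """Map a virtual address to a physical address; the page offset is unchanged."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn not in page_table:
        # No mapping for this virtual page: this is where a page fault occurs.
        raise KeyError(f"page fault on virtual page {vpn}")
    return page_table[vpn] * PAGE_SIZE + offset


page_table = {0: 5, 1: 2}  # virtual page number -> physical page number
paddr = translate(0x1234, page_table)
print(hex(paddr))
```

Only the page number is translated; the low 12 bits pass through untouched, which is why the table needs one entry per page rather than one per address.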
    One-level page table: for a 64-bit machine we need (2^64)/(page_size) entries in the page table.
            Even if we are running only 2 applications which need 2 pages, we still need all of these entries, which is a waste of memory,
            so we go for a two-level page table.
    Two-level page table: virtual address = outer_level + inner_level + page_offset
            If an inner page table is not used, we do not allocate it.
            In short, we allocate only as much as we need.
            Multi-level page tables reduce the page table size.
    Page size: if the page size is too small, the number of page table entries increases;
               if the page size is too big, it suffers from internal fragmentation.
               Internal fragmentation: say a process needs 10 units of memory and your page size is 9; you will allocate 2 pages, and most of the memory in the second page will be wasted.
    Translation lookaside buffer (TLB): reading from the page table is time consuming,
            so we need a cache which holds the VA-->PA translations; this is the TLB.
            It is different from a normal cache because a normal cache holds data whereas this holds just addresses, so its size is much smaller.
            It is usually a fully associative or set-associative cache (because it is so small).

    Page fault - the equivalent of a cache miss, with page == block

Clock domain crossing verification:
    A problem arises when the destination frequency is slower than the source frequency.
    The output of the source flop should cover at least 3 edges of the destination clock (~1.5x the clock period); this is called the three-edge rule.
    Open-loop solution - no acknowledge
                       - fast
                       - but need to maintain syn_data_pulse_width >= 1.5x dst_clk_period
    Closed-loop solution - with acknowledge
                         - slow
            A second potential solution to this problem is to send an enabling control signal, synchronize it into the new clock domain and then pass the synchronized signal back through another synchronizer to the sending clock domain as an acknowledge signal.
    Binary to Gray: assign gray = (bin>>1) ^ bin;
    Gray to binary: bin[0] = gray[3] ^ gray[2] ^ gray[1] ^ gray[0];
                    bin[1] = gray[3] ^ gray[2] ^ gray[1];
                    bin[2] = gray[3] ^ gray[2];
                    bin[3] = gray[3];
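
The two conversions above can be checked quickly in software. This Python sketch mirrors the Verilog expressions (the function names `bin_to_gray` and `gray_to_bin` are chosen here) and verifies the property that makes Gray codes useful for multi-bit crossings: consecutive values differ in exactly one bit.

```python
def bin_to_gray(b):
    """Same as the Verilog expression: gray = (bin >> 1) ^ bin."""
    return (b >> 1) ^ b


def gray_to_bin(g):
    """Each binary bit is the XOR of all Gray bits at that position and above."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b


codes = [bin_to_gray(i) for i in range(16)]
print([format(c, "04b") for c in codes])

# Number of bits that change between consecutive 4-bit Gray codes.
bit_changes = [bin(codes[i] ^ codes[i + 1]).count("1") for i in range(15)]
round_trip = [gray_to_bin(c) for c in codes]
```

Because only one bit changes per increment, a synchronizer that samples in the middle of a transition sees either the old or the new pointer value, never an unrelated one.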
    Multi-bit crossing:
        consolidate multi-bit signals
        encode multi-bit values as Gray codes
        use a FIFO
    https://www.youtube.com/watch?v=eojd6wDJWXg&list=PL589BOiAVX7YrBMuS6TdVjvTKl-8qNj0L
    Multiple cores, buses, bridges - have different frequencies
    Metastability issues (neither one nor zero): caused by timing violations
    We should use synchronizers or an asynchronous FIFO.
    Synchronizers alone are not good enough because of metastability issues, hence we go for a FIFO.
    Even in a FIFO we need synchronizers to compare the read and write pointers.
    Synchronizers are nothing but a series of D flip-flops, e.g.:
        D(clk1)-->D(clk2)-->D(clk2) - double synchronizer - the probability of metastability is low
        D(clk1)-->D(clk2)-->D(clk2)-->D(clk2) - triple synchronizer - the probability of metastability is even lower
    Clock jitter can be used to model metastability (global clock jitter or local clock jitter).

Reset domain crossing
    Asynchronous resets require reset synchronizer circuits (because if reset changes close to the posedge of clk it causes metastability).
    The reset signal also has a tree, called the reset tree, similar to the clock tree.
    A reset going from domain 1 to domain 2 should be synchronous to domain 2.
    This is done using synchronizing flops.
    The synchronizer circuit results in a reg-to-reg path between the two synchronizer flops.
    Recovery time: setup time of this reg-to-reg path
    Removal time: hold time of this reg-to-reg path
    https://www.youtube.com/watch?v=KltPjkYboto

Constrained random verification:
    rand: values can repeat
    randc: a value cannot repeat until all values have been covered

    Create a class with rand variables.
    Declare class handles or objects.
    Randomize the object using the function randomize();
    A constraint is a block used to set constraints on rand variables.
    The 'with' and 'soft' keywords can also be used.

Basics of OOP:
    Encapsulation - data + functions in a container called a class
        class data members - properties
        class tasks and functions - methods

    Objects are created dynamically with the built-in function new();
    memory is allocated at run time.
    new() allocates memory for the class object.

    Inheritance: provides the power to change / reuse code
        https://www.youtube.com/watch?v=C0-2AcnRwAs
        Inheritance helps in reusing code,
        i.e. say you have a constraint 'abc' in the main class or base class.
        What if you want to edit that constraint? You can use inheritance:

        class ext_xyz extends base_class;
            new_constraint;
        endclass

        base_class bas_obj;
        ext_xyz ext_obj;

        gen.bas_obj = ext_obj; // now bas_obj has the extended constraints

        This is called polymorphism.

        If you want to change a function using the extended class then you need to define the function as VIRTUAL in the base class. This is also called overriding.
        https://www.youtube.com/watch?v=dX2ojPL0Y5M

    Operator overloading (e.g. addition of objects using the same '+')
    Overriding functions
    Polymorphism (same function name, different meaning)
    Inheritance (the derived class has the same features as the base class)
        When you call the constructor of a child class, it also creates memory for the parent class,
        where the parent class members can be accessed by super.var and the child's by this.var

    Calling the child constructor also calls the parent constructor - chained constructors

    Copy constructors or explicit copy - perform a deep copy
    Constructors: constructors are defined using the class name
        Constructors can be overloaded (function overloading, just like operator overloading) by changing the number of parameters
        Constructors without parameters are default constructors
    Shallow copy: only the class handle is copied, so both handles refer to the same memory;
        if you have done a shallow copy, a.var and b.var refer to the same location.
    Deep copy: the object contents are copied too, so a.var and b.var are different.
    https://www.youtube.com/watch?v=G08IvlLxtAY
    Data hiding (this, local and super):
        this: used to refer to the current object's own variables
        super: used to refer to the parent's variables
        local: if this is used in the parent class then the child class cannot see this variable (hidden)
    Abstract classes and pure virtual functions
        An abstract class is also called a virtual class, i.e. the class is declared with the keyword virtual.
        Virtual classes cannot be instantiated; they can only be extended.
        i.e.    virtual class abc; endclass
                abc obj = new();   XXX not valid

        To declare virtual functions without a body inside a virtual class we should use pure virtual functions.

    https://www.youtube.com/watch?v=Mb1y-L-ZjD0&list=PL792F8AED9E6F3E9F
    TO BE READ:
        public, private classes
        friend functions
        static, const declarations

    static function() - the function is static
    function static (); - the function's parameters are static

    Scope resolution operator ::
        It is used to define a function (or method) outside the class,
        but the method has to be declared in the class.
    Friend function: a function which can access the class variables and methods even though it is defined outside the class. But it should be declared as a friend function inside the class.
    Access modifiers:
        Public: accessible through a class object in the main function, in derived classes, and inside the class itself
        Protected: accessible in derived classes and inside the class, BUT NOT through a class object in the main function
        Private: only the class in which the variable is declared can see it

SYSTEM VERILOG:
    cover, assert, sequence, property
    https://www.youtube.com/watch?v=siXo9_fCt7k&index=3&list=PL589BOiAVX7YFqkUsZjcoLVxfkkInhXh3
    Sequence delays: ##1, ##2, ##[1:2]
    Consecutive repetition [*2]: start,start,x,x,x,x,x,x (a range such as [*3:5] is also allowed)
    Non-consecutive repetition [=2]: start,z,z,z,z,start,z,z,z,z (the match may end at the second start or on any later cycle before another start)
    Goto non-consecutive repetition [->2]: start,z,z,z,z,start (the match ends exactly at the second start)
    Overlapping sequence implication operator |-> : just like p(a/b); ensures that b holds in the same cycle in which a completes
    Non-overlapping sequence implication operator |=> : just like p(a/b); ensures that b holds one cycle after 'a' completes
    We can also have named sequences and properties: sequence xyz; <expression>; endsequence, then use xyz everywhere
    Every assertion has an action block which is executed immediately, e.g. to report an error or warning
    $onehot, $isunknown, $rose, $fell, $stable, $past, throughout

Assertion patterns:
    How and where to create assertions
    - $rose(req) |-> (req throughout (~grant[*0:8] ##1 grant ##1 start)) - template patterns that can be used by others
    - assert property (@(posedge clk) disable iff (~rst_n) !((^var_name) === 1'bx)); - X detection
    - assert property (@(posedge clk) disable iff (~rst_n) !(var_name <= 7)); - valid range
    - assert property (@(posedge clk) disable iff (~rst_n) ($rose(start) |=> (!start throughout done[->1]))); - bounded window

For state machines, write auxiliary state machines in the tb and write properties on them.

Case study of the AMBA APB bus: https://www.youtube.com/watch?v=Uq2wBqV2ARk&list=PLE995EF1D8630E34F&index=15
    AHB - high-performance bus connecting the processor to memory
    APB - low-performance bus connecting peripherals like UART, timer, etc.
    AMBA bridge - communicator between AHB and APB
    The APB bridge is the master for the peripheral devices.
    Each peripheral device (slave) has its own select signal and a common enable signal, both driven by the bridge.
    ACE - AXI Coherency Extensions
    AXI - Advanced eXtensible Interface
    CHI - Coherent Hub Interface
    CCIX - Cache Coherent Interconnect for Accelerators
    Transactions are a higher level of abstraction provided to drivers, e.g.:
        Test        - how many reads, writes                          - highest level of abstraction
        Generator   - generates random reads/writes                   - one level down
        Transaction - generates the control bits to perform read/write - one level down
        Driver      - provides bits to the DUT                        - lowest level of abstraction

Formal property checking
    FV is required because simulation techniques are not exhaustive.
    FV requires no stimulus and no generator logic.
    In FV we do not drive a set of legal inputs; the design is verified by mathematical techniques and proofs.
    For each new unique initial state the tool calculates the next unique states and flags an assertion failure if there is a bug.
    FV - suffers from state explosion - good for non-sequential blocks, non-math transformations (non-pipelined buses)
    Simulation verification - suffers from time explosion

    https://www.youtube.com/watch?v=Es-rRRI7Bq8&index=1&list=PL589BOiAVX7YFqkUsZjcoLVxfkkInhXh3

    Checkers provide observability.
    Stimulus provides controllability.
    Assertions improve observability (bugs do not need to travel all the way to the output).
    Coverage is the metric for controllability.

    Low-level assertions - implementation focused - bound in the RTL - design engineers - white-box assertions
    High-level assertions - architecture focused - bound in the tb - verification engineers - black-box assertions

Linear tb: normal testbench; the simulator has to create a lot of events; bad
Linear random tb: good, but covers unnecessary cases; no coverage
Directed verification: the engineer writes different test cases in test case files and uses those files,
        but writing all test cases for a big design consumes a lot of time.
        What if we hit scenarios which we never thought of?
Constrained random verification: the tb generates test cases automatically
        less simulation time
        does not by itself tell us how well the design has been verified

Code coverage: how well your design code is covered
        line coverage, block coverage, branch coverage,
        conditional coverage, FSM coverage (state, transition),
        path coverage, toggle coverage
        Code coverage should always be 100%, but that is not sufficient to verify the design.
        It checks how thoroughly the design code is exercised by the testbenches.

Functional coverage (CIT):
        Item FC: 'a' should take values from 4 to 10 and 'b' should take values from 0 to 1
        Cross FC: item_a, item_b (all values of 'a' should occur when b=0, and all values of 'a' should occur when b=1)
        Transition FC: 'a' should take values only in this order: 4-5-6-7-8-9-10
        Assertion-based FC: by using cover, property, assert
        Checks operational features, like push, pop, full, empty in a FIFO
Note: there are two kinds of monitors, active and passive. An active one can drive the DUT but a passive one can't.

Verification plan (what to verify and how to verify):
        1. Build the verification plan (list of test cases, list of features to be tested)
        2. Stimulus generation plan
        3. Coverage plan
        4. Checker plan
        5. Integrate and test with random stimulus
        6. Check with constrained random stimulus
        7. Analyze coverage
        8. Write directed tests to fill coverage holes
    Test scenarios:
        1. What if the DUT receives an invalid input? (giving an address of 1111 for an 8-block memory)
        2. If we start from an invalid state, can the design recover? (giving invalid opcodes and observing the recovery)
        3. If valid inputs are given, does the DUT ever go to an invalid state?
        4. Does it go to a valid initial state when we power on? (correct initialization)

REMEMBER THE TB BLOCK DIAGRAM

User-defined datatypes: class, struct, union, enumeration, typedef
    enum {a=3, b=7, c} alphabet;   // c gets the value 8
    d = int'(c + 1); // using casting
    Structures can be packed or unpacked.
    Unions - memory is allocated for only one member (the one with the maximum size).
    typedef bit [3:0] nibble;

Arrays
1. Static Array: no run-time change of size
e.g. fixed array
packed and unpacked
memory is allocated at compile time
cannot change the size at run time
2. Dynamic Array: we can change the size at run time
a. Dynamic array: can change the size at run time
not good for sparse data elements
a dynamic array can only be unpacked
memory is allocated at run time
a dynamic array has methods like new[] and delete() to allocate or deallocate memory
b. Associative array: https://www.youtube.com/watch?v=qTZJLJ3Gm6Q
https://www.youtube.com/watch?v=Bts4c-sPOiE

The array index can have different datatypes, like strings, etc.

It uses hash functions to map to regular indexes.

It is useful when data is sparse

we should declare the datatype of the data and also the datatype of the index

the datatype of the index can be * (wildcard: any integral type), i.e. int a [*];

c. Queue: declared as
int a [$];
very good for modeling LIFOs and FIFOs
dynamically allocates and deallocates memory

push_front, pop_front, push_back, etc. methods can be used.

We can also declare arrays of queues, queues of queues, dynamic arrays of queues, associative arrays of queues (see testbench.in)
parameter is an elaboration-time constant

const is a run-time constant


The Verilog assign statement is a unidirectional assignment.
To model a bidirectional short-circuit connection it is necessary
to use the alias statement.
Static and automatic variables:
https://www.youtube.com/watch?v=omkaX-0hoYA
Static variables:
a static variable is initialized only once. This is the
default lifetime in a module.
for (int i = 0; i < n; i++) begin
static int j = 0;
j = j + 1;
$display("%0d", j);
end
output will be 1, 2, 3, 4, ..., n
variable j is not re-initialized when the loop
body is entered again; it holds its value
Automatic variables:
for (int i = 0; i < n; i++) begin
automatic int j = 0;
j = j + 1;
$display("%0d", j);
end
output will be 1, 1, 1, 1, ... (n times)
here automatic int j = 0; is initialized each
time the loop body is entered
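
The same static versus automatic behaviour exists for local variables in C, which makes for a quick sanity check of the idea. This is only a C analogy, not SystemVerilog; the function names are hypothetical.

```c
/* Static local: initialized once, keeps its value between calls. */
int count_static(void) {
    static int j = 0;
    j = j + 1;
    return j;
}

/* Automatic local: re-initialized to 0 on every call. */
int count_automatic(void) {
    int j = 0;
    j = j + 1;
    return j;
}
```

Calling `count_static()` repeatedly yields 1, 2, 3, ... while `count_automatic()` yields 1 every time, matching the two outputs described above.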
Casting:
Static casting: the ' (cast) operator
unsigned'(-10);
the cast is resolved at compile time, with no run-time check

Dynamic casting: $cast(dest, src);

the cast is checked at run time
$cast is used with polymorphism, e.g. to assign a base-class handle
back to a derived-class handle

An interface can be used to generate the clk, monitor signals and also
act as a checker!!
best interface explanation: https://www.youtube.com/watch?v=jw_FgGUBJ4w&index=4&list=PL589BOiAVX7ZDxHpeEW_jTReZl_8oMwd-
to connect an interface to a class we can't use the interface as an
object ..SAD!!!
so we do this
class abc;
virtual interface_var intf; // create a virtual interface handle inside the class
function new(virtual interface_var intf); // when constructing the class
// object, pass the interface to be connected to this class
this.intf = intf; // connect the interface passed in to the virtual
// interface of this class
endfunction

task ....endtask // now you can access the interface variables

endclass

VLSI Design:
SRAM, DRAM and memory model basics: https://www.youtube.com/watch?v=7k_3EAkKfak&list=PLAwxTw4SYaPn79fsplIuZG34KwbkYSedj&t=35
DRAM needs constant refreshing. Reads are destructive (we need to write back once we read
something)

Miscellaneous:

Coherent hub interface

https://www.youtube.com/watch?v=iOh_TccRgm4
Stimulus
Checking coherency of an SoC by giving stimulus, transactions
Active mode - before RTL
Passive mode - after RTL
Interconnect monitors, scoreboards
virtual function, pure virtual function in C++.
A question about how to write a C program to judge whether a machine is
big-endian or little-endian
About the refresh in DDR2
Fibonacci series
Then coverages
The von Neumann bottleneck is the idea that computer system throughput is
limited because processors can work faster than data can be
transferred to and from memory.
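
For the endianness question in the list above, one common approach is to store a known multi-byte value and look at its lowest-addressed byte. A minimal sketch; the function name `is_little_endian` is an assumption, and machines with mixed byte orders are not considered.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian machine, 0 on a big-endian one. */
int is_little_endian(void) {
    uint32_t probe = 1;
    unsigned char first_byte;
    /* Copy the lowest-addressed byte of the 32-bit value. */
    memcpy(&first_byte, &probe, 1);
    /* Little-endian stores the least significant byte first. */
    return first_byte == 1;
}
```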

Bit Manipulation techniques:

n & (n-1) == 0 if n is a power of 2 (n > 0)
while (n) { n = n & (n-1); cnt++; } // cnt = number of 1's
(n & (1 << bit_number)) != 0; // ith bit is set or not
for j = 0 : num_of_bits-1, if (rainbow_number & (1 << j)) print(rainbow(j));
x & (-x) == x ^ (x & (x-1)) == returns the rightmost 1 in the binary
representation of x.
to get the 2's complement, come from the right side till you get a one, then flip all the bits
to the left of it
to get x-1, find the rightmost 1, then make that one zero and flip all the
bits to its right
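
The tricks above can be written as small C helpers. A minimal sketch on 32-bit unsigned values; the function names are made up for illustration.

```c
#include <stdint.h>

/* n & (n-1) clears the lowest set bit; a power of two has exactly one. */
int is_power_of_two(uint32_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

/* Count set bits by clearing the lowest one until none remain. */
int count_ones(uint32_t n) {
    int cnt = 0;
    while (n) {
        n &= n - 1;
        cnt++;
    }
    return cnt;
}

/* Non-zero if bit number `bit` (0 = LSB) is set; bit must be < 32. */
int bit_is_set(uint32_t n, unsigned bit) {
    return (n & (UINT32_C(1) << bit)) != 0;
}

/* Isolate the rightmost 1; 0 - x is the two's complement of x. */
uint32_t lowest_set_bit(uint32_t x) {
    return x & (uint32_t)(0u - x);
}
```

Note the explicit parentheses in `bit_is_set`: in C, `==` and `!=` bind tighter than `&`, so the unparenthesized form compares first and masks second.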

UVM:
One of the great advantages of UVM is that it’s very easy to replace
components without
having to modify the entire testbench, but this is also due to the concept
of classes and objects
from SystemVerilog.

UVM has many classes which we can derive from to build our own,
e.g. uvm_component, uvm_driver, uvm_agent, uvm_sequencer, etc.

UVM Phases: a brief explanation of each phase follows:

The build phase is used to construct the components of
the hierarchy.
For example, the build phase of the agent class
will construct the classes for the monitor, for the sequencer and for the driver.
The connect phase is used to connect the different sub
components of a class.
Using the same example, the connect phase of
the agent would connect the driver to the sequencer and it would connect the
monitor to an external port.
The run phase is the main phase of the execution;
this is where the actual code of
a simulation will execute.
And at last, the report phase is the phase used to
display the results of the
simulation.
UVM Macros:

Note:

1) The interface is a module-like construct that holds all the signals of the DUT.
The monitor, the driver and the DUT are all going to be connected
to it.

2) Registering the interface in the UVM resource database. This is necessary
in order to pass this
interface to all other classes that will be instantiated in the
testbench.
It will be registered by using the
uvm_resource_db method, and every block
that will use the same interface will need to get it by calling
the same method.

3) A transaction is a class object, usually extended from
uvm_transaction or uvm_sequence_item classes,
which includes the information needed to model the communication
between two or more components.

4) After a basic transaction has been specified, the verification
environment will need to generate a collection
of them and get them ready to be sent to the driver. This is a
job for the sequence.
Sequences are an ordered collection of transactions; they shape
transactions to our needs and generate
as many as we want. This means if we want to test just a specific
set of addresses in a master-slave
communication topology, we could restrict the randomization to
that set of values instead of wasting
simulation time on invalid values.

5) To demonstrate the reuse capabilities of UVM, let’s imagine a
situation where we would want to test a
similar adder with a third input, a port named 'inc'.
Instead of rewriting a different transaction to include a
variable for this port, it would be easier
just to extend the previous class to support the new input.

Packages and macros https://www.youtube.com/watch?v=4HorsJ9Sf8s&index=5&list=PL589BOiAVX7ZDxHpeEW_jTReZl_8oMwd-
`define add a+b // macro
`include "filename.svh" // like a header file; includes classes

package ALU_pkg;
`include "class_add.svh"
`include "class_sub.svh"
`include "class_div.svh"
`define mul a*b
endpackage

import ALU_pkg::*;

module tb;
// then work on it
endmodule

Similarly, UVM has its own package containing many classes

UVM components:
UVM is a large set of classes. Many of these classes are dependent on one
another.
e.g. uvm_test is derived from uvm_component

Each component class has these default virtual methods inside
the class definition
1. function void build_phase(uvm_phase phase); when this is called it constructs
the child components; build phases run top-down through the hierarchy.
2. function void connect_phase(uvm_phase phase); when this is called it connects
all the classes
3. task run_phase(uvm_phase phase); it is the only task, as it can consume
time (delays); when this is called it runs the test (like an initial block)
4. function void report_phase(uvm_phase phase); reports the data

vsim +UVM_TESTNAME=dog
vsim +UVM_TESTNAME=cat we can call a different test from the command line
how to do it:
class dog extends uvm_test;
`uvm_component_utils(dog) // tell the UVM factory that
// there is a test called dog, which can be built by name from the command-line argument

function new(string name, uvm_component parent);
super.new(name, parent);
endfunction

task run_phase(uvm_phase phase); endtask

endclass

When we call a UVM test,
its build phase builds the environment
the environment's build phase builds the agents
Then the UVM run_phase will be executed

Possible functional coverages:

1. All possible values of variables
2. Value A takes all zeros and value B takes all ones
3. Bins can be used to check a range, e.g. has x
taken values in the range 0 to FF?
4. bins bin_example[number of bins] = {[1:2],[2:6]};
converts to {1 2 2 3 4 5 6}
5. Transition coverage: rst => multiply and
multiply => rst
6. Transition coverage for pairs {[1:2]=>[2:6]}: 1
followed by 2, 1 followed by 3, ... all transitions will be checked
7. Consecutive repetition [* n]: n operations in a row,
[* n:m]: n to m operations in a row
8. Non-consecutive repetition [=n:m] (still matches if other
values follow the last repetition before the next transition)
e.g. rst=>mul[=2]=>rst :
rst=>mul=>add=>mul=>and=>or=>rst (no need to end with mul followed by rst)
9. Goto non-consecutive repetition [-> n:m]
e.g. rst=>mul[->2]=>rst :
rst=>mul=>add=>MUL=>rst (the last mul must be followed directly by rst)
10. Cross coverage:
A=00,01,02 and B=11,12,13 and oper
add,and,mul,sub
to verify all possible combinations
cross a,b,op;
11. binsof operator
in the above example, bins abeing00 = binsof(A00);
puts {a=00; b=11,12,13; op=mul,add, etc.} all into one bin
we can also do logical operations on
these binsof expressions to generate new bins
e.g.
binsof(somebins) && binsof(somebins)
12. ignore_bins can be used to ignore bins which we
don't need.
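
To get a feel for how many bins the cross in item 10 creates, the sketch below models it in C as a 3 x 3 x 4 table of hit flags (36 cross bins). It is an illustrative model, not simulator code; the struct and function names are assumptions, and the indexes are positions in each value list rather than the values themselves.

```c
#include <string.h>

#define N_A  3   /* A = 00, 01, 02 */
#define N_B  3   /* B = 11, 12, 13 */
#define N_OP 4   /* op = add, and, mul, sub */

/* One hit flag per cross bin: N_A * N_B * N_OP = 36 bins. */
struct cross_cov {
    unsigned char hit[N_A][N_B][N_OP];
};

/* Clear all hit flags. */
void cross_reset(struct cross_cov *cov) {
    memset(cov->hit, 0, sizeof cov->hit);
}

/* Record one sample; each index must be within its list's range. */
void cross_sample(struct cross_cov *cov, int a_idx, int b_idx, int op_idx) {
    cov->hit[a_idx][b_idx][op_idx] = 1;
}

/* Number of distinct cross bins hit so far. */
int cross_bins_hit(const struct cross_cov *cov) {
    int count = 0;
    for (int a = 0; a < N_A; a++)
        for (int b = 0; b < N_B; b++)
            for (int op = 0; op < N_OP; op++)
                count += cov->hit[a][b][op];
    return count;
}

/* Total number of cross bins. */
int cross_bins_total(void) {
    return N_A * N_B * N_OP;
}
```

Cross coverage as a percentage is then `100 * cross_bins_hit / cross_bins_total`; repeated samples of the same combination do not raise it.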

FIFO Depth Calculation:

The logic in fixing the size of the FIFO is to find
the number of data items which are not read in the
period in which the writing process is done. In other
words, the FIFO depth will be equal to the number
of data items that are left without reading.

https://hardwaregeeksblog.files.wordpress.com/2016/12/fifodepthcalculationmadeeasy2.pdf
Depth = burst size - (total time to write the burst / time to read a
single data item)
For the worst-case scenario, the difference between the write and
read data rates should be at its maximum
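
The depth formula above can be turned into a small calculation. A minimal sketch under simplifying assumptions that are mine, not the source's: one item is written on every write clock for the whole burst, one item is read on every read clock, and there are no idle cycles. The function name and parameters are hypothetical.

```c
#include <stdint.h>

/*
 * Depth = burst length - items read while the burst is being written.
 * Time to write the burst = burst_len / f_write_hz, and one item is
 * read every 1 / f_read_hz, so items read = burst_len * f_read_hz / f_write_hz.
 * Returns 0 when the reader is at least as fast as the writer.
 * The product burst_len * f_read_hz must fit in 64 bits.
 */
uint64_t fifo_depth(uint64_t burst_len, uint64_t f_write_hz, uint64_t f_read_hz) {
    if (f_read_hz >= f_write_hz) {
        return 0;
    }
    /* Round the items read down so the depth is never underestimated. */
    uint64_t items_read = (burst_len * f_read_hz) / f_write_hz;
    return burst_len - items_read;
}
```

For example, a burst of 120 items written at 80 MHz and read at 50 MHz takes 1.5 us to write, during which 75 items are read, leaving a depth of 45.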

PERL:
https://www.youtube.com/watch?v=WEghIXs8F6c
scalars, arrays and hashes
scalars: my $var_name = 'Derek';
"" === qq{}
say "5+4=", 5+4;
rand, log, sqrt, int, hex, oct, exp

eq ne lt gt ge: if ('a' eq 'b') string comparisons

C language break == last
C continue == next
scanf ==> $var_name = <STDIN>;
switch ==> given ($age) {
when ($age < 18) { say "drive"; continue; }
when ($age > 21) { say "hi"; }
default { say "nothing special"; }
}

Ask Vivek:
Accessing multi-dimensional array elements
About clock domain crossing and reset domain crossing
Can we do clk gating in Verilog?
brief about MSI, MOSI, MESI, MESIF
queues and FIFOs? Where will they be used in testbenches?
Given read and write freq, how to calculate FIFO depth?
When were we stalling in MIPS?

Projects:
MIPS is a RISC, Harvard architecture

Points in book:
A ring oscillator is one which has an odd number of NOT gates
C_out = G + P*C_in for a carry look-ahead adder (G = generate, P = propagate)
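
The carry equation C = G + P*C from the note above can be checked with a small C model of a 4-bit adder. This is a sketch with hypothetical function names. It evaluates the recurrence bit by bit in a loop, whereas real look-ahead hardware expands the same equations so that all carries are computed in parallel.

```c
#include <stdint.h>

/* Returns the carries c0..c4 of a 4-bit add, packed into bits 0..4. */
uint8_t cla_carries(uint8_t a, uint8_t b, uint8_t carry_in) {
    uint8_t g = a & b & 0x0F;        /* generate:  both inputs are 1 */
    uint8_t p = (a ^ b) & 0x0F;      /* propagate: exactly one input is 1 */
    uint8_t carries = carry_in & 1;  /* c0 */
    for (int i = 0; i < 4; i++) {
        uint8_t c_i  = (carries >> i) & 1;
        uint8_t g_i  = (g >> i) & 1;
        uint8_t p_i  = (p >> i) & 1;
        uint8_t next = g_i | (p_i & c_i);   /* C = G + P*C */
        carries |= (uint8_t)(next << (i + 1));
    }
    return carries;
}

/* 5-bit result: sum in bits 0..3, carry out in bit 4. */
uint8_t cla_add4(uint8_t a, uint8_t b, uint8_t carry_in) {
    uint8_t carries = cla_carries(a, b, carry_in);
    uint8_t p = (a ^ b) & 0x0F;
    uint8_t sum = (p ^ carries) & 0x0F;   /* sum_i = p_i XOR c_i */
    return (uint8_t)(sum | (carries & 0x10));
}
```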

No overflow and underflow in m_pot

Ensuring reset of m_pot after every image
WE and RE are not both one at the same time in memory
Whether it is resetting correctly or not
stored output spikes and compared with golden output spikes
calculated accuracy
Control logic verif: random delayed done signals
Calculating the accuracy based on spike count

Note:
test calls the constructor of the environment
environment calls the constructors of the driver, monitor and scoreboards
driver calls the constructors of the covergroup, stimulus

http://www.sunburst-design.com/papers/CummingsSNUG2008Boston_CDC.pdf

Associative Array methods:


size(),num(),first(index),last(index),exists(index),delete()

Queue methods:
push_back(),push_front,pop_front,pop_back,delete,insert,size

Streaming operators: packed = {>>{unpacked_array}};

{ << 4 { 8'b0011_0101 }} // 4-bit slices taken from right to left ==> 8'b0101_0011
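
For a single byte, streaming 4-bit slices from right to left amounts to swapping the two nibbles. A minimal C model of that one case; the function name is made up, and this is not a general streaming operator.

```c
#include <stdint.h>

/* Model of {<<4{byte}}: reverse the order of the 4-bit slices in one byte. */
uint8_t stream_reverse_nibbles(uint8_t value) {
    /* The cast drops the bits shifted above bit 7. */
    return (uint8_t)((value << 4) | (value >> 4));
}
```

With the value from the note, 8'b0011_0101 (0x35) becomes 8'b0101_0011 (0x53).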

We can declare a struct inside a class and randomize it: randomize(obj.struct_name);

Constraints can be changed, e.g. in (var < min_value) the min value can be changed

pre/post randomization functions are called when we call randomize(); we can use
these to initialize before, and assign after, randomization
randomize() is virtual by default, so it always calls pre_randomize() and post_randomize()
of the child class (similarly, constraints are virtual)
obj.rand_mode(0) can be used to disable a random variable
we can randomize non-random variables by passing them as arguments to randomize()
The built-in method randomize() is not only used for randomization, it can also be used as a
checker. When the randomize() method is called by passing null,
it behaves as a checker instead of a random generator. It evaluates
all the constraints and returns the status.
obj.randomize() with { Var == 50; } // inline constraints
Global constraints: constraints between variables of different objects
If a constraint block is declared as static, then constraint_mode on that block
will affect all the instances of that class

set membership operator: 'inside'

constraint range { Var inside {0,1,[50:60],
[90:100]}; }
constraint range { !( Var inside {0,1,5,6}); }

Weighted distribution:
Var dist { 10 := 1, 20 := 2, 30 := 2 }; Var will be
10 for 1/5 of the draws, 20 for 2/5 and 30 for 2/5
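
The 1/5, 2/5, 2/5 split comes from dividing each weight by the sum of all weights. The C sketch below shows that arithmetic and one simple way to map a number in [0, total) to an item. It models the idea only, says nothing about how a simulator's constraint solver works, and uses hypothetical function names.

```c
/* Sum of all weights; each item's probability is weight / total. */
int dist_total_weight(const int *weights, int n) {
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += weights[i];
    }
    return total;
}

/*
 * Map roll (0 <= roll < total) to the index of the item whose cumulative
 * weight range contains it. Returns -1 if roll is out of range.
 */
int dist_pick(const int *weights, int n, int roll) {
    if (roll < 0) {
        return -1;
    }
    for (int i = 0; i < n; i++) {
        if (roll < weights[i]) {
            return i;
        }
        roll -= weights[i];
    }
    return -1;
}
```

With weights {1, 2, 2}, roll 0 selects the first item (10), rolls 1 and 2 select the second (20), and rolls 3 and 4 select the third (30).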

Implication:
constraint c { (a == 0) -> (b == 1); }
Constraints can have foreach loops (for constraining each element of an array) and
function calls
randcase can provide probability distributions
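
The implication constraint above, (a == 0) -> (b == 1), is logically the same as !(a == 0) || (b == 1): it only restricts b when a is 0. A tiny C check of that truth table, with a hypothetical function name:

```c
/* (a == 0) -> (b == 1) holds unless a is 0 and b is not 1. */
int implication_holds(int a, int b) {
    return !(a == 0) || (b == 1);
}
```

So a = 0, b = 0 is the only kind of combination the constraint rejects; any value of b is allowed when a is non-zero.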

implicit bins,
explicit bins, bins bin_name[7:0]=var_name;
default bins,
ignore_bins,
illegal_bins
transition bins 3=>4=>5=>6, bins trans[] = (3,4=>5,6);
wildcard bins: wildcard bins trans = (2'b0X => 2'b1X);
bins can be arrays, e.g. bins bin_name[4]={[0:7]}; (creates 4 bins)
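
For the array-of-bins example above, 8 values spread over 4 bins gives 2 values per bin. The C sketch below maps a value to its bin index. It assumes the values are split evenly with any remainder going to the last bin, which is my understanding of the rule and should be checked against the LRM; the function name is hypothetical.

```c
/*
 * Bin index for `value` when the range [lo, hi] is split into n_bins bins.
 * Returns -1 if value is outside the range, n_bins < 1, or the range holds
 * fewer than n_bins values.
 */
int fixed_bin_index(int value, int lo, int hi, int n_bins) {
    int count = hi - lo + 1;
    if (n_bins < 1 || count < n_bins || value < lo || value > hi) {
        return -1;
    }
    int per_bin = count / n_bins;
    int idx = (value - lo) / per_bin;
    /* Leftover values (when count is not a multiple of n_bins) go to the last bin. */
    return idx < n_bins ? idx : n_bins - 1;
}
```

For bins bin_name[4] = {[0:7]}, values 0 and 1 land in bin 0, 2 and 3 in bin 1, 4 and 5 in bin 2, and 6 and 7 in bin 3.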

:: is used to access static properties of classes
