Computer History
Eckert and Mauchly
• 1st working electronic computer, the ENIAC (1946)
• 18,000 vacuum tubes
• 1,800 instructions/sec
• 3,000 ft³
Computer History
Maurice Wilkes
• 1st stored-program computer: EDSAC 1 (1949)
• 650 instructions/sec
• 1,400 ft³
http://www.cl.cam.ac.uk/UoCCL/misc/EDSAC99/
Intel 4004 Die Photo
• Introduced in 1971
  – First microprocessor
• 2,250 transistors
• 12 mm²
• 108 kHz
Intel 8086 Die Scan
• 29,000 transistors
• 33 mm²
• 5 MHz
• Introduced in 1978
  – Basic architecture of the IA32 PC
Intel 80486 Die Scan
• 1,200,000 transistors
• 81 mm²
• 25 MHz
• Introduced in 1989
  – 1st pipelined implementation of IA32
Pentium Die Photo
• 3,100,000 transistors
• 296 mm²
• 60 MHz
• Introduced in 1993
  – 1st superscalar implementation of IA32
Pentium III
• 9,500,000 transistors
• 125 mm²
• 450 MHz
• Introduced in 1999
http://www.intel.com/intel/museum/25anniv/hof/hof_main.htm
Pentium 4
• 55,000,000 transistors
• 146 mm²
• 3 GHz
• Introduced in 2000
http://www.chip-architect.com
[Die photos: Pentium 4, Core 2 Duo (Merom), Intel Core i7 (Nehalem), Montecito (Itanium 2), Cell Processor, IBM Power 7, Sun UltraSPARC T3]
First Generation (1970s)
Single Cycle Implementation
Second Generation (1980s)
[Pipeline diagram: F → D → I → E → C]
• Pipelining: temporal parallelism
• Number of stages increases with each generation
• Minimum CPI = 1 (worked example below)
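A quick worked example of why 1 is the floor for CPI on a scalar pipeline (a sketch, assuming an ideal k-stage pipeline with no stalls):

```latex
% N instructions on an ideal k-stage pipeline finish in k + (N - 1) cycles,
% since one instruction completes per cycle once the pipeline fills:
\mathrm{CPI} = \frac{k + (N - 1)}{N} \;\longrightarrow\; 1 \quad (N \to \infty)
% Example: k = 5, N = 1000 gives CPI = 1004/1000 = 1.004.
% Hazards and stalls only add cycles, so CPI >= 1 for a scalar pipeline.
```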
Third Generation (1990s)
[Superscalar pipeline diagram: F → D → I → multiple parallel E units → C]
• ILP: spatial parallelism (see the C sketch below)
• Dynamic: superscalar
  – Out-of-order execution (scheduling)
  – Instruction window
  – Speculative execution (prediction)
• Static: VLIW/EPIC
• IPC, not CPI
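A small C sketch (my illustration, not from the slides) of what ILP looks like at the source level: operations with no dependence on each other, which a superscalar out-of-order core can keep in flight simultaneously.

```c
/* Instruction-level parallelism: the two accumulators below carry no
   dependence on each other, so a superscalar out-of-order core can
   issue their multiply-adds in the same cycle; a single accumulator
   would form one serial dependence chain. */
#include <stdio.h>

static double dot_two_accumulators(const double *a, const double *b, int n) {
    double s0 = 0.0, s1 = 0.0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += a[i]     * b[i];      /* independent of the next line...     */
        s1 += a[i + 1] * b[i + 1];  /* ...so both can be in flight at once */
    }
    if (i < n)                      /* odd n: fold in the leftover element */
        s0 += a[i] * b[i];
    return s0 + s1;
}

int main(void) {
    double a[] = {1, 2, 3, 4, 5};
    double b[] = {5, 4, 3, 2, 1};
    printf("dot = %f\n", dot_two_accumulators(a, b, 5));  /* 35.000000 */
    return 0;
}
```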
Fourth Generation (2000s)
[Diagram: two F → D → I → E → C pipelines sharing multiple execution units]
Simultaneous Multithreading (SMT)
(aka Hyper-Threading Technology)
The Famous Moore’s Law
[Diagram: the positive cycle. Hardware improvements from the computer industry enable better software; people get used to the software and ask for more improvements, which the industry then delivers.]
How Did These Advances Happen?
[Diagram: the software community, computer architecture, and process technology interact. The software community contributes wishes and performance demands; process technology contributes capabilities and restrictions; computer architecture turns both into a design.]
Performance in the past was achieved by:
• clock speed
• execution optimization
• cache

Performance now is achieved by:
• hyperthreading
• multicore
• cache
The Status Quo
• We moved from single core to multicore to manycore
  – for technological reasons
• The free lunch is over for software folks
  – Software will no longer get faster with every new generation of processors
• Not enough experience in parallel programming
  – Parallel programs used to be restricted to a few elite applications → very few programmers
  – Now we need parallel programs for many different applications
Old School vs. New School
• Old school: increasing clock frequency is the primary method of performance improvement.
  New school: processor parallelism is the primary method of performance improvement.
• Old school: don't bother parallelizing an application; just wait and run it on a much faster sequential computer.
  New school: nobody is building one processor per chip. This marks the end of the La-Z-Boy programming era.
• Old school: less than linear scaling for a multiprocessor is failure.
  New school: given the switch to parallel hardware, even sub-linear speedups are beneficial as long as you beat the sequential version (worked example below).
Slide source: The Landscape of Parallel Computing Research: A View from Berkeley
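To make the last "new school" point concrete, a worked example with made-up numbers:

```latex
% Speedup on p cores is measured against the best sequential time:
S(p) = \frac{T_{\mathrm{seq}}}{T_{\mathrm{par}}(p)}
% Example: on p = 8 cores, suppose T_par = T_seq / 3, so S(8) = 3.
% That is well below linear scaling (8x), yet the parallel version
% still beats the sequential one by 3x: a win, not a failure.
```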
• Memory Wall
• ILP Wall
• Power Wall
Memory Speed: Widening of the Processor-DRAM Performance Gap
[Figure: processor vs. DRAM performance over time; the gap widens steadily]
Courtesy of Elsevier: Computer Architecture, Hennessy and Patterson, 4th edition
Power Density
Moore’s law is giving us more transistors than we can afford!
Scaling clock speed (business as usual) will not work
[Figure: power density (W/cm²) vs. year, 1970–2010, log scale from 1 to 10,000. Points for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, and P6 climb from around 1 W/cm² past the "hot plate" level, heading toward "nuclear reactor", "rocket nozzle", and "Sun's surface" levels. Source: Patrick Gelsinger, Intel]
Multicore Processors Save Power
Power = C × V² × F          Performance = Cores × F
Let's have two cores:
Power = 2C × V² × F         Performance = 2 × Cores × F
Now decrease frequency by 50% (voltage can scale down with frequency, so V → V/2):
Power = 2C × (V/2)² × (F/2) = C × V² × F / 4
Performance = 2 × Cores × (F/2) = Cores × F
Same performance at a quarter of the power (see the sketch below).
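A minimal C sketch checking the arithmetic above, in normalized units. The assumption that voltage tracks frequency is the slide's implicit premise, made explicit here:

```c
/* Checks the slide's arithmetic. Assumptions: dynamic power
   P = C * V^2 * F, performance ~ cores * F, and voltage can scale
   down proportionally with frequency. */
#include <stdio.h>

int main(void) {
    double C = 1.0, V = 1.0, F = 1.0;  /* normalized capacitance, voltage, frequency */

    /* Baseline: one core at full voltage and frequency. */
    double power1 = C * V * V * F;
    double perf1  = 1 * F;

    /* Two cores at half frequency; voltage tracks frequency (V/2). */
    double v2 = V / 2.0, f2 = F / 2.0;
    double power2 = 2.0 * C * v2 * v2 * f2;  /* = C * V^2 * F / 4   */
    double perf2  = 2.0 * f2;                /* = F, same as baseline */

    printf("1 core : power = %.2f, performance = %.2f\n", power1, perf1);
    printf("2 cores: power = %.2f, performance = %.2f\n", power2, perf2);
    return 0;
}
```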
A Case for Multicore Processors
• Can exploit different types of parallelism
• Reduces power
• An effective way to hide memory latency
• Simpler cores = easier to design and test = higher yield = lower cost
Cost and Challenges of Parallel Execution
• Communication cost
• Synchronization cost (see the pthreads sketch below)
• Not all problems are amenable to parallelization
• Hard to think in parallel
• Hard to debug
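A minimal pthreads sketch (Pthreads is one of the libraries listed later in the deck) showing where synchronization cost comes from: every update to shared state must take a lock, which serializes the threads at that point.

```c
/* Four threads increment one shared counter. Each increment pays the
   cost of acquiring and releasing a mutex, so the threads serialize
   at the critical section. Compile with: cc demo.c -pthread */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   1000000

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&lock);    /* synchronization cost paid here */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* NTHREADS * NITERS */
    return 0;
}
```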
Attempts to Make Multicore Programming Easy
• 1st idea: the right computer language would make parallel programming straightforward
  – Result so far: some languages have made parallel programming easier, but none has made it as fast, efficient, and flexible as traditional sequential programming
Attempts to Make Multicore Programming Easy
• 2nd idea: if you just design the hardware properly, parallel programming would become easy
  – Result so far: no one has yet succeeded!
Attempts to Make Multicore Programming Easy
• 3rd idea: write software that automatically parallelizes existing sequential programs
  – Result so far: success here is inversely proportional to the number of cores!
Qualcomm Snapdragon SoC
Hardware:
Krait processor (4-core ARM)
Adreno (128-core GPU)
Applications:
Nvidia Tegra SoC
Hardware
4-core ARM A57
4-core ARM A53
256-core GPU
Applications
Robotics, CV, Imaging
E.g., BMW Driver Assistance
Tilera Tile-Gx SoC
Hardware
100-core processor
Applications
Network, Cloud, Security
E.g., 10 Gbps Layer-7 network application classification
E.g., 40 Gbps lossless network packet capture
Google Data Center, USA
Parallel Languages/Libraries
Pthread
MPI
CUDA
OpenCL
OpenMP
OpenACC
and many more!
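As a taste of one of the libraries above, a minimal OpenMP sketch: a loop parallelized with a single pragma. The harmonic-sum workload is just a placeholder of mine.

```c
/* One pragma distributes the loop iterations across cores and combines
   the per-thread partial sums. Compile with: cc demo.c -fopenmp */
#include <omp.h>
#include <stdio.h>

int main(void) {
    const int n = 1000000;
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += 1.0 / (i + 1.0);       /* harmonic series, demo workload */

    printf("sum = %f (max threads: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```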
Conclusions
• The free lunch is over.
• Multicore/manycore processors are here to stay, so we have to deal with them.
• Knowing about the hardware will make you way more efficient in software!