Introduction to Parallel Computing
Blaise Barney, Lawrence Livermore National Laboratory
          What is Parallel Computing?
                 Traditionally, software has been written for serial computation:
                      o To be run on a single computer having a single Central Processing Unit (CPU);
                      o A problem is broken into a discrete series of instructions.
                      o Instructions are executed one after another.
                      o Only one instruction may execute at any moment in time.
                 Only one instruction may execute at any moment in time.
                 In the simplest sense, parallel computing is the simultaneous use of multiple compute resources
                  to solve a computational problem:
                      o To be run using multiple CPUs
                      o A problem is broken into discrete parts that can be solved concurrently
                      o Each part is further broken down to a series of instructions
                      o Instructions from each part execute simultaneously on different CPUs
                 The compute resources can include:
                     o A single computer with multiple processors;
                     o An arbitrary number of computers connected by a network;
                     o A combination of both.
                 The computational problem usually demonstrates characteristics such as the ability to be:
                     o Broken apart into discrete pieces of work that can be solved simultaneously;
                     o Execute multiple program instructions at any moment in time;
                     o Solved in less time with multiple compute resources than with a single compute
                        resource.
          The Universe is Parallel:
                  Parallel computing is an evolution of serial computing that attempts to emulate what has always
          been the state of affairs in the natural world: many complex, interrelated events happening at the same
          time, yet within a sequence. For example:
          o   Planetary movement                                o   Rush hour traffic
          o   Weather and ocean patterns                        o   Automobile assembly line
          o   Tectonic plate drift                              o   Building a space shuttle
          o   Ordering a hamburger at the drive                 o   Galaxy Formation
              through
Uses for Parallel Computing:
    Historically, parallel computing has been considered to be "the high end of computing", and has been
       used to model difficult scientific and engineering problems found in the real world. Some examples:
          o   Atmosphere, Earth, Environment
         o   Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics
         o   Bioscience, Biotechnology, Genetics
         o   Chemistry, Molecular Sciences
         o   Geology, Seismology
         o   Mechanical Engineering - from prosthetics to spacecraft
         o   Electrical Engineering, Circuit Design, Microelectronics
         o   Computer Science, Mathematics
     Today, commercial applications provide an equal or greater driving force in the development of faster
      computers. These applications require the processing of large amounts of data in sophisticated ways. For
      example:
         o Databases, data mining
         o Oil exploration
         o Web search engines, web based business services
         o Medical imaging and diagnosis
         o Pharmaceutical design
         o Management of national and multi-national corporations
         o Financial and economic modeling
         o Advanced graphics and virtual reality, particularly in the entertainment industry
         o Networked video and multi-media technologies
         o Collaborative work environments
Why Use Parallel Computing?
     Save time and/or money: In theory, throwing more resources at a task will shorten its time to completion,
      with potential cost savings. Parallel clusters can be built from cheap, commodity components.
     Solve larger problems: Many problems are so large and/or complex that it is impractical or impossible
      to solve them on a single computer, especially given limited computer memory. For example:
          o "Grand Challenge" (en.wikipedia.org/wiki/Grand_Challenge) problems requiring PetaFLOPS and
              PetaBytes of computing resources.
          o Web search engines/databases processing millions of transactions per second
     Provide concurrency: A single compute resource can only do one thing at a time. Multiple computing
      resources can be doing many things simultaneously. For example, the Access Grid (www.accessgrid.org)
      provides a global collaboration network where people from around the world can meet and conduct work
      "virtually".
     Use of non-local resources: Using compute resources on a wide area network, or even the Internet when
      local compute resources are scarce. For example:
          o SETI@home (setiathome.berkeley.edu) uses over 330,000 computers for a compute power over
              528 TeraFLOPS (as of August 04, 2008)
          o Folding@home (folding.stanford.edu) uses over 340,000 computers for a compute power of 4.2
              PetaFLOPS (as of November 4, 2008)
     Limits to serial computing: Both physical and practical reasons pose significant constraints to simply
      building ever faster serial computers:
          o Transmission speeds - the speed of a serial computer is directly dependent upon how fast data can
              move through hardware. Absolute limits are the speed of light (30 cm/nanosecond) and the
              transmission limit of copper wire (9 cm/nanosecond). Increasing speeds necessitate increasing
              proximity of processing elements.
          o Limits to miniaturization - processor technology is allowing an increasing number of transistors to
              be placed on a chip. However, even with molecular or atomic-level components, a limit will be
              reached on how small components can be.
                                                                                                             2
         o   Economic limitations - it is increasingly expensive to make a single processor faster. Using a larger
             number of moderately fast commodity processors to achieve the same (or better) performance is
             less expensive.
      Current computer architectures are increasingly relying upon hardware level parallelism to improve
      performance:
         o   Multiple execution units
         o   Pipelined instructions
         o   Multi-core
Von Neumann Architecture
     Named after the Hungarian mathematician John von Neumann who first authored the general requirements
      for an electronic computer in his 1945 papers.
     Since then, virtually all computers have followed this basic design, which differed from earlier computers
      programmed through "hard wiring".
                                                     o   Comprised of four main components:
                                                              Memory
                                                              Control Unit
                                                              Arithmetic Logic Unit
                                                              Input/Output
                                                     o   Read/write, random access memory is used to store
                                                         both program instructions and data
                                                              Program instructions are coded data which tell
                                                                the computer to do something
                                                              Data is simply information to be used by the
                                                                program
                                                     o   Control unit fetches instructions/data from memory,
                                                         decodes the instructions and then sequentially
                                                         coordinates operations to accomplish the programmed
                                                         task.
                                                     o   Arithmetic Unit performs basic arithmetic operations
                                                     o   Input/Output is the interface to the human operator
Flynn's Classical Taxonomy
     There are different ways to classify parallel computers. One of the more widely used classifications, in
      use since 1966, is called Flynn's Taxonomy.
     Flynn's taxonomy distinguishes multi-processor computer architectures according to how they can be
      classified along the two independent dimensions of Instruction and Data. Each of these dimensions can
      have only one of two possible states: Single or Multiple.
     The matrix below defines the 4 possible classifications according to Flynn:
                                                                                                                3
  Single Instruction, Single Data (SISD):
      A serial (non-parallel) computer
      Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
      Single data: only one data stream is being used as input during any one clock cycle
      Deterministic execution
      This is the oldest and even today, the most common type of computer
          o     Examples: older generation mainframes, minicomputers and workstations; most modern day PCs.
          o Example: UNIVAC 1, IBM 360, CRAY 1, CDC 7600, PDP 1, Dell Laptop
  Single Instruction, Multiple Data (SIMD):
      A type of parallel computer
      Single instruction: All processing units execute the same instruction at any given clock cycle
      Multiple data: Each processing unit can operate on a different data element
      Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image
       processing.
      Synchronous (lockstep) and deterministic execution
      Two varieties: Processor Arrays and Vector Pipelines
      Examples:
           o Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV
           o Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820,
               ETA10
      Most modern computers, particularly those with graphics processor units (GPUs) employ SIMD
       instructions and execution units.
  Multiple Instruction, Single Data (MISD):
      A single data stream is fed into multiple processing units.
      Each processing unit operates on the data independently via
       independent instruction streams.
      Few actual examples of this class of parallel computer have
       ever existed. One is the experimental Carnegie-Mellon
       C.mmp computer (1971).
      Some conceivable uses might be:
           o multiple frequency filters operating on a single
              signal stream
           o multiple cryptography algorithms attempting to
              crack a single coded message.
Multiple Instruction, Multiple Data (MIMD):
      Currently, the most common type of parallel computer. Most modern computers fall into this category.
      Multiple Instruction: every processor may be executing a different instruction stream
      Multiple Data: every processor may be working with a different data stream
      Execution can be synchronous or asynchronous, deterministic or non-deterministic
      Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-
       processor SMP computers, multi-core PCs.
      Note: many MIMD architectures also include SIMD execution sub-components
                                                                                                              4
Concepts and Terminology
Task
       A logically discrete section of computational work. A task is typically a program or program-like set of
       instructions that is executed by a processor.
Parallel Task
       A task that can be executed by multiple processors safely (yields correct results)
Serial Execution
       Execution of a program sequentially, one statement at a time. In the simplest sense, this is what happens
       on a one processor machine. However, virtually all parallel tasks will have sections of a parallel program
       that must be executed serially.
Parallel Execution
       Execution of a program by more than one task, with each task being able to execute the same or different
       statement at the same moment in time.
Pipelining
       Breaking a task into steps performed by different processor units, with inputs streaming through, much
       like an assembly line; a type of parallel computing.
Shared Memory
       From a strictly hardware point of view, describes a computer architecture where all processors have direct
       (usually bus based) access to common physical memory. In a programming sense, it describes a model
       where parallel tasks all have the same "picture" of memory and can directly address and access the same
       logical memory locations regardless of where the physical memory actually exists.
Symmetric Multi-Processor (SMP)
       Hardware architecture where multiple processors share a single address space and access to all resources;
       shared memory computing.
Distributed Memory
       In hardware, refers to network based memory access for physical memory that is not common. As a
       programming model, tasks can only logically "see" local machine memory and must use communications
       to access memory on other machines where other tasks are executing.
Communications
       Parallel tasks typically need to exchange data. There are several ways this can be accomplished, such as
       through a shared memory bus or over a network, however the actual event of data exchange is commonly
       referred to as communications regardless of the method employed.
Synchronization
       The coordination of parallel tasks in real time, very often associated with communications. Often
       implemented by establishing a synchronization point within an application where a task may not proceed
       further until another task(s) reaches the same or logically equivalent point.
       Synchronization usually involves waiting by at least one task, and can therefore cause a parallel
       application's wall clock execution time to increase.
Granularity
      In parallel computing, granularity is a qualitative measure of the ratio of computation to communication.
              Coarse: relatively large amounts of computational work are done between communication events
              Fine: relatively small amounts of computational work are done between communication events
Observed Speedup
      Observed speedup of a code which has been parallelized, defined as:
                                                                                                               5
        wall-clock time of serial execution
        -----------------------------------
        wall-clock time of parallel execution
       One of the simplest and most widely used indicators for a parallel program's performance.
Parallel Overhead
       The amount of time required to coordinate parallel tasks, as opposed to doing useful work. Parallel
       overhead can include factors such as:
              Task start-up time
              Synchronizations
              Data communications
              Software overhead imposed by parallel compilers, libraries, tools, operating system, etc.
              Task termination time
Massively Parallel
      Refers to the hardware that comprises a given parallel system - having many processors. The meaning of
      "many" keeps increasing, but currently, the largest parallel computers can be comprised of processors
      numbering in the hundreds of thousands.
Embarrassingly Parallel
      Solving many similar, but independent tasks simultaneously; little to no need for coordination between
      the tasks.
Scalability
       Refers to a parallel system's (hardware and/or software) ability to demonstrate a proportionate increase in
       parallel speedup with the addition of more processors. Factors that contribute to scalability include:
              Hardware - particularly memory-cpu bandwidths and network communications
              Application algorithm
              Parallel overhead related
              Characteristics of your specific application and coding
Multi-core Processors
       Multiple processors (cores) on a single chip.
Cluster Computing
       Use of a combination of commodity units (processors, networks or SMPs) to build a parallel system.
Supercomputing / High Performance Computing
Use of the world's fastest, largest machines to solve large problems.