Example: Parallelize a Simple Problem
Parallel & Distributed
Computer Systems


Dr. Mohammad Ansari
Course Details
•  Delivery
   ◦ Lectures/discussions: English
   ◦ Assessments: English
   ◦ Ask questions in class if you don’t understand
   ◦ Email me after class if you do not want to ask in class
   ◦ DO NOT LEAVE QUESTIONS TILL THE DAY BEFORE THE EXAM!!!
•  Assessments (this may change)
   ◦ Homework (~1 per week): 10%
   ◦ Midterm: 20%
   ◦ 1 project + final exam OR 2 projects: 35% + 35%
Course Details
•  Textbook
   ◦ Principles of Parallel Programming, Lin & Snyder
•  Other sources of information:
   ◦ COMP 322, Rice University
   ◦ CS 194, UC Berkeley
   ◦ Cilk lectures, MIT
•  Many sources of information on the internet for writing
   parallelized code
Teaching Materials & Assignments
•  Everything is on Jusur
   ◦ Lectures
   ◦ Homework
•  Submit homework through Jusur
•  Homework is given out on Saturday
•  Homework is due the following Saturday
•  You lose 10% for each day late
Homework 1
•  First homework is available on Jusur
   ◦ Install Linux on your computer
     ▪ It is needed for future homework
     ▪ It is needed to access the supercomputers
   ◦ Check settings/hardware
     ▪ Submit pictures of your settings
     ▪ Submit a description of your processor
   ◦ Deadline: 27/03/1431 (submit on Jusur)
Cheating in Homework/Projects
•  Cheating
   ◦ If you cheat, you get zero
   ◦ If you help others cheat, you will also get zero
   ◦ Copy + paste from the Internet, e.g. Wikipedia or elsewhere,
     is also cheating (called plagiarism)
   ◦ You can read any source of information, but you must write
     answers in your own words
   ◦ If you have problems, please ask for help.
Outline
•  Previous lecture:
   ◦ Why study parallel computing?
   ◦ Topics covered in this course
•  This lecture:
   ◦ Example problem
•  Next week:
   ◦ Parallel processor architectures
Example Problem
•  We will parallelize a simple problem
•  Begin to explore some of the issues related to parallel
   programming, and performance of parallel programs
Example Problem: Array Sum
•  Add all the numbers in a large array
•  It has 100 million elements

   int size = 100000000;
   int array[] = {7,3,15,10,13,18,6,4,…};

•  What code should we write for a sequential program?
Example Problem: Sequential
int sum = 0;
int i = 0;
for(i = 0; i < size; i++) {
      sum += array[i]; //sum=sum+array[i];
}
How Do We Parallelize?
•  Objective: thinking about parallelism
   ◦ Multiple processors need something to do
     ▪ A program/software has to be split into parts
     ▪ Each part can be executed on a different processor
   ◦ How do we improve performance over a single processor?
     ▪ If a problem takes 2 seconds on a single processor,
     ▪ and we break it into two (equal) parts: 1 second for each part,
     ▪ and we execute the two parts separately, but in parallel, on
       two processors, then we improve performance
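
In other words, the expected gain is just the ratio of the two running
times:

$$\text{speedup} = \frac{T_{\text{sequential}}}{T_{\text{parallel}}} = \frac{2\text{ s}}{1\text{ s}} = 2$$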
How Do We Parallelize?

 Time   Sequential          Parallel
        CPU 0               CPU 0     CPU 1

 0-1    Part 0              Part 0    Part 1

 1-2    Part 1              (done)    (done)
How Do We Start Parallelizing?
•  What parts can be done separately?
   ◦ What parts can we do on separate processors?
   ◦ Meaning: what parts have no data dependence
   ◦ Data dependence:
     ▪ The execution of an instruction (or line of code) depends on
       the execution of a previous instruction (or line of code).
   ◦ Data independence:
     ▪ The execution of an instruction (or line of code) does not
       depend on the execution of a previous instruction (or line
       of code).
Example of Data Dependence
int x = 0;
int y = 5;

x = 3;
y = y + x; //Is this line dependent on the previous line?
Data Dependence & Parallelism
•  In a sequential program, data dependence does not matter: each
   instruction executes in sequence.
   ◦ Instructions execute one by one
•  In a parallel program, data independence allows parallel execution
   of instructions. Data dependence prevents parallel execution of
   instructions.
   ◦ Reduces parallel performance
   ◦ Reduces the number of processors that can be used
Why is Data Dependence Bad For
Parallel Programs?
•  Does not allow correct parallel execution

       CPU0              CPU1
       x = 3;            y = y + x;

   Result: x = 3;        y = 5  //(5 + 0): CPU1 read the old x = 0
Why is Data Dependence Bad For
Parallel Programs?
•  Does not allow correct parallel execution

       CPU0              CPU1
       x = 3;            WAIT
                         y = y + x;

   Result: x = 3;        y = 8  //(5 + 3): CPU1 waited for the new x
Why is Data Dependence Bad For
Parallel Programs?
•  Does not allow correct parallel execution

       CPU0
       x = 3;
       y = y + x;

   Result: x = 3;  y = 8  //correct, but sequential on one CPU
Example of Data Independence
int x = 0;
int y = 5;

x = 3;
y = y + 5; //Is this line dependent on the previous line?
Why is Data Independence Useful?
•  Allows correct parallel execution

       CPU0              CPU1
       x = 3;            y = y + 5;

   Result: x = 3;        y = 10
Back to Array Sum Example
Does the code have data dependence?

int sum = 0;
for(int i = 0; i < size; i++) {
     sum += array[i]; //sum=sum+array[i];
}
Back to Array Sum Example
Does the code have data dependence?

int sum = 0;
for(int i = 0; i < size; i++) {
     sum += array[i]; //sum=sum+array[i];
}

Not so easy to see
Back to Array Sum Example
Let’s unroll the loop:

int sum = 0;
sum += array[0];   //sum=sum+array[0];
sum += array[1];   //sum=sum+array[1];
sum += array[2];   //sum=sum+array[2];
sum += array[3];   //sum=sum+array[3];
…

Now we can see the dependence: each line reads the sum written by the
line before it!
Removing Dependencies
•  Sometimes this is possible.
   ◦ Dependencies are discussed in detail later.
•  Tip: it can be useful to look at the problem being solved by the
   code, and not the code itself (see the regrouping below).
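
Why is this allowed for the array sum? Addition is associative, so the
total can be regrouped into two independent partial sums (a one-line
justification, assuming n is even for simplicity):

$$\sum_{i=0}^{n-1} a_i \;=\; \underbrace{\sum_{i=0}^{n/2-1} a_i}_{S_0 \text{ on } P_0} \;+\; \underbrace{\sum_{i=n/2}^{n-1} a_i}_{S_1 \text{ on } P_1}$$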
Break Sum into Pieces
            P0                        P1
    [ 7  3  1  0  2 ]        [ 9  5  8  3  6 ]
            |                         |
            v                         v
            S0                        S1
             \                       /
              \-----> SUM <---------/
Some Details…
•  A program executes inside a process
•  If we want to use multiple processors
   ◦ We need multiple processes
   ◦ One process for each processor (not a fixed rule)
•  Processes are big and heavyweight
•  Threads are lighter than processes
   ◦ But the same strategy applies
   ◦ One thread for each processor (not a fixed rule)
•  We will talk about threads and processes later, if necessary
   (a small sketch of the difference follows)
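
As a rough illustration of the difference, here is a minimal POSIX
sketch contrasting the two; fork() and pthread_create() are
assumptions for illustration (the slides do not prescribe an API),
and it compiles with -pthread:

/* Sketch: process vs. thread creation on a POSIX system. */
#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

void *thread_work(void *arg) {
    printf("thread: shares the creating program's memory\n");
    return NULL;
}

int main(void) {
    //Process: fork() duplicates the whole address space (heavyweight)
    pid_t pid = fork();
    if (pid == 0) {
        printf("child process: has its own private copy of memory\n");
        _exit(0);
    }
    wait(NULL);

    //Thread: a lighter flow of control that shares memory with main()
    pthread_t t;
    pthread_create(&t, NULL, thread_work, NULL);
    pthread_join(t, NULL);
    return 0;
}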
What Does the Code Look Like?
int numThreads = 2; //Assume one thread per core, and 2 cores
int sum = 0;
int i = 0;
int middleSum[numThreads]; //One partial sum per thread
int threadSetSize = size/numThreads;
for(i = 0; i < numThreads; i++) middleSum[i] = 0; //Partial sums must start at 0

//Each thread will execute this code with a different threadID
for(i = threadID*threadSetSize; i < (threadID+1)*threadSetSize; i++)
{
    middleSum[threadID] += array[i];
}

//Only thread 0 will execute this code
if (threadID==0) {
    for(i = 0; i < numThreads; i++) {
        sum += middleSum[i];
    }
}
Load Balancing
•  Which processor is doing more work?

            P0                        P1
       [ 7  3  1 ]          [ 0  2  9  5  8  3  6 ]
            |                         |
            v                         v
            S0                        S1
             \                       /
              \-----> SUM <---------/
Load Balancing

 Time      Sequential: P0     Parallel: P0        Parallel: P1

 0.0-1.0   Part 0             Part 0 (finishes    Part 1
                              early: less work)
 1.0-1.3   Part 1             (done)              Part 1
 1.3-2.0   Part 1             (done)              (done)

 With an imbalanced split, the parallel run finishes at 1.3, not 1.0.
Example Problem: Array Sum
•  Parallelized code is more complex
•  Requires us to think differently about how to solve the problem
   ◦ Need to think about breaking it into parts
   ◦ Analyze data dependencies, and remove them if possible
   ◦ Need to load balance for better performance (a sketch of one
     balanced split follows)
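
A minimal sketch of one balanced split, reusing the threadID and
numThreads names from the code slide; the remainder handling is an
assumption, not from the slides:

/* Sketch: compute a near-equal [start, end) range per thread, even
   when size is not divisible by numThreads; the first
   (size % numThreads) threads take one extra element each. */
#include <stdio.h>

int main(void) {
    int size = 10, numThreads = 2;   //the 10-element array pictured above
    int base  = size / numThreads;   //minimum elements per thread
    int extra = size % numThreads;   //leftover elements to spread out
    for (int threadID = 0; threadID < numThreads; threadID++) {
        int start = threadID * base + (threadID < extra ? threadID : extra);
        int end   = start + base + (threadID < extra ? 1 : 0);
        printf("thread %d sums array[%d..%d)\n", threadID, start, end);
    }
    return 0; //prints a 5/5 split, unlike the 3/7 split in the picture
}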
Example Problem: Array Sum
•  However, the parallel code is broken
   ◦ Thread 0 adds all the middle sums.
   ◦ What if thread 0 finishes its own work, but other threads
     have not?
Synchronization
•  P0 will probably finish before P1

            P0                        P1
       [ 7  3  1 ]          [ 0  2  9  5  8  3  6 ]
            |                         |
            v                         v
            S0                        S1
             \                       /
              \-----> SUM <---------/
How Can We Fix The Code to
GUARANTEE It Works Correctly?
int numThreads = 2; //Assume one thread per core, and 2 cores
int sum = 0;
int i = 0;
int middleSum[numThreads]; //One partial sum per thread
int threadSetSize = size/numThreads;
for(i = 0; i < numThreads; i++) middleSum[i] = 0; //Partial sums must start at 0

//Each thread will execute this code with a different threadID
for(i = threadID*threadSetSize; i < (threadID+1)*threadSetSize; i++)
{
    middleSum[threadID] += array[i];
}

//Only thread 0 will execute this code
if (threadID==0) {
    for(i = 0; i < numThreads; i++) {
        sum += middleSum[i];
    }
}
Synchronization
•  Sometimes we need to coordinate/organize threads
•  If we don't, the code might calculate the wrong answer to the
   problem
•  This can happen even if the load balance is perfect
•  Synchronization is concerned with this coordination/organization
Code with Synchronization Fixed
int numThreads = 2; //Assume one thread per core, & 2 cores
int sum = 0;
int i = 0;
int middleSum[numThreads]; //One partial sum per thread
int threadSetSize = size/numThreads;
for(i = 0; i < numThreads; i++) middleSum[i] = 0; //Partial sums must start at 0

//Each thread will execute this code with a different threadID
for(i = threadID*threadSetSize; i < (threadID+1)*threadSetSize; i++)
{
    middleSum[threadID] += array[i];
}

waitForAllThreads(); //Barrier: wait for all threads to finish their part

//Only thread 0 will execute this code
if (threadID==0) {
    for(i = 0; i < numThreads; i++) {
        sum += middleSum[i];
    }
}
Synchronization
•  The example shows a barrier
•  This is one type of synchronization
•  Barriers require all threads to reach that point in the code
   before any thread is allowed to continue
•  It is like a gate: all threads come to the gate, and then it
   opens (see the sketch below)
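
A minimal runnable sketch of this gate behavior, assuming POSIX
threads: waitForAllThreads() on the earlier slide is not a real
library call, so pthread_barrier_wait() plays its role here
(compile with -pthread):

/* Sketch: a pthread barrier as a gate. No thread passes
   pthread_barrier_wait() until all NUM_THREADS have arrived. */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 2

pthread_barrier_t gate;

void *worker(void *arg) {
    int id = *(int *)arg;
    printf("thread %d: finished its part, waiting at the gate\n", id);
    pthread_barrier_wait(&gate);   //the gate opens here
    printf("thread %d: gate open, continuing\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];
    pthread_barrier_init(&gate, NULL, NUM_THREADS);
    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    pthread_barrier_destroy(&gate);
    return 0;
}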
Generalizing the Solution
•  We only looked at how to parallelize for 2 threads
•  But the code is more general
   ◦ It can use any number of threads
   ◦ It is important that code is written this way
   ◦ We will look at this in more detail later (one possible
     generalization is sketched below)
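
One possible generalization with POSIX threads; everything beyond the
slide code (thread creation, the smaller array size, the last thread
taking the remainder) is an assumption for illustration:

/* Sketch: array sum for any NUM_THREADS, with waitForAllThreads()
   played by a pthread barrier. Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>

#define SIZE 1000000          //smaller than the slides' 100 million
#define NUM_THREADS 4

int array[SIZE];
long long middleSum[NUM_THREADS];   //one partial sum per thread
long long sum = 0;
pthread_barrier_t barrier;

void *worker(void *arg) {
    int threadID = *(int *)arg;
    int threadSetSize = SIZE / NUM_THREADS;
    int start = threadID * threadSetSize;
    //the last thread also takes the remainder, if any
    int end = (threadID == NUM_THREADS - 1) ? SIZE : start + threadSetSize;
    for (int i = start; i < end; i++)
        middleSum[threadID] += array[i];
    pthread_barrier_wait(&barrier);     //waitForAllThreads()
    if (threadID == 0)                  //only thread 0 combines
        for (int t = 0; t < NUM_THREADS; t++)
            sum += middleSum[t];
    return NULL;
}

int main(void) {
    for (int i = 0; i < SIZE; i++) array[i] = 1;   //sample data: sum = SIZE
    pthread_barrier_init(&barrier, NULL, NUM_THREADS);
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];
    for (int t = 0; t < NUM_THREADS; t++) {
        ids[t] = t;
        pthread_create(&threads[t], NULL, worker, &ids[t]);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
    pthread_barrier_destroy(&barrier);
    printf("sum = %lld (expected %d)\n", sum, SIZE);
    return 0;
}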
Parallel Program
Performance
•  Now the program is correct
•  Let's look at performance

   [Chart: execution time on a 2-core processor, normalized to
    1 thread, for 1, 2, and 4 threads; 2 threads is faster but
    less than 2x faster, and 4 threads is slower than 2 threads]
Performance
•  Two threads are not 2x faster. Why?
   ◦ The problem is called false sharing
   ◦ To understand this, we have to look at the computer
     architecture
   ◦ We will study this in the next lecture (a small preview is
     sketched below)
•  Four threads are slower than two threads. Why?
   ◦ The processor only has two cores
   ◦ Four threads add scheduling overhead and waste time
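
As a speculative preview: middleSum[0] and middleSum[1] sit next to
each other in memory, so the two cores repeatedly invalidate each
other's copy of the same cache line. One common remedy, sketched
below, pads each partial sum to its own line (the 64-byte line size
is an assumption; details next lecture):

/* Sketch: pad each partial sum so the two threads' counters live on
   different cache lines and stop invalidating each other. */
#include <stdio.h>

#define CACHE_LINE 64          //assumed line size; hardware-specific
#define NUM_THREADS 2

struct padded_sum {
    long long value;
    char pad[CACHE_LINE - sizeof(long long)];
};

struct padded_sum middleSum[NUM_THREADS];

int main(void) {
    printf("each partial sum now occupies %zu bytes (one cache line)\n",
           sizeof(struct padded_sum));
    return 0;
}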
Summary
•  Used an example to start looking at how to parallelize code, and
   some of the main issues
   ◦ Data dependence
   ◦ Load balancing
   ◦ Synchronization
•  Each will be discussed in more detail in later lectures
