Teaching ECE408 at Scale
Abdul Dakkak
Overview
IMPACT
History
Architecture
Some data
Lessons
Future
Quick Demo
IMPACT
Nothing similar exists
Possibly the most visible project in our group
Works for CUDA/OpenCL/OpenACC
Thousands of users spent hundreds of hours on the site
Production vs. research implementation
Objective
Create a programming environment for the Coursera course
Allow people to develop outside of the environment
Not be tied to Coursera (we want to offer it for summer school)
Previous System
Many user interface issues
Grading was done offline
Had trouble with scaling
Did not have peer review
New System
Instant grading and submission back to Coursera
Peer review implemented
Scaling was kept in mind
Making a new MP is simple
How do you grade 4000 Programs?
You don't.
Architecture
Life of a Program Submission
The user interacts with a web server
The web server dispatches the job to one of many workers (a toy sketch of this pattern follows the diagram)
Grades are communicated back to Coursera
[Diagram: the web server dispatches library + user code jobs to workers; results are stored in the DB and reported to Coursera]
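To make the dispatch step concrete, here is a minimal, self-contained sketch of the pattern in C++. It is not the production code; the Job struct, the worker count, and the console output are invented for the illustration.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// A submitted program: an id plus the user code to compile and run.
struct Job {
  int id;
  std::string userCode;
};

std::queue<Job> jobQueue;           // jobs waiting for a worker
std::mutex queueMutex;              // protects jobQueue and shuttingDown
std::condition_variable queueCond;  // signals workers that a job arrived
bool shuttingDown = false;

// Each worker loops: take the next job, "grade" it, report the result.
void workerLoop(int workerId) {
  for (;;) {
    std::unique_lock<std::mutex> lock(queueMutex);
    queueCond.wait(lock, [] { return !jobQueue.empty() || shuttingDown; });
    if (jobQueue.empty()) return;  // shutting down and nothing left to do
    Job job = jobQueue.front();
    jobQueue.pop();
    lock.unlock();
    // In the real system this is where the code would be compiled,
    // sandboxed, run on a GPU, and the grade sent back to Coursera.
    std::cout << "worker " << workerId << " graded job " << job.id << "\n";
  }
}

// The web-server side: enqueue a submission and wake up a worker.
void dispatch(const Job &job) {
  {
    std::lock_guard<std::mutex> lock(queueMutex);
    jobQueue.push(job);
  }
  queueCond.notify_one();
}

int main() {
  std::vector<std::thread> workers;
  for (int i = 0; i < 4; ++i) workers.emplace_back(workerLoop, i);
  for (int i = 0; i < 10; ++i) dispatch(Job{i, "// user code"});
  {
    std::lock_guard<std::mutex> lock(queueMutex);
    shuttingDown = true;
  }
  queueCond.notify_all();
  for (auto &w : workers) w.join();
  return 0;
}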
Detailed Architecture
Source of Bugs
Made it public, so got some help from students
Correlated and a source of major problems (happens on Cyclone and AWS)
Incorrect or inefficient queries
Security
The user's code is sandboxed (a sketch of the idea follows this slide)
The workers, the web server, and the DB all run with no privileges
A simple proxy server routes all port 80 requests to the web server
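As a sketch of the sandboxing idea only (not the production sandbox; the uid/gid values and the binary path are placeholders), a worker process can drop to an unprivileged user before executing the student's compiled program:

#include <cstdio>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Run the compiled student binary as an unprivileged user.
// NOBODY_UID/NOBODY_GID and the binary path are placeholders for this sketch.
static const uid_t NOBODY_UID = 65534;
static const gid_t NOBODY_GID = 65534;

int runSandboxed(const char *binaryPath) {
  pid_t pid = fork();
  if (pid < 0) {
    perror("fork");
    return -1;
  }
  if (pid == 0) {
    // Child: drop the group first, then the user, so root cannot be regained.
    if (setgid(NOBODY_GID) != 0 || setuid(NOBODY_UID) != 0) {
      perror("drop privileges");
      _exit(1);
    }
    char *const argv[] = {const_cast<char *>(binaryPath), nullptr};
    execv(binaryPath, argv);
    perror("execv");  // only reached if exec failed
    _exit(1);
  }
  int status = 0;
  waitpid(pid, &status, 0);  // parent: wait for the student program
  return status;
}

int main() {
  // Placeholder path to the student's compiled machine problem.
  return runSandboxed("/tmp/mp_submission");
}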
How to Scale?
Each request is handled by a lightweight thread (there are thousands of these)
The lightweight threads get mapped onto n OS threads (n being the number of cores)
A connection pool is maintained with the database server
Most operations are asynchronous
The master can communicate with any number of workers
A worker will run programs on different GPUs if they are available (a device-selection sketch follows)
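For the last point, a hedged sketch of how a worker might choose a GPU using the standard CUDA runtime API. The actual worker logic is not shown in the deck, and the "most free memory" heuristic is an assumption made for the example.

#include <cstdio>
#include <cuda_runtime.h>

// Pick the GPU with the most free memory so concurrent jobs on one worker
// machine spread across the available devices. Illustration only.
int pickLeastLoadedGpu() {
  int deviceCount = 0;
  if (cudaGetDeviceCount(&deviceCount) != cudaSuccess || deviceCount == 0)
    return -1;  // no usable GPU on this worker

  int bestDevice = 0;
  size_t bestFree = 0;
  for (int dev = 0; dev < deviceCount; ++dev) {
    cudaSetDevice(dev);
    size_t freeBytes = 0, totalBytes = 0;
    if (cudaMemGetInfo(&freeBytes, &totalBytes) == cudaSuccess &&
        freeBytes > bestFree) {
      bestFree = freeBytes;
      bestDevice = dev;
    }
  }
  cudaSetDevice(bestDevice);
  return bestDevice;
}

int main() {
  int dev = pickLeastLoadedGpu();
  if (dev < 0) {
    std::printf("no GPU available\n");
    return 1;
  }
  std::printf("running submission on GPU %d\n", dev);
  // ... compile and launch the student's kernels on this device ...
  return 0;
}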
Scale
What works for 2 people may not work for 1000 people
What works for 1000 people may not work when all are logged in at the same time
What works for 1000 people on day 1 may not work for 1000 people on day 50
Lessons Learned (Implementation)
Abstraction is not a Good Thing
At the start of the course, we were using a library to simplify database queries. It did simplify them, but it generated complex queries that were impossible to debug.
Abstraction is a Great Thing
When we replaced the library, we only had to replace the model code in the application.
Lessons Learned (Human)
Lessons Learned
Negative criticism speaks louder than positive feedback
People do not read the documentation or search the forums
You will feel discouraged and moody for the rest of the day
You will answer the same question over and over and over again
People will send you their code and ask you to debug it
They will feel like you are not doing your job if you tell them no
Data
Data Collected
Google Analytics
A 13 GB database containing:
  2 million program revisions
  700 thousand program runs
  runtime, compilation time, and errors (a timing sketch follows this list)
  6 thousand graded programs
A 7 GB event database containing:
  CPU/GPU information
  which pages users visited
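How per-run GPU time could be recorded is sketched below with CUDA events. This is an illustration rather than the system's actual instrumentation, and studentKernel stands in for whatever kernel a submission launches.

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for a kernel from a student submission.
__global__ void studentKernel(float *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;
}

int main() {
  const int n = 1 << 20;
  float *d_data = nullptr;
  cudaMalloc(&d_data, n * sizeof(float));

  // Time the kernel with CUDA events, the same way a grader could
  // record per-run GPU time for later analysis.
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);
  studentKernel<<<(n + 255) / 256, 256>>>(d_data, n);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  std::printf("kernel time: %.3f ms\n", ms);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(d_data);
  return 0;
}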
Thousands of Visitors
Users from All over the World
People Spend Time on the Site
Grades
Most People Passed
939 actually started the course [FIXME: this number is not correct]
989 got over 80%
648 got over 100%
Some Data Insights
We made all of the course MPs available on day 1
Everyone waited until the last minute
Every MP has 2 bumps: one for submitting the code and another for the peer review
People Submit at the Last Minute
[Grade timeline, annotated with the MP1 and MP2,3 due dates and a website outage]
Students Work
[Program save timeline]
Peer Review Relies on Participation
Some people do not give or receive feedback
Some people receive feedback but do not give any
Some people give feedback but do not receive any
[Peer-review participation graph for the Image Convolution MP]
Data Analysis Opportunities
SGEMM Implementation

#define BLOCK_SIZE 16
__global__ void MatMulKernel(float *DA, float *DB, float *DC, int Ah, int Aw,
                             int AwTiles, int Bh, int Bw) {
  // Block row and column
  int blockRow = blockIdx.y;
  int blockCol = blockIdx.x;
  // Thread row and column within Csub
  int row = threadIdx.y;
  int col = threadIdx.x;

  int cellRow = blockRow * BLOCK_SIZE + row;
  int cellCol = blockCol * BLOCK_SIZE + col;

  // Each thread computes one element of Csub
  // by accumulating results into Cvalue
  float Cvalue = 0.0;

  // Loop over all the sub-matrices of A and B that are
  // required to compute Csub.
  // Multiply each pair of sub-matrices together
  // and accumulate the results.
  #pragma unroll
  for (int m = 0; m < AwTiles; ++m) {
    // Shared memory used to store Asub and Bsub respectively
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    // Load Asub and Bsub from device memory to shared memory.
    // Each thread loads one element of each sub-matrix.
    int aRow = BLOCK_SIZE * blockRow + row;
    int aCol = BLOCK_SIZE * m + col;
    float aValue = ((aRow < Ah) && (aCol < Aw));
    aValue *= DA[Aw * aRow + aCol];
    As[row][col] = aValue;

    int bRow = BLOCK_SIZE * m + row;
    int bCol = BLOCK_SIZE * blockCol + col;
    float bValue = ((bRow < Bh) && (bCol < Bw));
    bValue *= DB[Bw * bRow + bCol];
    Bs[row][col] = bValue;

    // Synchronize to make sure the sub-matrices are loaded
    // before starting the computation
    __syncthreads();

    // Multiply Asub and Bsub together
    for (int e = 0; e < BLOCK_SIZE; ++e)
      Cvalue += As[row][e] * Bs[e][col];

    // Synchronize to make sure that the preceding
    // computation is done before loading two new
    // sub-matrices of A and B in the next iteration
    __syncthreads();
  }

  if (cellRow >= Ah)
    return;
  if (cellCol >= Bw)
    return;
  DC[cellRow * Bw + cellCol] = Cvalue;
}

__global__ void nop() {}

// CITE: Vasily Volkov, UC Berkeley. Interesting approach to using
// caching a bit differently... had to give it a try...
__device__ void saxpy(float a, float *b, float *c) {
  c[0] += a * b[0];
  c[1] += a * b[1];
  c[2] += a * b[2];
  c[3] += a * b[3];
  c[4] += a * b[4];
  c[5] += a * b[5];
  c[6] += a * b[6];
  c[7] += a * b[7];
  c[8] += a * b[8];
  c[9] += a * b[9];
  c[10] += a * b[10];
  c[11] += a * b[11];
  c[12] += a * b[12];
  c[13] += a * b[13];
  c[14] += a * b[14];
  c[15] += a * b[15];
}

__global__ void optimisedDLA(const float *A, int lda, const float *B, int ldb,
                             float *C, int ldc, int k) {
  const int inx = threadIdx.x;
  const int iny = threadIdx.y;
  const int ibx = blockIdx.x * 64;
  const int iby = blockIdx.y * 16;
  const int id = inx + iny * 16;

  A += ibx + id;
  B += inx + __mul24(iby + iny, ldb);
  C += ibx + id + __mul24(iby, ldc);

  const float *Blast = B + k;
  float c[16] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

  do {
    float a[4] = {A[0 * lda], A[1 * lda], A[2 * lda], A[3 * lda]};

    __shared__ float bs[16][17];
    bs[inx][iny] = B[0 * ldb];
    bs[inx][iny + 4] = B[4 * ldb];
    bs[inx][iny + 8] = B[8 * ldb];
    bs[inx][iny + 12] = B[12 * ldb];
    __syncthreads();

    A += 4 * lda;
    saxpy(a[0], &bs[0][0], c); a[0] = A[0 * lda];
    saxpy(a[1], &bs[1][0], c); a[1] = A[1 * lda];
    saxpy(a[2], &bs[2][0], c); a[2] = A[2 * lda];
    saxpy(a[3], &bs[3][0], c); a[3] = A[3 * lda];

    A += 4 * lda;
    saxpy(a[0], &bs[4][0], c); a[0] = A[0 * lda];
    saxpy(a[1], &bs[5][0], c); a[1] = A[1 * lda];
    saxpy(a[2], &bs[6][0], c); a[2] = A[2 * lda];
    saxpy(a[3], &bs[7][0], c); a[3] = A[3 * lda];

    A += 4 * lda;
    saxpy(a[0], &bs[8][0], c);  a[0] = A[0 * lda];
    saxpy(a[1], &bs[9][0], c);  a[1] = A[1 * lda];
    saxpy(a[2], &bs[10][0], c); a[2] = A[2 * lda];
    saxpy(a[3], &bs[11][0], c); a[3] = A[3 * lda];

    A += 4 * lda;
    saxpy(a[0], &bs[12][0], c);
    saxpy(a[1], &bs[13][0], c);
    saxpy(a[2], &bs[14][0], c);
    saxpy(a[3], &bs[15][0], c);

    B += 16;
    __syncthreads();
  } while (B < Blast);

  for (int i = 0; i < 16; i++, C += ldc)
    C[0] = c[i];
}
Analysis Opportunities
What errors are most common?
What optimizations are most beneficial? Can we detect those and give feedback?
Which ones did users have the most problems with?
What causes GPUs to crash? Can we avoid those crashes? (a crash-detection sketch follows this list)
Did people plagiarize?
What does the peer review tell us?
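On crash detection: a minimal sketch (not the grader's real code) of catching launch and execution errors with the CUDA runtime API so they can be logged with the submission. The intentionally bad launch configuration is contrived for the example.

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for a kernel from a student submission.
__global__ void studentKernel(float *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = 1.0f;
}

// Check for both launch errors and asynchronous execution errors,
// returning the error string a grader could store with the run.
const char *checkLastKernel() {
  cudaError_t err = cudaGetLastError();  // launch/configuration errors
  if (err == cudaSuccess)
    err = cudaDeviceSynchronize();       // errors raised during execution
  return (err == cudaSuccess) ? nullptr : cudaGetErrorString(err);
}

int main() {
  const int n = 1024;
  float *d_data = nullptr;
  cudaMalloc(&d_data, n * sizeof(float));

  // 2048 threads per block exceeds the hardware limit, so this launch
  // fails and checkLastKernel() reports the configuration error.
  studentKernel<<<1, 2048>>>(d_data, n);
  if (const char *msg = checkLastKernel())
    std::printf("submission failed: %s\n", msg);
  else
    std::printf("submission ran cleanly\n");

  cudaFree(d_data);
  return 0;
}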
Source of Data
This is a big data problem
Different analyses can be performed on different parts:
  program analysis on the programs
  power analysis on the recorded GPU power draw (a power-sampling sketch follows this list)
  NLP on the questions and peer reviews
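One way the GPU power draw could have been sampled is through NVML, sketched below. The deck does not say how the power data was actually recorded, so treat this as an assumption; the one-second sampling loop is illustrative only.

#include <cstdio>
#include <nvml.h>
#include <unistd.h>

// Sample the power draw of GPU 0 once a second for ten seconds.
// Compile with: nvcc power_sample.cu -lnvidia-ml
int main() {
  if (nvmlInit() != NVML_SUCCESS) {
    std::fprintf(stderr, "NVML initialization failed\n");
    return 1;
  }

  nvmlDevice_t device;
  if (nvmlDeviceGetHandleByIndex(0, &device) != NVML_SUCCESS) {
    std::fprintf(stderr, "no GPU found\n");
    nvmlShutdown();
    return 1;
  }

  for (int sample = 0; sample < 10; ++sample) {
    unsigned int milliwatts = 0;
    if (nvmlDeviceGetPowerUsage(device, &milliwatts) == NVML_SUCCESS)
      std::printf("power draw: %.1f W\n", milliwatts / 1000.0);
    sleep(1);  // one sample per second
  }

  nvmlShutdown();
  return 0;
}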
What would be different?
Lessons Learned
Volunteer TAs were very helpful and allowed me to spend less than 30 hours a day on the forums
Not everyone will perform peer reviews, which means that not everyone will get feedback
Some people just criticize for the sake of it; you need to develop a thicker skin
Current Work
Current Work
Building up tools for data analysis
Coursework projects:
  porting Parboil to threaded Java and RenderScript
  ZOne, a compiler/language to explore Map/Reduce compiler optimizations
  optimizing machine learning and vision applications
Questions?