KEMBAR78
JavaScript on the GPU | PDF
If you don’t get this ref...shame on you
Jarred Nicholls
  @jarrednicholls
jarred@webkit.org
Work @ Sencha
Web Platform Team

   Doing webkitty things...
WebKit Committer
Co-Author
W3C Web Cryptography
        API
JavaScript on the GPU
What I’ll blabber about today
Why JavaScript on the GPU
Running JavaScript on the GPU

What’s to come...
Why JavaScript on the GPU?
Why JavaScript on the GPU?

        Better question:
         Why a GPU?
Why JavaScript on the GPU?

        Better question:
         Why a GPU?

        A: They’re fast!
         (well, at certain things...)
GPUs are fast b/c...
Totally different paradigm from CPUs
Data parallelism vs. Task parallelism
Stream processing vs. Sequential processing
    GPUs can divide-and-conquer
Hardware capable of a large number of “threads”
    e.g. ATI Radeon HD 6770m:
    480 stream processing units == 480 cores
Typically very high memory bandwidth
Many, many GigaFLOPs
GPUs don’t solve all problems
Not all tasks can be accelerated by GPUs
Tasks must be parallelizable, i.e.:
    Side effect free
    Homogeneous and/or streamable
Overall tasks will become limited by Amdahl’s Law
Let’s find out...
Experiment
Code Name “LateralJS”
LateralJS

Our Mission
To make JavaScript a first-class citizen on all GPUs
and take advantage of hardware accelerated
operations & data parallelization.
Our Options
          OpenCL                 Nvidia CUDA
AMD, Nvidia, Intel, etc.   Nvidia only
A shitty version of C99    C++ (C for CUDA)
No dynamic memory          Dynamic memory
No recursion               Recursion
No function pointers       Function pointers
Terrible tooling           Great dev. tooling
Immature (arguably)        More mature (arguably)
Our Options
          OpenCL                 Nvidia CUDA
AMD, Nvidia, Intel, etc.   Nvidia only
A shitty version of C99    C++ (C for CUDA)
No dynamic memory          Dynamic memory
No recursion               Recursion
No function pointers       Function pointers
Terrible tooling           Great dev. tooling
Immature (arguably)        More mature (arguably)
Why not a Static Compiler?
We want full JavaScript support
    Object / prototype
    Closures
    Recursion
    Functions as objects
    Variable typing
Type Inference limitations
Reasonably limited to size and complexity of “kernel-
esque” functions
Not nearly insane enough
Why an Interpreter?
We want it all baby - full JavaScript support!
Most insane approach
Challenging to make it good, but holds a lot of promise
OpenCL Headaches
Oh the agony...
Multiple memory spaces - pointer hell
No recursion - all inlined functions
No standard libc libraries
No dynamic memory
No standard data structures - apart from vector ops
Buggy ass AMD/Nvidia compilers
Multiple Memory Spaces
In the order of fastest to slowest:
         space                 description
                   very fast
         private   stream processor cache (~64KB)
                   scoped to a single work item
                   fast
          local    ~= L1 cache on CPUs (~64KB)
                   scoped to a single work group
                   slow, by orders of magnitude
         global    ~= system memory over slow bus
        constant   available to all work groups/items
                   all the VRAM on the card (MBs)
Memory Space Pointer Hell
global uchar* gptr = 0x1000;
local uchar* lptr = (local uchar*) gptr; // FAIL!
uchar* pptr = (uchar*) gptr; // FAIL! private is implicit


                               0x1000




             global             local            private




            0x1000 points to something different
              depending on the address space!
Memory Space Pointer Hell
           Pointers must always be fully qualified
                Macros to help ease the pain

#define   GPTR(TYPE)   global TYPE*
#define   CPTR(TYPE)   constant TYPE*
#define   LPTR(TYPE)   local TYPE*
#define   PPTR(TYPE)   private TYPE*
No Recursion!?!?!?
  No call stack
  All functions are inlined to the kernel function


uint factorial(uint n) {
    if (n <= 1)
         return 1;
    else
         return n * factorial(n - 1); // compile-time error
}
No standard libc libraries
memcpy?
strcpy?
strcmp?
etc...
No standard libc libraries
                              Implement our own
#define MEMCPY(NAME, DEST_AS, SRC_AS) 
    DEST_AS void* NAME(DEST_AS void*, SRC_AS const void*, uint); 
    DEST_AS void* NAME(DEST_AS void* dest, SRC_AS const void* src, uint size) { 
        DEST_AS uchar* cDest = (DEST_AS uchar*)dest; 
        SRC_AS const uchar* cSrc = (SRC_AS const uchar*)src; 
        for (uint i = 0; i < size; i++) 
            cDest[i] = cSrc[i]; 
        return (DEST_AS void*)cDest; 
    }
PTR_MACRO_DEST_SRC(MEMCPY, memcpy)


                                        Produces
             memcpy_g            memcpy_gc           memcpy_lc           memcpy_pc
             memcpy_l            memcpy_gl           memcpy_lg           memcpy_pg
             memcpy_p            memcpy_gp           memcpy_lp           memcpy_pl
No dynamic memory
No malloc()
No free()
What to do...
Yes! dynamic memory
  Create a large buffer of global memory - our “heap”
  Implement our own malloc() and free()
  Create a handle structure - “virtual memory”
  P(T, hnd) macro to get the current pointer address

GPTR(handle) hnd = malloc(sizeof(uint));
GPTR(uint) ptr = P(uint, hnd);
*ptr = 0xdeadbeef;
free(hnd);
Ok, we get the point...
        FYL!
High-level Architecture
       V8                 Data Heap



Esprima Parser            Stack-based
                          Interpreter

                          Host
                          Host
     Host                  GPUs
Data Serializer &
  Marshaller           Garbage Collector



  Device Mgr
High-level Architecture
                    eval(code);
       V8                               Data Heap
                    Build JSON AST

Esprima Parser                          Stack-based
                                        Interpreter

                                        Host
                                        Host
     Host                                GPUs
Data Serializer &
  Marshaller                         Garbage Collector



  Device Mgr
High-level Architecture
                    eval(code);
       V8                                  Data Heap
                    Build JSON AST

Esprima Parser                             Stack-based
                                           Interpreter
                       Serialize AST
                                           Host
                                           Host
     Host           JSON => C Structs       GPUs
Data Serializer &
  Marshaller                            Garbage Collector



  Device Mgr
High-level Architecture
                         eval(code);
       V8                                         Data Heap
                          Build JSON AST

Esprima Parser                                    Stack-based
                                                  Interpreter
                            Serialize AST
                                                  Host
                                                  Host
     Host                JSON => C Structs         GPUs
Data Serializer &
  Marshaller                                   Garbage Collector
                    Ship to GPU to Interpret

  Device Mgr
High-level Architecture
                         eval(code);
       V8                                         Data Heap
                          Build JSON AST

Esprima Parser                                    Stack-based
                                                  Interpreter
                            Serialize AST
                                                  Host
                                                  Host
     Host                JSON => C Structs         GPUs
Data Serializer &
  Marshaller                                   Garbage Collector
                    Ship to GPU to Interpret

  Device Mgr
                          Fetch Result
AST Generation
AST Generation

                                     JSON AST
JavaScript Source
                                    (v8::Object)




                                                   Lateral AST
                    Esprima in V8
                                                   (C structs)
Embed esprima.js

              Resource Generator

$ resgen esprima.js resgen_esprima_js.c
Embed esprima.js

                       resgen_esprima_js.c
const unsigned char resgen_esprima_js[]   = {
    0x2f, 0x2a, 0x0a, 0x20, 0x20, 0x43,   0x6f, 0x70, 0x79, 0x72,
    0x69, 0x67, 0x68, 0x74, 0x20, 0x28,   0x43, 0x29, 0x20, 0x32,
    ...
    0x20, 0x3a, 0x20, 0x2a, 0x2f, 0x0a,   0x0a, 0
};
Embed esprima.js
                          ASTGenerator.cpp
extern const char resgen_esprima_js;

void ASTGenerator::init()
{
    HandleScope scope;
    s_context = Context::New();
    s_context->Enter();
    Handle<Script> script = Script::Compile(String::New(&resgen_esprima_js));
    script->Run();
    s_context->Exit();
    s_initialized = true;
}
Build JSON AST

                    e.g.
ASTGenerator::esprimaParse(
    "var xyz = new Array(10);"
);
Build JSON AST
Handle<Object> ASTGenerator::esprimaParse(const char* javascript)
{
    if (!s_initialized)
        init();


    HandleScope scope;
    s_context->Enter();
    Handle<Object> global = s_context->Global();
    Handle<Object> esprima = Handle<Object>::Cast(global->Get(String::New("esprima")));
    Handle<Function> esprimaParse = Handle<Function>::Cast(esprima-
>Get(String::New("parse")));
    Handle<String> code = String::New(javascript);
    Handle<Object> ast = Handle<Object>::Cast(esprimaParse->Call(esprima, 1,
(Handle<Value>*)&code));


    s_context->Exit();
    return scope.Close(ast);
}
Build JSON AST
{
    "type": "VariableDeclaration",
    "declarations": [
        {
            "type": "VariableDeclarator",
            "id": {
                "type": "Identifier",
                "name": "xyz"
            },
            "init": {
                "type": "NewExpression",
                "callee": {
                    "type": "Identifier",
                    "name": "Array"
                },
                "arguments": [
                    {
                        "type": "Literal",
                        "value": 10
                    }
                ]
            }
        }
    ],
    "kind": "var"
}
Lateral AST structs
typedef struct ast_type_st {         #ifdef __OPENCL_VERSION__
    CL(uint) id;                     #define CL(TYPE) TYPE
    CL(uint) size;                   #else
} ast_type;                          #define CL(TYPE) cl_##TYPE
                                     #endif
typedef struct ast_program_st {
    ast_type type;
    CL(uint) body;
    CL(uint) numBody;
                                      Structs shared between
} ast_program;                           Host and OpenCL

typedef struct ast_identifier_st {
    ast_type type;
    CL(uint) name;
} ast_identifier;
Lateral AST structs

                            v8::Object => ast_type
                                  expanded
ast_type* vd1_1_init_id = (ast_type*)astCreateIdentifier("Array");
ast_type* vd1_1_init_args[1];
vd1_1_init_args[0] = (ast_type*)astCreateNumberLiteral(10);
ast_type* vd1_1_init = (ast_type*)astCreateNewExpression(vd1_1_init_id, vd1_1_init_args, 1);
free(vd1_1_init_id);
for (int i = 0; i < 1; i++)
    free(vd1_1_init_args[i]);
ast_type* vd1_1_id = (ast_type*)astCreateIdentifier("xyz");
ast_type* vd1_decls[1];
vd1_decls[0] = (ast_type*)astCreateVariableDeclarator(vd1_1_id, vd1_1_init);
free(vd1_1_id);
free(vd1_1_init);
ast_type* vd1 = (ast_type*)astCreateVariableDeclaration(vd1_decls, 1, "var");
for (int i = 0; i < 1; i++)
    free(vd1_decls[i]);
Lateral AST structs
                          astCreateIdentifier
ast_identifier* astCreateIdentifier(const char* str) {
    CL(uint) size = sizeof(ast_identifier) + rnd(strlen(str) + 1, 4);
    ast_identifier* ast_id = (ast_identifier*)malloc(size);

    // copy the string
    strcpy((char*)(ast_id + 1), str);

    // fill the struct
    ast_id->type.id = AST_IDENTIFIER;
    ast_id->type.size = size;
    ast_id->name = sizeof(ast_identifier); // offset

    return ast_id;
}
Lateral AST structs
         astCreateIdentifier(“xyz”)
offset      field              value
  0        type.id    AST_IDENTIFIER (0x01)
  4       type.size             16
  8        name             12 (offset)
 12        str[0]               ‘x’
 13        str[1]               ‘y’
 14        str[2]               ‘z’
 15        str[3]              ‘0’
Lateral AST structs
                                  astCreateNewExpression
ast_expression_new* astCreateNewExpression(ast_type* callee, ast_type** arguments, int numArgs) {
    CL(uint) size = sizeof(ast_expression_new) + callee->size;
    for (int i = 0; i < numArgs; i++)
        size += arguments[i]->size;

    ast_expression_new* ast_new = (ast_expression_new*)malloc(size);
    ast_new->type.id = AST_NEW_EXPR;
    ast_new->type.size = size;

    CL(uint) offset = sizeof(ast_expression_new);
    char* dest = (char*)ast_new;

    // copy callee
    memcpy(dest + offset, callee, callee->size);
    ast_new->callee = offset;
    offset += callee->size;

    // copy arguments
    if (numArgs) {
        ast_new->arguments = offset;
        for (int i = 0; i < numArgs; i++) {
            ast_type* arg = arguments[i];
            memcpy(dest + offset, arg, arg->size);
            offset += arg->size;
        }
    } else
        ast_new->arguments = 0;
    ast_new->numArguments = numArgs;

    return ast_new;
}
Lateral AST structs
                 new Array(10)
offset       field                 value
  0         type.id     AST_NEW_EXPR (0x308)
  4        type.size               52
  8         callee             20 (offset)
 12       arguments            40 (offset)
 16      numArguments              1
 20       callee node    ast_identifier (“Array”)
          arguments
 40                      ast_literal_number (10)
             node
Lateral AST structs
Shared across the Host and the OpenCL runtime
    Host writes, Lateral reads
Constructed on Host as contiguous blobs
    Easy to send to GPU: memcpy(gpu, ast, ast->size);
    Fast to send to GPU, single buffer write
    Simple to traverse w/ pointer arithmetic
Stack-based
 Interpreter
Building Blocks
                     JS Type Structs


AST Traverse Stack                       Lateral State


 Call/Exec Stack        Heap           Symbol/Ref Table


  Return Stack                           Scope Stack




AST Traverse Loop                      Interpret Loop
Kernels
#include "state.h"
#include "jsvm/asttraverse.h"
#include "jsvm/interpreter.h"

// Setup VM structures
kernel void lateral_init(GPTR(uchar) lateral_heap) {
    LATERAL_STATE_INIT
}

// Interpret the AST
kernel void lateral(GPTR(uchar) lateral_heap, GPTR(ast_type) lateral_ast) {
    LATERAL_STATE

    ast_push(lateral_ast);
    while (!Q_EMPTY(lateral_state->ast_stack, ast_q) || !Q_EMPTY(lateral_state->call_stack,
call_q)) {
        while (!Q_EMPTY(lateral_state->ast_stack, ast_q))
            traverse();
        if (!Q_EMPTY(lateral_state->call_stack, call_q))
            interpret();
    }
}
Let’s interpret...



 var x = 1 + 2;
var x = 1 + 2;
{
    "type": "VariableDeclaration",            AST   Call   Return
    "declarations": [
        {
            "type": "VariableDeclarator",
            "id": {
                "type": "Identifier",
                "name": "x"
            },
            "init": {
                "type": "BinaryExpression",
                "operator": "+",
                "left": {
                    "type": "Literal",
                    "value": 1
                },
                "right": {
                    "type": "Literal",
                    "value": 2
                }
            }
        }
    ],
    "kind": "var"
}
var x = 1 + 2;
{
    "type": "VariableDeclaration",             AST      Call   Return
    "declarations": [
        {
            "type": "VariableDeclarator",     VarDecl
            "id": {
                "type": "Identifier",
                "name": "x"
            },
            "init": {
                "type": "BinaryExpression",
                "operator": "+",
                "left": {
                     "type": "Literal",
                     "value": 1
                },
                "right": {
                     "type": "Literal",
                     "value": 2
                }
            }
        }
    ],
    "kind": "var"
}
var x = 1 + 2;
{
    "type": "VariableDeclaration",             AST      Call   Return
    "declarations": [
        {
            "type": "VariableDeclarator",     VarDtor
            "id": {
                "type": "Identifier",
                "name": "x"
            },
            "init": {
                "type": "BinaryExpression",
                "operator": "+",
                "left": {
                    "type": "Literal",
                    "value": 1
                },
                "right": {
                    "type": "Literal",
                    "value": 2
                }
            }
        }
    ],
    "kind": "var"
}
var x = 1 + 2;
{
    "type": "VariableDeclaration",            AST       Call     Return
    "declarations": [
        {
            "type": "VariableDeclarator",     Ident    VarDtor
            "id": {
                "type": "Identifier",          Binary
                "name": "x"
            },
            "init": {
                "type": "BinaryExpression",
                "operator": "+",
                "left": {
                    "type": "Literal",
                    "value": 1
                },
                "right": {
                    "type": "Literal",
                    "value": 2
                }
            }
        }
    ],
    "kind": "var"
}
var x = 1 + 2;
{
    "type": "VariableDeclaration",             AST       Call     Return
    "declarations": [
        {
            "type": "VariableDeclarator",      Ident    VarDtor
            "id": {
                "type": "Identifier",          Literal    Binary
            },
                "name": "x"
                                              Literal
            "init": {
                "type": "BinaryExpression",
                "operator": "+",
                "left": {
                    "type": "Literal",
                    "value": 1
                },
                "right": {
                    "type": "Literal",
                    "value": 2
                }
            }
        }
    ],
    "kind": "var"
}
var x = 1 + 2;
{
    "type": "VariableDeclaration",             AST        Call     Return
    "declarations": [
        {
            "type": "VariableDeclarator",      Ident    VarDtor
            "id": {
                "type": "Identifier",          Literal    Binary
            },
                "name": "x"
                                                         Literal
            "init": {
                "type": "BinaryExpression",
                "operator": "+",
                "left": {
                    "type": "Literal",
                    "value": 1
                },
                "right": {
                    "type": "Literal",
                    "value": 2
                }
            }
        }
    ],
    "kind": "var"
}
var x = 1 + 2;
{
    "type": "VariableDeclaration",            AST       Call     Return
    "declarations": [
        {
            "type": "VariableDeclarator",     Ident   VarDtor
            "id": {
                "type": "Identifier",                   Binary
            },
                "name": "x"
                                                       Literal
            "init": {
                "type": "BinaryExpression",
                                                       Literal
                "operator": "+",
                "left": {
                    "type": "Literal",
                    "value": 1
                },
                "right": {
                    "type": "Literal",
                    "value": 2
                }
            }
        }
    ],
    "kind": "var"
}
var x = 1 + 2;
{
    "type": "VariableDeclaration",            AST     Call     Return
    "declarations": [
        {
            "type": "VariableDeclarator",           VarDtor
            "id": {
                "type": "Identifier",                 Binary
            },
                "name": "x"
                                                     Literal
            "init": {
                "type": "BinaryExpression",
                                                     Literal
                "operator": "+",
                "left": {
                                                      Ident
                    "type": "Literal",
                    "value": 1
                },
                "right": {
                    "type": "Literal",
                    "value": 2
                }
            }
        }
    ],
    "kind": "var"
}
var x = 1 + 2;
{
    "type": "VariableDeclaration",            AST     Call     Return
    "declarations": [
        {
            "type": "VariableDeclarator",           VarDtor     “x”
            "id": {
                "type": "Identifier",                 Binary
            },
                "name": "x"
                                                     Literal
            "init": {
                "type": "BinaryExpression",
                                                     Literal
                "operator": "+",
                "left": {
                    "type": "Literal",
                    "value": 1
                },
                "right": {
                    "type": "Literal",
                    "value": 2
                }
            }
        }
    ],
    "kind": "var"
}
var x = 1 + 2;
{
    "type": "VariableDeclaration",            AST     Call     Return
    "declarations": [
        {
            "type": "VariableDeclarator",           VarDtor     “x”
            "id": {
                "type": "Identifier",                 Binary      1
            },
                "name": "x"
                                                     Literal
            "init": {
                "type": "BinaryExpression",
                "operator": "+",
                "left": {
                    "type": "Literal",
                    "value": 1
                },
                "right": {
                    "type": "Literal",
                    "value": 2
                }
            }
        }
    ],
    "kind": "var"
}
var x = 1 + 2;
{
    "type": "VariableDeclaration",            AST    Call     Return
    "declarations": [
        {
            "type": "VariableDeclarator",           VarDtor    “x”
            "id": {
                "type": "Identifier",                 Binary     1
            },
                "name": "x"
                                                                2
            "init": {
                "type": "BinaryExpression",
                "operator": "+",
                "left": {
                    "type": "Literal",
                    "value": 1
                },
                "right": {
                    "type": "Literal",
                    "value": 2
                }
            }
        }
    ],
    "kind": "var"
}
var x = 1 + 2;
{
    "type": "VariableDeclaration",            AST    Call     Return
    "declarations": [
        {
            "type": "VariableDeclarator",           VarDtor    “x”
            "id": {
                "type": "Identifier",                            3
                "name": "x"
            },
            "init": {
                "type": "BinaryExpression",
                "operator": "+",
                "left": {
                    "type": "Literal",
                    "value": 1
                },
                "right": {
                    "type": "Literal",
                    "value": 2
                }
            }
        }
    ],
    "kind": "var"
}
var x = 1 + 2;
{
    "type": "VariableDeclaration",            AST   Call   Return
    "declarations": [
        {
            "type": "VariableDeclarator",
            "id": {
                "type": "Identifier",
                "name": "x"
            },
            "init": {
                "type": "BinaryExpression",
                "operator": "+",
                "left": {
                    "type": "Literal",
                    "value": 1
                },
                "right": {
                    "type": "Literal",
                    "value": 2
                }
            }
        }
    ],
    "kind": "var"
}
Benchmark
Benchmark

                 Small loop of FLOPs
var input = new Array(10);
for (var i = 0; i < input.length; i++) {
    input[i] = Math.pow((i + 1) / 1.23, 3);
}
Execution Time
               Lateral
   GPU CL                CPU CL                      V8
 ATI Radeon 6770m   Intel Core i7 4x2.4Ghz   Intel Core i7 4x2.4Ghz




116.571533ms        0.226007ms               0.090664ms
Execution Time
               Lateral
   GPU CL                CPU CL                      V8
 ATI Radeon 6770m   Intel Core i7 4x2.4Ghz   Intel Core i7 4x2.4Ghz




116.571533ms        0.226007ms               0.090664ms
What went wrong?
Everything
Stack-based AST Interpreter, no optimizations
Heavy global memory access, no optimizations
No data or task parallelism
Stack-based Interpreter
Slow as molasses
Memory hog Eclipse style
Heavy memory access
     “var x = 1 + 2;” == 30 stack hits alone!
     Too much dynamic allocation
No inline optimizations, just following the yellow brick AST
Straight up lazy

Replace with something better!
Bytecode compiler on Host
Bytecode register-based interpreter on Device
Too much global access
   Everything is dynamically allocated to global memory
   Register based interpreter & bytecode compiler can
   make better use of local and private memory
// 11.1207 seconds
size_t tid = get_global_id(0);
c[tid] = a[tid];
while(b[tid] > 0) { // touch global memory on each loop
  b[tid]--; // touch global memory on each loop
  c[tid]++; // touch global memory on each loop       Optimizing memory access
}

// 0.0445558 seconds!! HOLY SHIT!
                                                      yields crazy results
size_t tid = get_global_id(0);
int tmp = a[tid]; // temp private variable
for(int i=b[tid]; i > 0; i--) tmp++; // touch private variables on each loop
c[tid] = tmp; // touch global memory one time
No data or task parallelism
  Everything being interpreted in a single “thread”
  We have hundreds of cores available to us!
  Build in heuristics
         Identify side-effect free statements
         Break into parallel tasks - very magical

                                                    input[0] = Math.pow((0 + 1) / 1.23, 3);
var input = new Array(10);
for (var i = 0; i < input.length; i++) {            input[1] = Math.pow((1 + 1) / 1.23, 3);

}
    input[i] = Math.pow((i + 1) / 1.23, 3);
                                                                        ...
                                                    input[9] = Math.pow((9 + 1) / 1.23, 3);
What’s in store
Acceptable performance on all CL devices
V8/Node extension to launch Lateral tasks
High-level API to perform map-reduce, etc.
Lateral-cluster...mmmmm
Thanks!

  Jarred Nicholls
  @jarrednicholls
jarred@webkit.org

JavaScript on the GPU

  • 1.
    If you don’tget this ref...shame on you
  • 2.
    Jarred Nicholls @jarrednicholls jarred@webkit.org
  • 3.
    Work @ Sencha WebPlatform Team Doing webkitty things...
  • 4.
  • 5.
  • 6.
  • 7.
    What I’ll blabberabout today Why JavaScript on the GPU Running JavaScript on the GPU What’s to come...
  • 8.
  • 9.
    Why JavaScript onthe GPU? Better question: Why a GPU?
  • 10.
    Why JavaScript onthe GPU? Better question: Why a GPU? A: They’re fast! (well, at certain things...)
  • 11.
    GPUs are fastb/c... Totally different paradigm from CPUs Data parallelism vs. Task parallelism Stream processing vs. Sequential processing GPUs can divide-and-conquer Hardware capable of a large number of “threads” e.g. ATI Radeon HD 6770m: 480 stream processing units == 480 cores Typically very high memory bandwidth Many, many GigaFLOPs
  • 12.
    GPUs don’t solveall problems Not all tasks can be accelerated by GPUs Tasks must be parallelizable, i.e.: Side effect free Homogeneous and/or streamable Overall tasks will become limited by Amdahl’s Law
  • 14.
  • 15.
  • 16.
    LateralJS Our Mission To makeJavaScript a first-class citizen on all GPUs and take advantage of hardware accelerated operations & data parallelization.
  • 17.
    Our Options OpenCL Nvidia CUDA AMD, Nvidia, Intel, etc. Nvidia only A shitty version of C99 C++ (C for CUDA) No dynamic memory Dynamic memory No recursion Recursion No function pointers Function pointers Terrible tooling Great dev. tooling Immature (arguably) More mature (arguably)
  • 18.
    Our Options OpenCL Nvidia CUDA AMD, Nvidia, Intel, etc. Nvidia only A shitty version of C99 C++ (C for CUDA) No dynamic memory Dynamic memory No recursion Recursion No function pointers Function pointers Terrible tooling Great dev. tooling Immature (arguably) More mature (arguably)
  • 19.
    Why not aStatic Compiler? We want full JavaScript support Object / prototype Closures Recursion Functions as objects Variable typing Type Inference limitations Reasonably limited to size and complexity of “kernel- esque” functions Not nearly insane enough
  • 21.
    Why an Interpreter? Wewant it all baby - full JavaScript support! Most insane approach Challenging to make it good, but holds a lot of promise
  • 22.
  • 24.
    Oh the agony... Multiplememory spaces - pointer hell No recursion - all inlined functions No standard libc libraries No dynamic memory No standard data structures - apart from vector ops Buggy ass AMD/Nvidia compilers
  • 26.
    Multiple Memory Spaces Inthe order of fastest to slowest: space description very fast private stream processor cache (~64KB) scoped to a single work item fast local ~= L1 cache on CPUs (~64KB) scoped to a single work group slow, by orders of magnitude global ~= system memory over slow bus constant available to all work groups/items all the VRAM on the card (MBs)
  • 27.
    Memory Space PointerHell global uchar* gptr = 0x1000; local uchar* lptr = (local uchar*) gptr; // FAIL! uchar* pptr = (uchar*) gptr; // FAIL! private is implicit 0x1000 global local private 0x1000 points to something different depending on the address space!
  • 28.
    Memory Space PointerHell Pointers must always be fully qualified Macros to help ease the pain #define GPTR(TYPE) global TYPE* #define CPTR(TYPE) constant TYPE* #define LPTR(TYPE) local TYPE* #define PPTR(TYPE) private TYPE*
  • 29.
    No Recursion!?!?!? No call stack All functions are inlined to the kernel function uint factorial(uint n) { if (n <= 1) return 1; else return n * factorial(n - 1); // compile-time error }
  • 30.
    No standard libclibraries memcpy? strcpy? strcmp? etc...
  • 31.
    No standard libclibraries Implement our own #define MEMCPY(NAME, DEST_AS, SRC_AS) DEST_AS void* NAME(DEST_AS void*, SRC_AS const void*, uint); DEST_AS void* NAME(DEST_AS void* dest, SRC_AS const void* src, uint size) { DEST_AS uchar* cDest = (DEST_AS uchar*)dest; SRC_AS const uchar* cSrc = (SRC_AS const uchar*)src; for (uint i = 0; i < size; i++) cDest[i] = cSrc[i]; return (DEST_AS void*)cDest; } PTR_MACRO_DEST_SRC(MEMCPY, memcpy) Produces memcpy_g memcpy_gc memcpy_lc memcpy_pc memcpy_l memcpy_gl memcpy_lg memcpy_pg memcpy_p memcpy_gp memcpy_lp memcpy_pl
  • 32.
    No dynamic memory Nomalloc() No free() What to do...
  • 33.
    Yes! dynamic memory Create a large buffer of global memory - our “heap” Implement our own malloc() and free() Create a handle structure - “virtual memory” P(T, hnd) macro to get the current pointer address GPTR(handle) hnd = malloc(sizeof(uint)); GPTR(uint) ptr = P(uint, hnd); *ptr = 0xdeadbeef; free(hnd);
  • 35.
    Ok, we getthe point... FYL!
  • 36.
    High-level Architecture V8 Data Heap Esprima Parser Stack-based Interpreter Host Host Host GPUs Data Serializer & Marshaller Garbage Collector Device Mgr
  • 37.
    High-level Architecture eval(code); V8 Data Heap Build JSON AST Esprima Parser Stack-based Interpreter Host Host Host GPUs Data Serializer & Marshaller Garbage Collector Device Mgr
  • 38.
    High-level Architecture eval(code); V8 Data Heap Build JSON AST Esprima Parser Stack-based Interpreter Serialize AST Host Host Host JSON => C Structs GPUs Data Serializer & Marshaller Garbage Collector Device Mgr
  • 39.
    High-level Architecture eval(code); V8 Data Heap Build JSON AST Esprima Parser Stack-based Interpreter Serialize AST Host Host Host JSON => C Structs GPUs Data Serializer & Marshaller Garbage Collector Ship to GPU to Interpret Device Mgr
  • 40.
    High-level Architecture eval(code); V8 Data Heap Build JSON AST Esprima Parser Stack-based Interpreter Serialize AST Host Host Host JSON => C Structs GPUs Data Serializer & Marshaller Garbage Collector Ship to GPU to Interpret Device Mgr Fetch Result
  • 41.
  • 42.
    AST Generation JSON AST JavaScript Source (v8::Object) Lateral AST Esprima in V8 (C structs)
  • 43.
    Embed esprima.js Resource Generator $ resgen esprima.js resgen_esprima_js.c
  • 44.
    Embed esprima.js resgen_esprima_js.c const unsigned char resgen_esprima_js[] = { 0x2f, 0x2a, 0x0a, 0x20, 0x20, 0x43, 0x6f, 0x70, 0x79, 0x72, 0x69, 0x67, 0x68, 0x74, 0x20, 0x28, 0x43, 0x29, 0x20, 0x32, ... 0x20, 0x3a, 0x20, 0x2a, 0x2f, 0x0a, 0x0a, 0 };
  • 45.
    Embed esprima.js ASTGenerator.cpp extern const char resgen_esprima_js; void ASTGenerator::init() { HandleScope scope; s_context = Context::New(); s_context->Enter(); Handle<Script> script = Script::Compile(String::New(&resgen_esprima_js)); script->Run(); s_context->Exit(); s_initialized = true; }
  • 46.
    Build JSON AST e.g. ASTGenerator::esprimaParse( "var xyz = new Array(10);" );
  • 47.
    Build JSON AST Handle<Object>ASTGenerator::esprimaParse(const char* javascript) { if (!s_initialized) init(); HandleScope scope; s_context->Enter(); Handle<Object> global = s_context->Global(); Handle<Object> esprima = Handle<Object>::Cast(global->Get(String::New("esprima"))); Handle<Function> esprimaParse = Handle<Function>::Cast(esprima- >Get(String::New("parse"))); Handle<String> code = String::New(javascript); Handle<Object> ast = Handle<Object>::Cast(esprimaParse->Call(esprima, 1, (Handle<Value>*)&code)); s_context->Exit(); return scope.Close(ast); }
  • 48.
    Build JSON AST { "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "xyz" }, "init": { "type": "NewExpression", "callee": { "type": "Identifier", "name": "Array" }, "arguments": [ { "type": "Literal", "value": 10 } ] } } ], "kind": "var" }
  • 49.
    Lateral AST structs typedefstruct ast_type_st { #ifdef __OPENCL_VERSION__ CL(uint) id; #define CL(TYPE) TYPE CL(uint) size; #else } ast_type; #define CL(TYPE) cl_##TYPE #endif typedef struct ast_program_st { ast_type type; CL(uint) body; CL(uint) numBody; Structs shared between } ast_program; Host and OpenCL typedef struct ast_identifier_st { ast_type type; CL(uint) name; } ast_identifier;
  • 50.
    Lateral AST structs v8::Object => ast_type expanded ast_type* vd1_1_init_id = (ast_type*)astCreateIdentifier("Array"); ast_type* vd1_1_init_args[1]; vd1_1_init_args[0] = (ast_type*)astCreateNumberLiteral(10); ast_type* vd1_1_init = (ast_type*)astCreateNewExpression(vd1_1_init_id, vd1_1_init_args, 1); free(vd1_1_init_id); for (int i = 0; i < 1; i++) free(vd1_1_init_args[i]); ast_type* vd1_1_id = (ast_type*)astCreateIdentifier("xyz"); ast_type* vd1_decls[1]; vd1_decls[0] = (ast_type*)astCreateVariableDeclarator(vd1_1_id, vd1_1_init); free(vd1_1_id); free(vd1_1_init); ast_type* vd1 = (ast_type*)astCreateVariableDeclaration(vd1_decls, 1, "var"); for (int i = 0; i < 1; i++) free(vd1_decls[i]);
  • 51.
    Lateral AST structs astCreateIdentifier ast_identifier* astCreateIdentifier(const char* str) { CL(uint) size = sizeof(ast_identifier) + rnd(strlen(str) + 1, 4); ast_identifier* ast_id = (ast_identifier*)malloc(size); // copy the string strcpy((char*)(ast_id + 1), str); // fill the struct ast_id->type.id = AST_IDENTIFIER; ast_id->type.size = size; ast_id->name = sizeof(ast_identifier); // offset return ast_id; }
  • 52.
    Lateral AST structs astCreateIdentifier(“xyz”) offset field value 0 type.id AST_IDENTIFIER (0x01) 4 type.size 16 8 name 12 (offset) 12 str[0] ‘x’ 13 str[1] ‘y’ 14 str[2] ‘z’ 15 str[3] ‘0’
  • 53.
    Lateral AST structs astCreateNewExpression ast_expression_new* astCreateNewExpression(ast_type* callee, ast_type** arguments, int numArgs) { CL(uint) size = sizeof(ast_expression_new) + callee->size; for (int i = 0; i < numArgs; i++) size += arguments[i]->size; ast_expression_new* ast_new = (ast_expression_new*)malloc(size); ast_new->type.id = AST_NEW_EXPR; ast_new->type.size = size; CL(uint) offset = sizeof(ast_expression_new); char* dest = (char*)ast_new; // copy callee memcpy(dest + offset, callee, callee->size); ast_new->callee = offset; offset += callee->size; // copy arguments if (numArgs) { ast_new->arguments = offset; for (int i = 0; i < numArgs; i++) { ast_type* arg = arguments[i]; memcpy(dest + offset, arg, arg->size); offset += arg->size; } } else ast_new->arguments = 0; ast_new->numArguments = numArgs; return ast_new; }
  • 54.
    Lateral AST structs new Array(10) offset field value 0 type.id AST_NEW_EXPR (0x308) 4 type.size 52 8 callee 20 (offset) 12 arguments 40 (offset) 16 numArguments 1 20 callee node ast_identifier (“Array”) arguments 40 ast_literal_number (10) node
  • 55.
    Lateral AST structs Sharedacross the Host and the OpenCL runtime Host writes, Lateral reads Constructed on Host as contiguous blobs Easy to send to GPU: memcpy(gpu, ast, ast->size); Fast to send to GPU, single buffer write Simple to traverse w/ pointer arithmetic
  • 56.
  • 57.
    Building Blocks JS Type Structs AST Traverse Stack Lateral State Call/Exec Stack Heap Symbol/Ref Table Return Stack Scope Stack AST Traverse Loop Interpret Loop
  • 58.
    Kernels #include "state.h" #include "jsvm/asttraverse.h" #include"jsvm/interpreter.h" // Setup VM structures kernel void lateral_init(GPTR(uchar) lateral_heap) { LATERAL_STATE_INIT } // Interpret the AST kernel void lateral(GPTR(uchar) lateral_heap, GPTR(ast_type) lateral_ast) { LATERAL_STATE ast_push(lateral_ast); while (!Q_EMPTY(lateral_state->ast_stack, ast_q) || !Q_EMPTY(lateral_state->call_stack, call_q)) { while (!Q_EMPTY(lateral_state->ast_stack, ast_q)) traverse(); if (!Q_EMPTY(lateral_state->call_stack, call_q)) interpret(); } }
  • 59.
  • 60.
    var x =1 + 2; { "type": "VariableDeclaration", AST Call Return "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var" }
  • 61.
    var x =1 + 2; { "type": "VariableDeclaration", AST Call Return "declarations": [ { "type": "VariableDeclarator", VarDecl "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var" }
  • 62.
    var x =1 + 2; { "type": "VariableDeclaration", AST Call Return "declarations": [ { "type": "VariableDeclarator", VarDtor "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var" }
  • 63.
    var x =1 + 2; { "type": "VariableDeclaration", AST Call Return "declarations": [ { "type": "VariableDeclarator", Ident VarDtor "id": { "type": "Identifier", Binary "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var" }
  • 64.
    var x =1 + 2; { "type": "VariableDeclaration", AST Call Return "declarations": [ { "type": "VariableDeclarator", Ident VarDtor "id": { "type": "Identifier", Literal Binary }, "name": "x" Literal "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var" }
  • 65.
    var x =1 + 2; { "type": "VariableDeclaration", AST Call Return "declarations": [ { "type": "VariableDeclarator", Ident VarDtor "id": { "type": "Identifier", Literal Binary }, "name": "x" Literal "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var" }
  • 66.
    var x =1 + 2; { "type": "VariableDeclaration", AST Call Return "declarations": [ { "type": "VariableDeclarator", Ident VarDtor "id": { "type": "Identifier", Binary }, "name": "x" Literal "init": { "type": "BinaryExpression", Literal "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var" }
  • 67.
    var x =1 + 2; { "type": "VariableDeclaration", AST Call Return "declarations": [ { "type": "VariableDeclarator", VarDtor "id": { "type": "Identifier", Binary }, "name": "x" Literal "init": { "type": "BinaryExpression", Literal "operator": "+", "left": { Ident "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var" }
  • 68.
    var x =1 + 2; { "type": "VariableDeclaration", AST Call Return "declarations": [ { "type": "VariableDeclarator", VarDtor “x” "id": { "type": "Identifier", Binary }, "name": "x" Literal "init": { "type": "BinaryExpression", Literal "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var" }
  • 69.
    var x =1 + 2; { "type": "VariableDeclaration", AST Call Return "declarations": [ { "type": "VariableDeclarator", VarDtor “x” "id": { "type": "Identifier", Binary 1 }, "name": "x" Literal "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var" }
  • 70.
    var x =1 + 2; { "type": "VariableDeclaration", AST Call Return "declarations": [ { "type": "VariableDeclarator", VarDtor “x” "id": { "type": "Identifier", Binary 1 }, "name": "x" 2 "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var" }
  • 71.
    var x =1 + 2; { "type": "VariableDeclaration", AST Call Return "declarations": [ { "type": "VariableDeclarator", VarDtor “x” "id": { "type": "Identifier", 3 "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var" }
  • 72.
    var x =1 + 2; { "type": "VariableDeclaration", AST Call Return "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var" }
  • 73.
  • 74.
    Benchmark Small loop of FLOPs var input = new Array(10); for (var i = 0; i < input.length; i++) { input[i] = Math.pow((i + 1) / 1.23, 3); }
  • 75.
    Execution Time Lateral GPU CL CPU CL V8 ATI Radeon 6770m Intel Core i7 4x2.4Ghz Intel Core i7 4x2.4Ghz 116.571533ms 0.226007ms 0.090664ms
  • 76.
    Execution Time Lateral GPU CL CPU CL V8 ATI Radeon 6770m Intel Core i7 4x2.4Ghz Intel Core i7 4x2.4Ghz 116.571533ms 0.226007ms 0.090664ms
  • 78.
    What went wrong? Everything Stack-basedAST Interpreter, no optimizations Heavy global memory access, no optimizations No data or task parallelism
  • 79.
    Stack-based Interpreter Slow asmolasses Memory hog Eclipse style Heavy memory access “var x = 1 + 2;” == 30 stack hits alone! Too much dynamic allocation No inline optimizations, just following the yellow brick AST Straight up lazy Replace with something better! Bytecode compiler on Host Bytecode register-based interpreter on Device
  • 81.
    Too much globalaccess Everything is dynamically allocated to global memory Register based interpreter & bytecode compiler can make better use of local and private memory // 11.1207 seconds size_t tid = get_global_id(0); c[tid] = a[tid]; while(b[tid] > 0) { // touch global memory on each loop b[tid]--; // touch global memory on each loop c[tid]++; // touch global memory on each loop Optimizing memory access } // 0.0445558 seconds!! HOLY SHIT! yields crazy results size_t tid = get_global_id(0); int tmp = a[tid]; // temp private variable for(int i=b[tid]; i > 0; i--) tmp++; // touch private variables on each loop c[tid] = tmp; // touch global memory one time
  • 82.
    No data ortask parallelism Everything being interpreted in a single “thread” We have hundreds of cores available to us! Build in heuristics Identify side-effect free statements Break into parallel tasks - very magical input[0] = Math.pow((0 + 1) / 1.23, 3); var input = new Array(10); for (var i = 0; i < input.length; i++) { input[1] = Math.pow((1 + 1) / 1.23, 3); } input[i] = Math.pow((i + 1) / 1.23, 3); ... input[9] = Math.pow((9 + 1) / 1.23, 3);
  • 83.
    What’s in store Acceptableperformance on all CL devices V8/Node extension to launch Lateral tasks High-level API to perform map-reduce, etc. Lateral-cluster...mmmmm
  • 84.
    Thanks! JarredNicholls @jarrednicholls jarred@webkit.org