Large-scale array-oriented
                      computing with Python

                              Travis E. Oliphant
                          PyCon Taiwan, June 9, 2012




Friday, June 8, 12
My Roots
                     Images from BYU Mers Lab




Science led to Python
   $\rho_0 (2\pi f)^2 U_i(a, f) = [C_{ijkl}(a, f) U_{k,l}(a, f)]_{,j}$

           Raja Muthupillai


                                 Richard Ehman
                                      1997



                                                      Armando Manduca


Finding derivatives of 5-d data
                                  $\Xi = \nabla \times U$




Scientist at heart




Python origins.                      http://python-history.blogspot.com/2009/01/brief-timeline-of-python.html



            Version        Date
                 0.9.0   Feb. 1991
                 0.9.4   Dec. 1991
                 0.9.6   Apr. 1992
                 0.9.8   Jan. 1993
                 1.0.0   Jan. 1994
                  1.2    Apr. 1995
                  1.4    Oct. 1996
                 1.5.2   Apr. 1999


Brief History

                     Person                                      Package                   Year
                     Jim Fulton                                  Matrix Object in Python   1994
                     Jim Hugunin                                 Numeric                   1995
                     Perry Greenfield, Rick White, Todd Miller   Numarray                  2001
                     Travis Oliphant                             NumPy                     2005



1999 : Early SciPy emerges
     Discussions on the matrix-sig from 1997 to 1999 called for a complete data analysis
     environment: Paul Barrett, Joe Harrington, Perry Greenfield, Paul Dubois, Konrad Hinsen,
     and others. Activity in 1998 led to increased interest in 1999.

     In response, on 15 Jan 1999, I posted to matrix-sig a list of routines I felt needed to be
     present and began wrapping and writing in earnest. On 6 April 1999, I announced I would
     be creating this uber-package, which eventually became SciPy.

     Release                                        Date
     Gaussian quadrature                            5 Jan 1999
     cephes 1.0                                     30 Jan 1999
     sigtools 0.40                                  23 Feb 1999
     Numeric docs                                   March 1999
     cephes 1.1                                     9 Mar 1999
     multipack 0.3                                  13 Apr 1999
     Helper routines                                14 Apr 1999
     multipack 0.6 (leastsq, ode, fsolve, quad)     29 Apr 1999
     sparse plan described                          30 May 1999
     multipack 0.7                                  14 Jun 1999
     SparsePy 0.1                                   5 Nov 1999
     cephes 1.2 (vectorize)                         29 Dec 1999

     Plotting?? Gist, XPLOT, DISLIN, Gnuplot
     Helping with f2py




SciPy 2001
     Travis Oliphant: optimize, sparse, interpolate, integrate, special, signal, stats, fftpack, misc

     Founded in 2001 with Travis Vaught

     Eric Jones: weave, cluster, GA*

     Pearu Peterson: linalg, interpolate, f2py



Community effort
        •    Chuck Harris
        •    Pauli Virtanen
        •    David Cournapeau
        •    Stefan van der Walt
        •    Dag Sverre Seljebotn
        •    Robert Kern
        •    Warren Weckesser
        •    Ralf Gommers
        •    Mark Wiebe
        •    Nathaniel Smith



Why Python for Technical Computing
        • Syntax (it gets out of your way)
        • Over-loadable operators
        • Complex numbers built-in early
        • Just enough language support for arrays
        • “Occasional” programmers can grok it
        • Supports multiple programming styles
        • Expert programmers can also use it effectively
        • Has a simple, extensible implementation
        • General-purpose language --- can build a system
        • Critical mass

What is wrong with Python?
        • Packaging is still not solved well (distribute, pip, and
             distutils2 don’t cut it)
        •    Missing anonymous blocks
        •    The CPython run-time is aged and needs an overhaul
             (GIL, global variables, lack of dynamic compilation
             support)
        •    No approach to language extension except for
             “import hooks” (lightweight DSL need)
        •    The distraction of multiple run-times...
        •    Array-oriented and NumPy not really understood by
             most Python devs.


Putting Science back in Comp Sci
                 • Much of the software stack is for systems
                   programming --- C++, Java, .NET, ObjC, web
                    - Complex numbers?
                    - Vectorized primitives?
                 • Array-oriented programming has been
                   supplanted by Object-oriented programming
                 • Software stack for scientists is not as helpful
                   as it should be
                 • Fortran is still where many scientists end up


Array-Oriented Computing

                     Example 1: Fibonacci Numbers

                           $f_n = f_{n-1} + f_{n-2}$
                           $f_0 = 0$
                           $f_1 = 1$


                     $f = 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, \ldots$




Common Python approaches
                     Recursive    Iterative




            Algorithm matters!!
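
The code shown on this slide did not survive extraction. A minimal sketch of the two approaches named above, assuming the recurrence from Example 1; the function names `fib_recursive` and `fib_iterative` are illustrative and not taken from the talk:

```python
def fib_recursive(n):
    """Naive recursion: subproblems are recomputed, so the number of calls
    grows exponentially with n."""
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)


def fib_iterative(n):
    """Single pass that keeps only the last two values: n additions."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


print([fib_iterative(i) for i in range(10)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

Both functions return the same values; the difference is purely in how much work each one repeats, which is the point of "Algorithm matters".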




Array-oriented approaches
                                     Using Formula

                     Using LFilter
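
The slide's code is missing from the extraction. As a hedged sketch of the two array-oriented approaches it names: the closed-form (Binet) formula evaluated with NumPy, and the recurrence expressed as a recursive filter with `scipy.signal.lfilter`. The function names and the choice of filter coefficients are assumptions made for illustration:

```python
import numpy as np
from scipy.signal import lfilter


def fib_formula(n_terms):
    """Closed-form evaluation for all indices at once.

    Float64 rounding limits this to roughly the first 70 terms.
    """
    n = np.arange(n_terms)
    sqrt5 = np.sqrt(5.0)
    phi = (1.0 + sqrt5) / 2.0
    psi = (1.0 - sqrt5) / 2.0
    return np.rint((phi**n - psi**n) / sqrt5).astype(np.int64)


def fib_lfilter(n_terms):
    """Treat the recurrence as an IIR filter: y[n] = x[n-1] + y[n-1] + y[n-2].

    Feeding a unit impulse produces 0, 1, 1, 2, 3, ...
    Requires n_terms >= 1.
    """
    impulse = np.zeros(n_terms)
    impulse[0] = 1.0
    b = [0.0, 1.0]         # numerator: delays the impulse by one sample
    a = [1.0, -1.0, -1.0]  # denominator: encodes y[n] - y[n-1] - y[n-2]
    return np.rint(lfilter(b, a, impulse)).astype(np.int64)


print(fib_formula(10))
print(fib_lfilter(10))
```

Neither version contains a Python-level loop; the whole sequence is produced by one array expression or one library call.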




NumPy: an Array-Oriented Extension
          • Data: the array object
                     – slicing and shaping
                     – data-type map to Bytes

          • Fast Math:
                     – vectorization
                     – broadcasting
                     – aggregations
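
A short sketch of each item in the list above, using only standard NumPy calls; the array contents are arbitrary illustration values:

```python
import numpy as np

a = np.arange(12, dtype=np.int32).reshape(3, 4)  # shaping: 3 rows, 4 columns

# Slicing returns a view onto the same bytes, not a copy.
col = a[:, 1]
col[0] = 100
assert a[0, 1] == 100

# The data-type describes how each element maps to bytes.
print(a.dtype.itemsize, a.strides)  # 4 bytes per element; (16, 4) byte strides

# Vectorization: one expression over every element, no Python loop.
squared = a * a

# Broadcasting: a (3, 1) column combines with a (4,) row to give (3, 4).
table = np.arange(3).reshape(3, 1) * 10 + np.arange(4)

# Aggregation: reduce along an axis.
row_sums = table.sum(axis=1)
print(row_sums)  # [ 6 46 86]
```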




NumPy Array




                     shape




Zen of NumPy
          •    strided is better than scattered
          •    contiguous is better than strided
          •    descriptive is better than imperative
          •    array-oriented is better than object-oriented
          •    broadcasting is a great idea
          •    vectorized is better than an explicit loop
          •    unless it’s too complicated --- then use Cython/Numba
          •    think in higher dimensions




More NumPy Demonstration




Conway’s game of Life
               • Dead cell with exactly 3 live neighbors
                 will come to life
               • A live cell with 2 or 3 neighbors will
                 survive
               • With too few or too many neighbors, the
                 cell dies




Interesting Patterns emerge




APL: the first array-oriented language
     • Appeared in 1964
     • Originated by Ken Iverson
     • Direct descendants (J, K, Matlab) are still
       used heavily and people pay a lot of money
       for them
     • NumPy is a descendant

     [Lineage diagram: APL, J, K, Matlab, Numeric, NumPy]




Conway’s Game of Life
     APL


     NumPy
                      Initialization


        Update Step
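
The APL and NumPy listings on this slide were images and are not in the extracted text. One possible NumPy version of the initialization and update step, assuming the three rules stated earlier and a wrap-around (toroidal) grid; the wrap-around boundary and the name `life_step` are assumptions, not details from the talk:

```python
import numpy as np


def life_step(grid):
    """One generation on a toroidal (wrap-around) grid of 0/1 cells."""
    # Sum the eight shifted copies of the grid to count live neighbors.
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    born = (grid == 0) & (neighbors == 3)
    survive = (grid == 1) & ((neighbors == 2) | (neighbors == 3))
    return (born | survive).astype(grid.dtype)


# Initialization: a "blinker", three cells in a row, which oscillates with period 2.
grid = np.zeros((5, 5), dtype=np.uint8)
grid[2, 1:4] = 1

# Update step: the horizontal bar becomes a vertical bar.
nxt = life_step(grid)
print(nxt)
```

The update touches every cell at once through whole-array comparisons, with no explicit loop over rows and columns.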




Demo

             Python Version




            Array-oriented NumPy Version




Memory using Object-oriented

                     [Diagram: six separate objects, each holding Attr1, Attr2, and Attr3,
                      scattered across memory]



Array-oriented (Table) approach
                               Attr1   Attr2   Attr3
                     Object1
                     Object2
                     Object3
                     Object4
                     Object5
                     Object6
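
A sketch of the table layout using a NumPy structured array. The field names mirror the column headers above; the field types and values are arbitrary choices made for illustration:

```python
import numpy as np

# One contiguous block of memory; each row is a record, each field a column.
record = np.dtype([("attr1", np.int32), ("attr2", np.float64), ("attr3", np.int32)])
table = np.zeros(6, dtype=record)

table["attr1"] = np.arange(6)
table["attr2"] = np.linspace(0.0, 1.0, 6)

# A whole column is a strided view over the same buffer.
total = table["attr1"].sum()
print(table.itemsize, table["attr1"].strides)  # 16 bytes per record, stride (16,)
```

Selecting a column yields a strided view instead of six separate attribute lookups, which is what lets whole-column operations run without a Python loop.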



Benefits of Array-oriented

         • Many technical problems are naturally array-oriented
           (easy to vectorize)
         • Algorithms can be expressed at a high level
         • These algorithms can be parallelized more
           simply (quite often much information is lost in
           the translation to typical “compiled” languages)
         • Array-oriented algorithms map to modern
           hardware caches and pipelines.




We need more focus on
                     compiled array-oriented
                     languages with fast compilers!




What is good about NumPy?
           • Array-oriented
           • Extensive Dtype System (including structures)
           • C-API
           • Simple to understand data-structure
           • Memory mapping
           • Syntax support from Python
           • Large community of users
           • Broadcasting
           • Easy to interface C/C++/Fortran code


What is wrong with NumPy?
             • Dtype system is difficult to extend
             • Immediate mode creates huge temporaries
               (spawning Numexpr)
             • “Almost” an in-memory database comparable
               to SQLite (missing indexes)
             • Integration with sparse arrays
             • Lots of un-optimized parts
             • Minimal support for multi-core / GPU
             • Code-base is organic and hard to extend
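
To make the "huge temporaries" point concrete, here is a small sketch comparing an immediate-mode expression with a version that reuses buffers through the ufunc `out=` argument. The array size and the expression are arbitrary illustration choices:

```python
import numpy as np

n = 100_000
a = np.ones(n)
b = np.ones(n)

# Immediate mode: 2 * a and 3 * b are each materialized as full-size
# temporary arrays before the addition produces the result.
result = 2 * a + 3 * b

# Writing into preallocated buffers avoids fresh allocations per operation,
# at the cost of readability.
out = np.empty(n)
tmp = np.empty(n)
np.multiply(a, 2, out=out)
np.multiply(b, 3, out=tmp)
np.add(out, tmp, out=out)

print(np.array_equal(result, out))
```

Tools such as Numexpr automate this kind of rewrite by evaluating the whole expression in blocks.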


Improvements needed
        • NDArray improvements
          • Indexes (esp. for Structured arrays)
          • SQL front-end
          • Multi-level, hierarchical labels
          • selection via mappings (labeled arrays)
          • Memory spaces (array made up of regions)
          • Distributed arrays (global array)
          • Compressed arrays
          • Standard distributed persistence
          • fancy indexing as view and optimizations
          • streaming arrays


Improvements needed
          • Dtype improvements
            • Enumerated types (including dynamic enumeration)
            • Derived fields
            • Specification as a class (or JSON)
            • Pointer dtype (e.g. C++ object, or varchar)
            • Finishing datetime
            • Missing data with bit-patterns
            • Parameterized field names




Example of Object-defined Dtype

                     @np.dtype
                     class Stock(np.DType):
                           symbol = np.Str(4)
                           open = np.Int(2)
                           close = np.Int(2)
                           high = np.Int(2)
                           low = np.Int(2)
                           @np.Int(2)
                           def mid(self):
                               return (self.high + self.low) / 2.0
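
The class-based syntax above is a proposal; to the best of my knowledge, names such as `np.DType`, `np.Str`, and `np.Int` are not part of released NumPy. For comparison, a rough equivalent with the structured dtypes NumPy already provides might look like the sketch below. The derived `mid` field becomes a hypothetical helper function instead of part of the dtype, and the symbols and prices are made-up illustration values:

```python
import numpy as np

# 4-byte string symbol plus four 2-byte integers, mirroring the fields above.
stock = np.dtype([
    ("symbol", "S4"),
    ("open", np.int16),
    ("close", np.int16),
    ("high", np.int16),
    ("low", np.int16),
])

# Illustrative values only, not real quotes.
quotes = np.array(
    [(b"ABC", 570, 580, 582, 568), (b"XYZ", 580, 578, 590, 574)],
    dtype=stock,
)


def mid(records):
    """Hypothetical helper: derived field computed on demand for every record."""
    return (records["high"] + records["low"]) / 2.0


print(mid(quotes))  # [575. 582.]
```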




Improvements needed
          • Ufunc improvements
            • Generalized ufuncs support more than just
              contiguous arrays
            • Specification of ufuncs in Python
            • Move most dtype “array functions” to ufuncs
            • Unify error-handling for all computations
            • Allow lazy-evaluation and remote computation ---
              streaming and generator data
            • Structured and string dtype ufuncs
            • Multi-core and GPU optimized ufuncs
            • Group-by reduction
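
Two items in this list already have partial answers in NumPy, which helps show what the proposed improvements would go beyond. The sketch below wraps a Python scalar function as a ufunc with `np.frompyfunc` and performs a group-by sum with `np.add.at`; the kernel `clipped_diff` and the sample data are invented for illustration:

```python
import numpy as np


def clipped_diff(x, y):
    """Scalar kernel written in plain Python."""
    return max(x - y, 0)


# np.frompyfunc wraps a Python scalar function as a ufunc (2 inputs, 1 output).
# It broadcasts like any ufunc but returns object arrays and runs at Python speed.
uclipped = np.frompyfunc(clipped_diff, 2, 1)
res = uclipped(np.array([5, 1, 7]), np.array([2, 4, 7])).astype(np.int64)
print(res)  # [3 0 0]

# Group-by reduction: unbuffered accumulation of values into per-key slots.
keys = np.array([0, 1, 0, 2, 1])
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
sums = np.zeros(keys.max() + 1)
np.add.at(sums, keys, values)
print(sums)  # [40. 70. 40.]
```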



More Improvements needed
          • Miscellaneous improvements
            • ABI-management
            • Eventual Move to library (NDLib)?
            • Integration with LLVM
            • Sparse dimensions
            • Remote computation
            • Fast I/O for CSV and Excel
            • Out-of-core calculations
            • Delayed-mode execution




New Project



               NumPy
                             Blaze
                       Next Generation NumPy
                            Out-of-core
                         Distributed Tables


Blaze Main Features
           • New ndarray with multiple memory segments
           • Distributed ndtable which can span the world
           • Fast, out-of-core algorithms for all functions
           • Delayed-mode execution: expressions build up
             graph which gets executed where the data is
           • Built-in Indexes (beyond searchsorted)
           • Built-in labels (data-array)
           • Sparse dimensions (defined by attributes or
             elements of another dimension)
           • Direct adapters to all data (move code to data)


Delayed execution
                            Demo
                           Code Only
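
The demo code is not part of the extracted slides. As a toy illustration of the general idea of delayed execution (expressions build a graph that is evaluated later), here is a minimal sketch. The `Lazy` class and `leaf` helper are hypothetical and are not Blaze's API:

```python
import numpy as np


class Lazy:
    """Hypothetical node in an expression graph; nothing is computed until .eval()."""

    def __init__(self, op, *args):
        self.op = op
        self.args = args

    def __add__(self, other):
        return Lazy(np.add, self, other)

    def __mul__(self, other):
        return Lazy(np.multiply, self, other)

    def eval(self):
        # Leaves carry their data directly; interior nodes evaluate children first.
        if self.op is None:
            return self.args[0]
        vals = [a.eval() if isinstance(a, Lazy) else a for a in self.args]
        return self.op(*vals)


def leaf(data):
    """Wrap concrete data as a graph leaf."""
    return Lazy(None, np.asarray(data))


x = leaf([1.0, 2.0, 3.0])
y = leaf([10.0, 20.0, 30.0])
expr = x * 2 + y      # builds a graph; no arithmetic has happened yet
result = expr.eval()  # array([12., 24., 36.])
print(result)
```

Because the graph exists before any arithmetic runs, a scheduler could in principle inspect it, fuse operations, or ship it to where the data lives.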




Dimensions defined by Attributes
   dim1
                     Day   Month   Year   High   Low

                     15      3     2012    30    20

                     16      3     2012    35    25

                     20      3     2012    40    30

                     21      3     2012    41    29



Outline
                                    NDTable


                         NDArray              Domain

                 GFunc             DType

                          Bytes




NDTable (Example)

      Each Partition:
      • Remote
      • Expression
      • NDArray

                        Proc0   Proc1   Proc2   Proc3
                        Proc0   Proc1   Proc2   Proc3
                        Proc0   Proc1   Proc2   Proc3
                        Proc4   Proc4   Proc4   Proc4




Data URLs

            • Variables in script are global addresses (DATA
                 URLs). All the world’s data you can see via web
                 can be used as part of an algorithm by
                 referencing it as a part of an array.
                 • Dynamically interpret bytes as data-type
                 • Scheduler will push code based on data-type
                   to the data instead of pulling data to the code.




Overview
                     [Diagram: Main Script sends Code to several Processing Nodes]



NDArray

                • Local ndarray (NumPy++)
                • Multiple byte-buffers (streaming or random
                  access)
                • Variable-length arrays
                • All kinds of data-types (everything...)
                • Multiple patterns of memory access possible
                  (Z-order, Fortran-order, C-order)
                • Sparse dimensions


GFunc
               • Generalized Function
               • All NumPy functions
                 • element-by-element
                 • linear algebra
                 • manipulation
                 • Fourier Transform
               • Iteration and Dispatch to low-level kernels
               • Kernels can be written in anything that builds a
                     C-like interface



Early Timeline

                         Date           Milestone

                       July 2012     Pre-alpha release


                     December 2012   Early Beta Release


                       June 2013        Version 1.0




PyData


                      All computing modules known to work with
                     Blaze will be placed under PyData umbrella of
                            projects over the coming years.




Introducing Numba
                     (lots of kernels to write)




NumPy Users

              • Want to be able to write Python to get fast
                     code that works on arrays and scalars
              •      Need access to a boat-load of C-extensions
                     (NumPy is just the beginning)


                              PyPy doesn’t cut it for us!




NumPy Runtime
                     • Ufuncs
                     • Generalized UFuncs
                     • Window Kernel Funcs
                     • Function-based Indexing
                     • Memory Filters
                     • I/O Filters
                     • Reduction Filters
                     • Computed Columns

                     Python Function -> Dynamic Compilation -> function pointer

SciPy needs a Python compiler

                     optimize                   integrate


                     special                       ode



                     writing more of SciPy at high-level




Numba -- a Python compiler

                • Replays byte-code on a stack with simple type-
                  inference
                • Translates to LLVM (using LLVM-py)
                • Uses LLVM for code-gen
                • Resulting C-level function-pointer can be
                  inserted into NumPy run-time
                • Understands NumPy arrays
                • Is NumPy / SciPy aware


NumPy + Mamba = Numba
                     Python Function                            Machine Code


                                              LLVM-PY

                                              LLVM 3.1
                           ISPC      OpenCL    OpenMP    CUDA     CLANG

                             Intel       AMD        Nvidia      Apple



Examples
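
The example listings on this slide were images and are not in the extracted text. The sketch below is not the talk's example. It uses the `njit` decorator from current Numba releases, which differs from the 2012 interface, and it falls back to plain Python when Numba is not installed. The function `sum_of_squares` is invented for illustration:

```python
import numpy as np

try:
    from numba import njit  # compiles the decorated function through LLVM
except ImportError:
    def njit(func):
        """Fallback so the sketch still runs, uncompiled, without Numba."""
        return func


@njit
def sum_of_squares(arr):
    # An explicit loop: slow in plain CPython, compiled to machine code by Numba.
    total = 0.0
    for i in range(arr.shape[0]):
        total += arr[i] * arr[i]
    return total


x = np.arange(1000, dtype=np.float64)
print(sum_of_squares(x))
```

The loop is written as ordinary Python over a NumPy array, which is the style of kernel the preceding slides argue should be compilable.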




Software Stack Future?
                           Plateaus of Code re-use + DSLs
                     SQL                                R
                               TDPL                               Matlab


                                       Python


                                OBJC               C
                     FORTRAN                                C++



                                       LLVM



How to pay for all this?




Dual strategy




                       Blaze


NumFOCUS
    Num(Py) Foundation for Open Code for Usable Science




NumFOCUS

           • Mission
             • To initiate and support educational programs
               furthering the use of open source software in
               science.
             • To promote the use of high-level languages and
               open source in science, engineering, and math
               research
             • To encourage reproducible scientific research
             • To provide infrastructure and support for open
               source projects for technical computing



NumFOCUS
              • Activities
                • Sponsor sprints and conferences
                • Provide scholarships and grants for people using
                  these tools
                • Pay for documentation development and basic
                  course development
                • Fund continuous integration and build systems
                • Work with domain-specific organizations
                • Raise funds from industries using Python and
                  NumPy



NumFOCUS

             Core Projects



                     NumPy     SciPy         IPython      Matplotlib

            Other Projects (seeking more --- need representatives)


                                    Scikits Image




NumFOCUS

                 • Directors
                   • Perry Greenfield
                   • John Hunter
                   • Jarrod Millman
                   • Travis Oliphant
                   • Fernando Perez
                 • Members
                   • Basically people who donate for now. In time, a
                     body that elects directors.



                     •   Large-scale data analysis products
                     •   Python and NumPy training
                     •   NumPy support and consulting
                     •   Rich-client or web user-interfaces
                     •   Blaze and PyData Development



Friday, June 8, 12

Large-scale Array-oriented Computing with Python

  • 1.
    Large-scale array-oriented computing with Python Travis E. Oliphant PyCon Taiwan, June 9, 2012 Friday, June 8, 12
  • 2.
  • 3.
    My Roots Images from BYU Mers Lab Friday, June 8, 12
  • 4.
    Science led toPython 2 ⇢0 (2⇡f ) Ui (a, f ) = [Cijkl (a, f ) Uk,l (a, f )],j Raja Muthupillai Richard Ehman 1997 Armando Manduca Friday, June 8, 12
  • 5.
    Finding derivatives of5-d data ⌅=r⇥U Friday, June 8, 12
  • 6.
  • 7.
    Python origins. http://python-history.blogspot.com/2009/01/brief-timeline-of-python.html Version Date 0.9.0 Feb. 1991 0.9.4 Dec. 1991 0.9.6 Apr. 1992 0.9.8 Jan. 1993 1.0.0 Jan. 1994 1.2 Apr. 1995 1.4 Oct. 1996 1.5.2 Apr. 1999 Friday, June 8, 12
  • 8.
    Python origins. http://python-history.blogspot.com/2009/01/brief-timeline-of-python.html Version Date 0.9.0 Feb. 1991 0.9.4 Dec. 1991 0.9.6 Apr. 1992 0.9.8 Jan. 1993 1.0.0 Jan. 1994 1.2 Apr. 1995 1.4 Oct. 1996 1.5.2 Apr. 1999 Friday, June 8, 12
  • 9.
    Brief History Person Package Year Matrix Object Jim Fulton 1994 in Python Jim Hugunin Numeric 1995 Perry Greenfield, Rick White, Todd Miller Numarray 2001 Travis Oliphant NumPy 2005 Friday, June 8, 12
  • 10.
    1999 : EarlySciPy emerges Discussions on the matrix-sig from 1997 to 1999 wanting a complete data analysis environment: Paul Barrett, Joe Harrington, Perry Greenfield, Paul Dubois, Konrad Hinsen, and others. Activity in 1998, led to increased interest in 1999. In response on 15 Jan, 1999, I posted to matrix-sig a list of routines I felt needed to be present and began wrapping / writing in earnest. On 6 April 1999, I announced I would be creating this uber-package which eventually became SciPy Gaussian quadrature 5 Jan 1999 cephes 1.0 30 Jan 1999 Plotting?? sigtools 0.40 23 Feb 1999 Numeric docs cephes 1.1 March 1999 9 Mar 1999 Gist multipack 0.3 13 Apr 1999 XPLOT Helper routines 14 Apr 1999 DISLIN multipack 0.6 (leastsq, ode, fsolve, quad) 29 Apr 1999 Gnuplot sparse plan described 30 May 1999 multipack 0.7 SparsePy 0.1 14 Jun 1999 5 Nov 1999 Helping with f2py cephes 1.2 (vectorize) 29 Dec 1999 Friday, June 8, 12
  • 11.
    SciPy 2001 Travis Oliphant optimize sparse interpolate integrate special signal stats Founded in 2001 with Travis Vaught fftpack misc Eric Jones weave cluster Pearu Peterson GA* linalg interpolate f2py Friday, June 8, 12
  • 12.
    Community effort • Chuck Harris • Pauli Virtanen • David Cournapeau • Stefan van der Walt • Dag Sverre Seljebotn • Robert Kern • Warren Weckesser • Ralf Gommers • Mark Wiebe • Nathaniel Smith Friday, June 8, 12
  • 13.
    Why Python forTechnical Computing • Syntax (it gets out of your way) • Over-loadable operators • Complex numbers built-in early • Just enough language support for arrays • “Occasional” programmers can grok it • Supports multiple programming styles • Expert programmers can also use it effectively • Has a simple, extensible implementation • General-purpose language --- can build a system • Critical mass Friday, June 8, 12
  • 14.
    What is wrongwith Python? • Packaging is still not solved well (distribute, pip, and distutils2 don’t cut it) • Missing anonymous blocks • The CPython run-time is aged and needs an overhaul (GIL, global variables, lack of dynamic compilation support) • No approach to language extension except for “import hooks” (lightweight DSL need) • The distraction of multiple run-times... • Array-oriented and NumPy not really understood by most Python devs. Friday, June 8, 12
  • 15.
    Putting Science backin Comp Sci • Much of the software stack is for systems programming --- C++, Java, .NET, ObjC, web - Complex numbers? - Vectorized primitives? • Array-oriented programming has been supplanted by Object-oriented programming • Software stack for scientists is not as helpful as it should be • Fortran is still where many scientists end up Friday, June 8, 12
  • 16.
    Array-Oriented Computing Example1: Fibonacci Numbers fn = fn 1 + fn 2 f0 = 0 f1 = 1 f = 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, . . . Friday, June 8, 12
  • 17.
    Common Python approaches Recursive Iterative Algorithm matters!! Friday, June 8, 12
  • 18.
    Array-oriented approaches Using Formula Using LFilter Friday, June 8, 12
  • 19.
  • 20.
    NumPy: an Array-OrientedExtension • Data: the array object – slicing and shaping – data-type map to Bytes • Fast Math: – vectorization – broadcasting – aggregations Friday, June 8, 12
  • 21.
    NumPy Array shape Friday, June 8, 12
  • 22.
    Zen of NumPy • strided is better than scattered • contiguous is better than strided • descriptive is better than imperative • array-oriented is better than object-oriented • broadcasting is a great idea • vectorized is better than an explicit loop • unless it’s too complicated --- then use Cython/Numba • think in higher dimensions Friday, June 8, 12
  • 23.
  • 24.
    Conway’s game ofLife • Dead cell with exactly 3 live neighbors will come to life • A live cell with 2 or 3 neighbors will survive • With too few or too many neighbors, the cell dies Friday, June 8, 12
  • 25.
  • 26.
    APL : thefirst array-oriented language • Appeared in 1964 • Originated by Ken Iverson • Direct descendants (J, K, Matlab) are still used heavily and people pay a lot of money for them APL • NumPy is a descendent J K Matlab Numeric NumPy Friday, June 8, 12
  • 27.
    Conway’s Game ofLife APL NumPy Initialization Update Step Friday, June 8, 12
  • 28.
    Demo Python Version Array-oriented NumPy Version Friday, June 8, 12
  • 29.
    Memory using Object-oriented Object Object Object Attr1 Attr1 Attr1 Attr2 Attr2 Attr2 Attr3 Attr3 Attr3 Object Attr1 Object Attr2 Attr1 Object Attr3 Attr2 Attr1 Attr3 Attr2 Attr3 Friday, June 8, 12
  • 30.
    Array-oriented (Table) approach Attr1 Attr2 Attr3 Object1 Object2 Object3 Object4 Object5 Object6 Friday, June 8, 12
  • 31.
    Benefits of Array-oriented • Many technical problems are naturally array- oriented (easy to vectorize) • Algorithms can be expressed at a high-level • These algorithms can be parallelized more simply (quite often much information is lost in the translation to typical “compiled” languages) • Array-oriented algorithms map to modern hard-ware caches and pipelines. Friday, June 8, 12
  • 32.
    We need morefocus on complied array-oriented languages with fast compilers! Friday, June 8, 12
  • 33.
    What is goodabout NumPy? • Array-oriented • Extensive Dtype System (including structures) • C-API • Simple to understand data-structure • Memory mapping • Syntax support from Python • Large community of users • Broadcasting • Easy to interface C/C++/Fortran code Friday, June 8, 12
  • 34.
    What is wrongwith NumPy • Dtype system is difficult to extend • Immediate mode creates huge temporaries (spawning Numexpr) • “Almost” an in-memory data-base comparable to SQL-lite (missing indexes) • Integration with sparse arrays • Lots of un-optimized parts • Minimal support for multi-core / GPU • Code-base is organic and hard to extend Friday, June 8, 12
  • 35.
    Improvements needed • NDArray improvements • Indexes (esp. for Structured arrays) • SQL front-end • Multi-level, hierarchical labels • selection via mappings (labeled arrays) • Memory spaces (array made up of regions) • Distributed arrays (global array) • Compressed arrays • Standard distributed persistance • fancy indexing as view and optimizations • streaming arrays Friday, June 8, 12
  • 36.
    Improvements needed • Dtype improvements • Enumerated types (including dynamic enumeration) • Derived fields • Specification as a class (or JSON) • Pointer dtype (i.e. C++ object, or varchar) • Finishing datetime • Missing data with bit-patterns • Parameterized field names Friday, June 8, 12
  • 37.
    Example of Object-definedDtype @np.dtype class Stock(np.DType): symbol = np.Str(4) open = np.Int(2) close = np.Int(2) high = np.Int(2) low = np.Int(2) @np.Int(2) def mid(self): return (self.high + self.low) / 2.0 Friday, June 8, 12
  • 38.
    Improvements needed
    • Ufunc improvements
      • Generalized ufuncs support more than just contiguous arrays
      • Specification of ufuncs in Python
      • Move most dtype “array functions” to ufuncs
      • Unify error-handling for all computations
      • Allow lazy-evaluation and remote computation --- streaming and generator data
      • Structured and string dtype ufuncs
      • Multi-core and GPU optimized ufuncs
      • Group-by reduction
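Two of these items have partial answers in current NumPy, sketched below. `np.frompyfunc` wraps a Python callable as a ufunc (slow, and it returns object arrays), and `np.add.at` can emulate a group-by sum over integer labels. The function name and data are illustrative only.

```python
import numpy as np

def clip_negative(x):
    """Scalar kernel written in plain Python."""
    return x if x > 0 else 0

# Wrap the callable as a ufunc with 1 input and 1 output.
clip_ufunc = np.frompyfunc(clip_negative, 1, 1)
clipped = clip_ufunc(np.array([-2, 3, -1, 5])).astype(np.int64)

# Group-by reduction by hand: sum values per integer group label.
labels = np.array([0, 1, 0, 2, 1])
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
group_sums = np.zeros(labels.max() + 1)
np.add.at(group_sums, labels, values)

print(clipped)
print(group_sums)
```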
  • 39.
    More Improvements needed
    • Miscellaneous improvements
      • ABI-management
      • Eventual Move to library (NDLib)?
      • Integration with LLVM
      • Sparse dimensions
      • Remote computation
      • Fast I/O for CSV and Excel
      • Out-of-core calculations
      • Delayed-mode execution
  • 40.
    New Project NumPy Blaze Next Generation NumPy Out-of-core Distributed Tables
  • 41.
    Blaze Main Features
    • New ndarray with multiple memory segments
    • Distributed ndtable which can span the world
    • Fast, out-of-core algorithms for all functions
    • Delayed-mode execution: expressions build up graph which gets executed where the data is
    • Built-in Indexes (beyond searchsorted)
    • Built-in labels (data-array)
    • Sparse dimensions (defined by attributes or elements of another dimension)
    • Direct adapters to all data (move code to data)
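The delayed-mode idea, where an expression builds a graph that is evaluated later, can be sketched in plain Python. The classes below (`Expr`, `Leaf`, `BinOp`) are hypothetical names invented for this example. They are not Blaze's API, and they evaluate locally rather than "where the data is".

```python
import operator

class Expr:
    """Node in a deferred expression graph (illustrative only)."""
    def __add__(self, other):
        return BinOp(operator.add, self, wrap(other))

    def __mul__(self, other):
        return BinOp(operator.mul, self, wrap(other))

class Leaf(Expr):
    """Holds a concrete value."""
    def __init__(self, value):
        self.value = value

    def evaluate(self):
        return self.value

class BinOp(Expr):
    """Records an operation and its operands without running it."""
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def evaluate(self):
        return self.op(self.left.evaluate(), self.right.evaluate())

def wrap(obj):
    """Turn plain values into Leaf nodes."""
    return obj if isinstance(obj, Expr) else Leaf(obj)

x = Leaf(3)
y = Leaf(4)
graph = x * y + 2          # builds a graph; nothing is computed yet
result = graph.evaluate()  # the computation happens here
print(result)
```

Because the whole expression is available before anything runs, a scheduler could fuse operations, skip temporaries, or ship the graph to another machine.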
  • 42.
    Delayed execution Demo Code Only
  • 43.
    Dimensions defined by Attributes

    dim1
    Day   Month   Year   High   Low
    15    3       2012   30     20
    16    3       2012   35     25
    20    3       2012   40     30
    21    3       2012   41     29
  • 44.
    Outline NDTable NDArray Domain GFunc DType Bytes
  • 45.
    NDTable (Example)
    [Figure: a table split into partitions, each labeled with the process that holds it (Proc0 to Proc4)]
    Each Partition:
    • Remote
    • Expression
    • NDArray
  • 46.
    Data URLs
    • Variables in script are global addresses (DATA URLs). All the world’s data you can see via the web can be used as part of an algorithm by referencing it as part of an array.
    • Dynamically interpret bytes as data-type
    • Scheduler will push code based on data-type to the data instead of pulling data to the code.
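Interpreting raw bytes as a data-type is already possible locally with `np.frombuffer`, which gives a feel for the second bullet. In this sketch the byte string stands in for data fetched from some remote address; the record layout and values are invented for the example.

```python
import numpy as np

record_dtype = np.dtype([("day", "u1"), ("high", "<i2"), ("low", "<i2")])

# Pretend these bytes arrived from a file or a network resource.
original = np.array([(15, 30, 20), (16, 35, 25)], dtype=record_dtype)
raw = original.tobytes()

# The same buffer viewed as records and as plain bytes, without copying.
as_records = np.frombuffer(raw, dtype=record_dtype)
as_bytes = np.frombuffer(raw, dtype=np.uint8)

print(as_records["high"])
print(as_bytes.shape)
```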
  • 47.
    Overview
    [Figure: a Main Script sends Code to several Processing Nodes]
  • 48.
    NDArray
    • Local ndarray (NumPy++)
    • Multiple byte-buffers (streaming or random access)
    • Variable-length arrays
    • All kinds of data-types (everything...)
    • Multiple patterns of memory access possible (Z-order, Fortran-order, C-order)
    • Sparse dimensions
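Current NumPy already supports two of the memory access patterns named above, C-order and Fortran-order. The sketch below shows that both layouts hold the same logical array while differing in strides and in the order of elements in memory. Z-order is not covered; NumPy has no built-in layout for it.

```python
import numpy as np

c_order = np.arange(6, dtype=np.int32).reshape(2, 3)  # rows contiguous
f_order = np.asfortranarray(c_order)                  # columns contiguous

# Same logical contents, different byte steps per dimension.
print(c_order.strides, f_order.strides)

# order="K" reads elements in the order they sit in memory.
print(c_order.ravel(order="K"))
print(f_order.ravel(order="K"))
```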
  • 49.
    GFunc
    • Generalized Function
    • All NumPy functions
      • element-by-element
      • linear algebra
      • manipulation
      • Fourier Transform
    • Iteration and Dispatch to low-level kernels
    • Kernels can be written in anything that builds a C-like interface
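"Iteration and dispatch to low-level kernels" can be sketched as a registry of per-dtype kernels plus a driver that walks the array in chunks. Everything here (`KERNELS`, `kernel`, `gfunc_double`) is a hypothetical illustration for 1-D arrays, not the GFunc design itself. The kernels are Python functions standing in for compiled ones.

```python
import numpy as np

KERNELS = {}  # maps a dtype to the kernel that handles it

def kernel(dtype):
    """Decorator that registers a kernel for one dtype."""
    def register(func):
        KERNELS[np.dtype(dtype)] = func
        return func
    return register

@kernel(np.float64)
def double_f8(block, out):
    np.multiply(block, 2.0, out=out)

@kernel(np.int64)
def double_i8(block, out):
    np.left_shift(block, 1, out=out)

def gfunc_double(array, chunk=4):
    """Driver: pick a kernel by dtype, then feed it one chunk at a time."""
    func = KERNELS[array.dtype]
    out = np.empty_like(array)
    for start in range(0, array.size, chunk):
        stop = start + chunk
        func(array[start:stop], out[start:stop])
    return out

print(gfunc_double(np.arange(6, dtype=np.int64)))
print(gfunc_double(np.array([0.5, 1.5, 2.5])))
```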
  • 50.
    Early Timeline

    Date            Milestone
    July 2012       Pre-alpha release
    December 2012   Early Beta Release
    June 2013       Version 1.0
  • 51.
    PyData
    All computing modules known to work with Blaze will be placed under the PyData umbrella of projects over the coming years.
  • 52.
    Introducing Numba (lots of kernels to write)
  • 53.
    NumPy Users
    • Want to be able to write Python to get fast code that works on arrays and scalars
    • Need access to a boat-load of C-extensions (NumPy is just the beginning)
    PyPy doesn’t cut it for us!
  • 54.
    [Figure: a Python Function goes through Dynamic Compilation to a function pointer used by the NumPy Runtime. Labeled uses: Ufuncs, Generalized UFuncs, Window Kernel Funcs, Function-based Indexing, Memory Filters, I/O Filters, Reduction Filters, Computed Columns]
  • 55.
    SciPy needs a Python compiler
    optimize, integrate, special, ode
    writing more of SciPy at high-level
  • 56.
    Numba -- a Python compiler
    • Replays byte-code on a stack with simple type inference
    • Translates to LLVM (using LLVM-py)
    • Uses LLVM for code-gen
    • Resulting C-level function-pointer can be inserted into NumPy run-time
    • Understands NumPy arrays
    • Is NumPy / SciPy aware
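To get a feel for the input to the first step, the standard-library `dis` module can list the CPython byte-code of a small function. This shows only what a byte-code-driven compiler would read, not Numba's own implementation. The function `axpy` is made up for the example, and opcode names differ between Python versions.

```python
import dis

def axpy(a, x, y):
    """Small numeric function of the kind a compiler would target."""
    return a * x + y

# Each instruction pushes to or pops from CPython's evaluation stack.
instructions = list(dis.get_instructions(axpy))
opnames = [ins.opname for ins in instructions]

for ins in instructions:
    print(ins.opname, ins.argrepr)
```

Replaying these instructions while tracking the type of every stack slot, instead of its value, is one way to picture "simple type inference" over byte-code.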
  • 57.
    NumPy + Mamba = Numba
    [Figure: Python Function to Machine Code through LLVM-PY and LLVM 3.1. Other labels: ISPC, OpenCL, OpenMP, CUDA, CLANG; Intel, AMD, Nvidia, Apple]
  • 58.
  • 59.
  • 60.
    Software Stack Future?
    Plateaus of Code re-use + DSLs
    [Figure: stack diagram listing SQL, R, TDPL, Matlab, Python, OBJC, C, FORTRAN, C++, LLVM]
  • 61.
    How to pay for all this?
  • 62.
    Dual strategy Blaze
  • 63.
    NumFOCUS
    Num(Py) Foundation for Open Code for Usable Science
  • 64.
    NumFOCUS
    • Mission
      • To initiate and support educational programs furthering the use of open source software in science.
      • To promote the use of high-level languages and open source in science, engineering, and math research
      • To encourage reproducible scientific research
      • To provide infrastructure and support for open source projects for technical computing
  • 65.
    NumFOCUS
    • Activities
      • Sponsor sprints and conferences
      • Provide scholarships and grants for people using these tools
      • Pay for documentation development and basic course development
      • Fund continuous integration and build systems
      • Work with domain-specific organizations
      • Raise funds from industries using Python and NumPy
  • 66.
    NumFOCUS
    Core Projects: NumPy, SciPy, IPython, Matplotlib
    Other Projects (seeking more --- need representatives): Scikits Image
  • 67.
    NumFOCUS
    • Directors
      • Perry Greenfield
      • John Hunter
      • Jarrod Millman
      • Travis Oliphant
      • Fernando Perez
    • Members
      • Basically people who donate for now. In time, a body that elects directors.
  • 68.
    Large-scale data analysis products
    • Python and NumPy training
    • NumPy support and consulting
    • Rich-client or web user-interfaces
    • Blaze and PyData Development