Python for Science and Engineering
Dr Edward Schoeld A*STAR / Singapore Computational Sciences Club Seminar June 14, 2011
Scientic programming in 2011
Most scientists and engineers are: programming for 50+% of their work time (and rising) self-taught programmers using inefcient programming practices using the wrong programming languages: C++, FORTRAN, C#, PHP, Java, ...
Scientic programming needs
Rapid prototyping Efciency for computational kernels Pre-written packages! Vectors, matrices, modelling, simulations, visualisation Extensibility; web front-ends; database backends; ...
Ed's story: How I found Python
PhD in statistical pattern recognition: 2001-2006 Needed good tools for my research! Discovered Python in 2002 after frustration with C++, Matlab, Java, Perl Contributed to NumPy and SciPy: maxent, sparse matrices, optimization, Monte Carlo, etc. Managed six releases of SciPy in 2005-6
1. Why Python?
Introducing Python
What is it? What is it good for? Who uses it?
What is Python?
interpreted strongly but dynamically typed object-oriented intuitive, readable open source, free batteries included
batteries included
Pythons standard library is: very large well-supported well-documented
Pythons standard library
data types operating system CGI testing calendar strings compression complex numbers multimedia email networking GUI FTP databases XML threads arguments cryptography CSV les serialization
What is an efcient programming language?
Native Python code executes 10x more slowly than C and FORTRAN
Would you build a racing car ...
... to get to Kuala Lumpur ASAP?
Date 1961 1984 1997 2000, Apr 2003, Aug 2007, Mar 2009, Sep
Cost per GFLOPS (US $) US $1.1 trillion US $15,000,000 US $30,000 $1000 $82 $0.42 $0.13
Technology 17 million IBM 1620s Cray X-MP Two 16-CPU clusters of Pentiums Bunyip Beowulf cluster KASY0 Ambric AM2045 ATI Radeon R800 Source: Wikipedia: FLOPS
Unit labor cost growth
Proxy for cost of programmer time
Efciency
When FORTRAN was invented, computer time was more expensive than programmer time. In the 1980s and 1990s that reversed.
Efcient programming
Python code is 10x faster to write than C and FORTRAN
What if ...
... you now need to reach Sydney?
Advantages of Python
Easy to write Easy to maintain Great standard libraries Thriving ecosystem of third-party packages Open source
Batteries included
Pythons standard library is: very large well supported well documented
Pythons standard library
data types operating system CGI testing calendar strings compression complex numbers multimedia email networking GUI FTP databases XML threads arguments cryptography CSV les serialization
Question
What is the date 177 days from now?
Natural applications of Python
Rapid prototyping Plotting, visualisation, 3D Numerical computing Web and database programming All-purpose glue
Python vs other languages
Languages used at CSIRO
Python Fortran Java
Matlab IDL Perl
C C++ C#
VB.net R +5-10 others!
Which language do I choose?
A different language for each task? A language you know? A language others in your team are using: support and help?
Python Interpreted Powerful data input/output Great plotting General-purpose language Cost Open source Yes Yes Yes Powerful Free Yes
Matlab Yes Yes Yes Limited $$$ No
Python Powerful Portable Standard libraries Easy to write and maintain Easy to learn Yes Yes Vast Yes Yes
C++ Yes In theory Limited No No
Python
Fast to write Good for embedded systems, device drivers and operating systems Good for most other high-level tasks
Yes
No
No
Yes
Yes
No
Standard library
Vast
Limited
Python Powerful, well-designed language Standard libraries Easy to learn Code brevity Easy to write and maintain Yes Vast Yes Short Yes
Java Yes Vast No Verbose Okay
Open source
Python is open source software Benets: No vendor lock-in Cross-platform Insurance against bugs in the platform Free
Python success stories
Computer graphics: Industrial Light & Magic Web: Google: News, Groups, Maps, Gmail Legacy system integration: AstraZeneca - collaborative drug discovery
Python success stories (2)
Aerospace: NASA Research: universities worldwide ... Others: YouTube, Reddit, BitTorrent, Civilization IV,
Industrial Light & Magic
Python spread from scripting to the entire production pipeline Numerous reviews since 1996: Python is still the best tool for them
United Space Alliance
A common sentiment: We achieve immediate functioning code so much faster in Python than in any other language that its staggering. - Robin Friedrich, Senior Project Engineer
Case study: air-trafc control
Eric Newton, Python for Critical Applications: http://
metaslash.com/brochure/ recall.html
Metaslash, Inc: 1999 to 2001 Mission-critical system for air-trafc control Replicated, fault-tolerant data storage
Case study: air-trafc control
Python prototype -> C++ implementation -> Python again Why? C++ dependencies were buggy C++ threads, STL were not portable enough Pythons advantages over C++ More portable 75% less code: more productivity, fewer bugs
More case studies
See http://www.python.org/about/success/ for lots more case studies and success stories
2. The scientic Python ecosystem
Scientic software development
Small beginnings Piecemeal growth, quirky interfaces ... Large, cumbersome systems
NumPy
An n-dimensional array/matrix package
NumPy
Centre of Pythons numerical computing ecosystem
NumPy
The most fundamental tool for numerical computing in Python Fast multi-dimensional array capability
What NumPy denes:
Two fundamental objects: 1. n-dimensional array
2. universal function
a rich set of numerical data types nearly 400 functions and methods on arrays: type conversions mathematical logical
NumPy's features
Fast. Written in C with BLAS/LAPACK hooks. Rich set of data types Linear algebra: matrix inversion, decompositions, Discrete Fourier transforms Random number generation Trig, hypergeometric functions, etc.
Elementwise array operations
Loops are mostly unnecessary Operate on entire arrays!
>>> a = numpy.array([20, 30, 40, 50]) >>> a < 35 array([True, True, False, False], dtype=bool) >>> b = numpy.arange(4) >>> a - b array([20, 29, 38, 47]) >>> b**2 array([0, 1, 4, 9])
Universal functions
NumPy denes 'ufuncs' that operate on entire arrays and other sequences (hence 'universal') Example: sin()
>>> a = numpy.array([20, 30, 40, 50]) >>> c = 10 * numpy.sin(a) >>> c array([ 9.12945251, -9.88031624, 7.4511316 , -2.62374854])
Array slicing
Arrays can be sliced and indexed powerfully:
>>> a = numpy.arange(10)**3 >>> a array([ 0, 1, 8, 27, 64, 125, 216, 343, 512, 729]) >>> a[2:5] array([ 8, 27, 64])
Fancy indexing
Arrays can be used as indices into other arrays:
>>> a = numpy.arange(12)**2 >>> ind = numpy.array([ 1, 1, 3, 8, 5 ]) >>> a[ind] array([ 1, 1, 9, 64, 25])
Other linear algebra features
Matrix inversion: mat(A).I Or: linalg.inv(A) Linear solvers: linalg.solve(A, x)
Pseudoinverse: linalg.pinv(A)
What is SciPy?
A community A conference A package of scientic libraries
Python for scientic software
Back-end: computational work Front-end: input / output, visualization, GUIs Dozens of great scientic packages exist
Python in science (2)
NumPy: numerical / array module Matplotlib: great 2D and 3D plotting library IPython: nice interactive Python shell SciPy: set of scientic libraries: sparse matrices, signal processing, RPy: integration with the R statistical environment
Python in science (3)
Cython: C language extensions Mayavi: 3D graphics, volumetric rendering Nitimes, Nipype: Python tools for neuroimaging SymPy: symbolic mathematics library
Python in science (4)
VPython: easy, real-time 3D programming UCSF Chimera, PyMOL, VMD: molecular graphics PyRAF: Hubble Space Telescope interface to RAF astronomical data BioPython: computational molecular biology Natural language toolkit: symbolic + statistical NLP Physics: PyROOT
The SciPy package
BSD-licensed software for maths, science, engineering
integration optimization interpolation FFTs clustering signal processing linear algebra ODEs n-dim image processing interpolation sparse matrices maximum entropy statistics scientic constants C/C++ and Fortran integration
SciPy optimisation example
Fit a model to noisy data: y = a/xb sin(cx)+
Example: tting a model with
scipy.optimize
Task: Fit a model of the form y = a/bx sin(cx)+ to noisy data. Spec: 1. Generate noisy data 2. Choose parameters (a, b, c) to minimize sum squared errors 3. Plot the data and tted model (next session)
SciPy optimisation example
import numpy import pylab from scipy.optimize import leastsq def myfunc(params, x): (a, b, c) = params return a / (x**b) * numpy.sin(c * x) true_params = [1.5, 0.1, 2.] def f(x): return myfunc(true_params, x) def err(params, x, y): # error function return myfunc(params, x) - y
SciPy optimisation example
# n x y y Generate noisy data to fit = 30; xmin = 0.1; xmax = 5 = numpy.linspace(xmin, xmax, n) = f(x) += numpy.rand(len(x)) * 0.2 * \ (y.max() - y.min())
v0 = [3., 1., 4.] # initial param estimate # Fitting v, success = leastsq(err, v0, args=(x, y), maxfev=10000) print 'Estimated parameters: ', v print 'True parameters: ', true_params X = numpy.linspace(xmin, xmax, 5 * n) pylab.plot(x, y, 'ro', X, myfunc(v, X)) pylab.show()
SciPy optimisation example
Fit a model to noisy data: y = a/xb sin(cx)+
Ingredients for this example
numpy.linspace numpy.random.rand for the noise model (uniform) scipy.optimize.leastsq
Sparse matrix example
Construct and solve a sparse linear system
Sparse matrices
Sparse matrices are mostly zeros. They can be symmetric or asymmetric. Sparsity patterns vary: block sparse, band matrices, ... They can be huge! Only non-zeros are stored.
Sparse matrices in SciPy
SciPy supports seven sparse storage schemes ... and sparse solvers in Fortran.
Sparse matrix creation
To construct a 1000x1000 lil_matrix and add values:
>>> from scipy.sparse import lil_matrix >>> from numpy.random import rand >>> from scipy.sparse.linalg import spsolve >>> >>> >>> >>> A = lil_matrix((1000, 1000)) A[0, :100] = rand(100) A[1, 100:200] = A[0, :100] A.setdiag(rand(1000))
Solving sparse matrix systems
Now convert the matrix to CSR format and solve Ax=b:
>>> A = A.tocsr() >>> b = rand(1000) >>> x = spsolve(A, b) # Convert it to a dense matrix and solve, and check that the result is the same: >>> from numpy.linalg import solve, norm >>> x_ = solve(A.todense(), b) # Compute norm of the error: >>> err = norm(x - x_) >>> err < 1e-10 True
Matplotlib
Great plotting package in Python Matlab-like syntax Great rendering: anti-aliasing etc. Many backends: Cairo, GTK, Cocoa, PDF Flexible output: to EPS, PS, PDF, TIFF, PNG, ...
Matplotlib: worked examples
Search the web for 'Matplotlib gallery'
Example: NumPy vectorization
1. Use a Monte Carlo algorithm to estimate : 1. Generate uniform random variates (x,%y) over [0, 1]. 2. Estimate from the proportion p that land in the unit circle. 2. Time two ways of doing this: 1. Using for loops 2. Using array operations (vectorized)
3. Scaling
HPC
High-performance computing
Aspects to HPC
Supercomputers Parallel programming Caches, shared memory Code porting Distributed clusters / grids Scripting Job control Specialized hardware
Python for HPC
Advantages Portability Easy scripting, glue Maintainability Proling to identify hotspots Vectorization with NumPy Disadvantages Global interpreter lock Less control than C Native loops are slow
Large data sets
Useful Python language features: Generators, iterators Useful packages: Great HDF5 support from PyTables!
Hierarchical data
Databases without the relational baggage
Great interface for HDF5 data
Efcient support for massive data sets
Applications of PyTables
aeronautics drug discovery nancial analysis climate prediction telecommunications data mining statistical analysis etc.
Breaking news: June 2011
PyTables Pro is now being open sourced. Indexed searches for speed Merging with PyTables Working project name: NewPyTables
PyTables performance
OPSI indexing engine speed: Querying 10 billion rows can take hundredths of a second! Target use-case: mostly read-only or append-only data
Principles for efcient code
Important principles
1. "Premature optimization is the root of all evil" Don't write cryptic code just to make it more efcient!
2. 1-5% of the code takes up the vast majority of the computing time! ... and it might not be the 1-5% that you think!
Checklist for efcient code
From most to least important: 1. Check: Do you really need to make it more efcient? 2. Check: Are you using the right algorithms and data structures? 3. Check: Are you reusing pre-written libraries wherever possible? 4. Check: Which parts of the code are expensive? Measure, don't guess!
Relative efciency gains
Exponential-order and polynomial-order speedups are possible by choosing the right algorithm for a task. These require the right data structures! These dwarf 10-25x linear-order speedups from: using lower-level languages using different language constructs.
4. About Python Charmers
The largest Python training provider in South-East Asia Delighted customers include:
Most popular course topics
Python for Programmers Python for Scientists and Engineers Python for Geoscientists Python for Bioinformaticians New courses: Python for Financial Engineers Python for IT Security Professionals 4 days 3 days 3 days 4 days 4 days 4 days
Python Charmers: Topics of expertise
Python: beginners, advanced Scientic data processing with Python Software engineering with Python Large-scale problems: HPC, huge data sets, grids Statistics and Monte Carlo problems
Python Charmers: Topics of expertise (2)
Spatial data analysis / GIS General scripting, job control, glue GUIs with PyQt Integrating with other languages: R, C, C++, Fortran, ... Web development in Django
How to get in touch
See PythonCharmers.com or email us at: info@pythoncharmers.com