Optimizing Python
FOSDEMx, 2018-05-03
Hello!
I am Eric Gazoni
I’m a Senior Python Developer at Adimian
You can find me at @ericgazoni
Why optimize?
And what to optimize
I/O
Improve read/write speed from network or filesystem
⬗ Data science (large data sets)
⬗ Databases
⬗ Telemetry (IoT)
MEMORY
Require less RAM from the system
⬗ Reduce hosting costs
⬗ Run on constrained devices (embedded systems)
⬗ Improve reliability
FAULT TOLERANCE / RESILIENCE
Continue operating even with bad or missing input
⬗ Web services
⬗ Medical devices
⬗ Distributed systems
CONCURRENCY
Serve more requests at the same time
⬗ Web servers
⬗ IoT controllers
⬗ Database engines
⬗ Web scrapers
CPU
Run code more efficiently
⬗ Reduce processing time (reporting, calculation)
⬗ Reduce response time (web pages)
⬗ Reduce energy consumption (and hosting costs)
ONLY ONE AT A TIME
⬗ Pick one category
⬗ Hack
⬗ Review
⬗ Rinse, repeat
Optimizing multiple domains at once = unpredictable results
General rules of optimization
Applies to all categories
TARGETS
Define clear targets or get lost in the performance maze
⬗ “This page must load below 200ms”
⬗ “One iteration of this loop must execute below 10ms”
⬗ “This must run on a controller with 8KB memory”
METRICS
⬗ You know whether you improved things or made them worse
◇ You can definitely make things worse!
⬗ You know whether you have reached your targets
3 RULES OF OPTIMIZATION
⬗ Benchmark
⬗ Benchmark
⬗ Benchmark
“Gut feeling” vs Reality
“Trust, but verify”
Russian proverb
IT’S A JUNGLE OUT THERE
User land
⬗ Your program
⬗ Implementation of the interpreter (py2/py3/pypy)
⬗ Implementation of the standard library of the interpreter's own language (C99/C11/…)
Operating system
⬗ Implementation of the OS kernel (linux/windows/unix/…)
⬗ Filesystem layout (ext4/NTFS/BTRFS/...)
⬗ Implementation of the hardware drivers (proprietary Nvidia drivers)
Hardware
⬗ CPU architecture (x86/ARM/…)
⬗ CPU extensions (SSE/MMX/…)
⬗ Memory / hard drive technology (spinning/flash/…)
⬗ Temperature (GPU/CPU/RAM/…)
⬗ Network card (Optical/Copper)
SAFETY NETS
⬗ Version control: rewind, pinpoint exactly what you did
⬗ Code coverage: make sure you didn’t break something
THE DEAD END
⬗ No shame in not succeeding
⬗ Know when to stop and change plans
⬗ There is always more than one tool in the box
Optimizer tools
YOUR TOOL BOX
⬗ Profiler
⬗ Profiling analyzer
⬗ timeit
⬗ Improved interpreter (ipython)
⬗ pytest-profiling
CAPTURING PROFILE
⬗ Profilers will capture all calls during program execution
⬗ Only capture what you need (reduce noise)
⬗ Stats (aggregated calls) can be dumped in the pstats binary format
PROFILING THE WHOLE PROGRAM
⬗ Will capture a lot of noise
⬗ Not invasive (can be run against any Python script)
$ python -m profile -o output.pstats myscript.py
NOTE ON PROFILERS
Running code with a profiler is similar to driving with the parking brake on!
Don’t forget to disable it when you are done!
EMBEDDING THE PROFILER
Profiling the complete program - importlib sits at the top
Profiling only the interesting function
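The original slides showed the embedded profiler as screenshots that did not survive extraction. The following is a minimal sketch of the idea, not the speaker's code: `interesting_function` and the optional `stats_path` argument are hypothetical names chosen for illustration. It enables `cProfile` only around the call under study, so import machinery stays out of the capture.

```python
import cProfile
import io
import pstats


def interesting_function(n):
    """Hypothetical CPU-bound function standing in for the code under study."""
    return sum(i * i for i in range(n))


def main(stats_path=None):
    profiler = cProfile.Profile()
    profiler.enable()    # start capturing here, after imports and setup
    result = interesting_function(100_000)
    profiler.disable()   # stop capturing: release the "parking brake"

    if stats_path is not None:
        # Dump the raw stats in pstats format for later analysis.
        profiler.dump_stats(stats_path)

    # Build a short text report, sorted by cumulative time.
    buffer = io.StringIO()
    pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()


if __name__ == "__main__":
    result, report = main("output.pstats")
    print(report)
```

Because capture starts after the imports have run, the report should be dominated by the profiled function rather than by startup noise.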
ANALYSIS OF THE PROFILE
1. Dump stats into a file
2. Load the file into gprof2dot
3. Use dot (from the graphviz package) to generate a png/svg representation
https://github.com/jrfonseca/gprof2dot
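The three steps above can be sketched as follows. This is an illustration under assumptions: `work` is a hypothetical workload, and the temporary file location is arbitrary. Step 1 is done in Python; steps 2 and 3 are shell commands, shown as a comment, and require gprof2dot and graphviz to be installed.

```python
import cProfile
import os
import pstats
import tempfile


def work():
    """Hypothetical workload, only here to have something to profile."""
    return sorted(str(i) for i in range(50_000))


def profile_to_file(path):
    """Step 1: run the workload under cProfile and dump the stats to `path`."""
    profiler = cProfile.Profile()
    profiler.runcall(work)
    profiler.dump_stats(path)


def load_stats(path):
    """Load a pstats file back into Python for a quick text inspection."""
    return pstats.Stats(path)


if __name__ == "__main__":
    path = os.path.join(tempfile.gettempdir(), "output.pstats")
    profile_to_file(path)
    load_stats(path).strip_dirs().sort_stats("cumulative").print_stats(10)
    # Steps 2 and 3, run from a shell on the dumped file:
    #   gprof2dot -f pstats output.pstats | dot -Tsvg -o output.svg
```

Inspecting the file with `pstats` first is a cheap sanity check that the capture contains what you expect before rendering a graph.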
python -m cProfile -o output.pstats myprogram.py
python myprogram.py (with profiler enabled within code)
%timeit magic command in ipython (shorthand for timeit module)
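Outside IPython, the same measurement can be made with the standard `timeit` module directly. The sketch below is illustrative only: the two string-joining functions and the `benchmark` helper are hypothetical examples, not taken from the talk.

```python
import timeit


def join_with_loop(words):
    """Concatenate with += in an explicit loop."""
    result = ""
    for word in words:
        result += word
    return result


def join_with_str_join(words):
    """Concatenate with a single str.join call."""
    return "".join(words)


def benchmark(func, words, repeat=5, number=200):
    """Return the best per-call time in seconds over `repeat` runs."""
    timings = timeit.repeat(lambda: func(words), repeat=repeat, number=number)
    return min(timings) / number


if __name__ == "__main__":
    words = ["spam"] * 1_000
    for func in (join_with_loop, join_with_str_join):
        per_call = benchmark(func, words)
        print(f"{func.__name__}: {per_call * 1e6:.1f} microseconds per call")
```

Taking the minimum of several repeats reduces the influence of unrelated activity on the machine, which fits the "benchmark, benchmark, benchmark" rule: compare numbers, not gut feelings.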
pytest-profiling
⬗ Useful to run against your unit tests
⬗ Integrated generation of pstats + svg output
https://github.com/manahl/pytest-plugins/tree/master/pytest-profiling
$ py.test test_cracking.py --profile-svg
Static analysis
The “low-hanging fruit”
LOW-HANGING FRUIT
⬗ Less intrusive
⬗ Low impact on maintenance
⬗ Usually bring the most significant improvements
E.g. reducing the number of calls, removing nested loops
EXAMPLE: PASSWORD BRUTE-FORCING
⬗ CPU intensive
⬗ Straightforward
This is very bad cryptography, for demonstration purposes only.
Don’t do this at home!
VOCABULARY
Hash: function that turns a given input into a given output
Brute-force: attempting random inputs in the hope of finding the one used initially, by comparing against a known output
Salt: additional factor added to increase the size of the input
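To make the three terms concrete, here is a small sketch. It is not the code from the talk, whose slides were screenshots: the choice of SHA-256, the way the salt is prepended, and the function names are all assumptions. It also enumerates candidates in order rather than drawing them at random, which keeps the example deterministic.

```python
import hashlib
from itertools import product
from string import ascii_lowercase


def hash_password(cleartext, salt):
    """Hash: the same (salt, cleartext) input always gives the same digest."""
    return hashlib.sha256((salt + cleartext).encode("utf-8")).hexdigest()


def brute_force(target_hash, salt, max_length=3):
    """Brute-force: try every lowercase candidate up to `max_length` characters
    and compare each digest against the known output."""
    for length in range(1, max_length + 1):
        for letters in product(ascii_lowercase, repeat=length):
            candidate = "".join(letters)
            if hash_password(candidate, salt) == target_hash:
                return candidate
    return None


if __name__ == "__main__":
    known_output = hash_password("abc", salt="42")
    print(brute_force(known_output, salt="42"))
```

With an unknown salt, the attacker has to repeat the whole search for every possible salt value, which is what makes the problem CPU intensive.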
EXAMPLE
numeric_salts() is called 110 times and accounts for ~10% of total time
FINDING INVARIANTS
⬗ If A calls B
⬗ And B does not use any input from A’s scope
⬗ Then B does not vary as a function of A
B could be called outside of A without affecting its output
B is invariant
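The rule above can be illustrated with a small before/after pair. This is a hypothetical reconstruction, not the code shown in the talk: only the name `numeric_salts` and its `salts_space` argument come from the slides, and `matches` is a deliberately trivial stand-in for the real hash comparison.

```python
def numeric_salts(salts_space):
    """Hypothetical salt generator: '0', '1', ... up to salts_space - 1."""
    return [str(number) for number in range(salts_space)]


def matches(cleartext, salt, target):
    """Stand-in for the real hash comparison, kept trivial on purpose."""
    return salt + cleartext == target


def crack_before(candidates, target, salts_space):
    """A (the loop body) calls B (numeric_salts) once per candidate."""
    found = []
    for cleartext in candidates:
        salts = numeric_salts(salts_space)  # same result on every iteration
        if any(matches(cleartext, salt, target) for salt in salts):
            found.append(cleartext)
    return found


def crack_after(candidates, target, salts_space):
    """B is invariant, so it is called once, outside the loop."""
    salts = numeric_salts(salts_space)
    found = []
    for cleartext in candidates:
        if any(matches(cleartext, salt, target) for salt in salts):
            found.append(cleartext)
    return found
```

Both versions return the same result; the second simply stops recomputing a value that never changes between iterations.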
generate_hashes() uses cleartext from the function scope
numeric_salts() uses salts_space, provided by caller
Extract the numeric_salts() call into the main function and pass only the result (salts)
numeric_salts() is now called only once and is no longer above the profiler threshold (~10%)
The UNIX time command reports 99% CPU usage, and a total of 7.379 seconds (wall time)
Parallel computing
“[...] an embarrassingly parallel [...] problem [...] is one where little or no effort is needed to separate the problem into a number of parallel tasks.”
Wikipedia
PARALLEL & SEQUENTIAL PROBLEMS
Parallel: if output from B does not depend on output from A
Sequential: if output from B depends on output from A
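A tiny sketch of the distinction, using two hypothetical functions that are not from the talk. Squaring each value is parallel in the sense above, so the input can be split into chunks and processed independently. A running total is sequential, because each output needs the previous one.

```python
from itertools import accumulate


def squares(values):
    """Parallel-friendly: each output depends only on its own input."""
    return [value * value for value in values]


def running_total(values):
    """Sequential: each output depends on the previous output."""
    return list(accumulate(values))


if __name__ == "__main__":
    first, second = [1, 2], [3, 4]
    # Processing the chunks separately gives the same answer for squares...
    print(squares(first) + squares(second) == squares(first + second))
    # ...but not for the running total, which loses the carried-over state.
    print(running_total(first) + running_total(second) == running_total(first + second))
```

Splitting the work only pays off when the chunks really are independent, which is the property to check before reaching for more processes.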
OUR PROBLEM?
Luckily, password cracking is embarrassingly parallel
pool.apply_async() will execute check_password on different processes (and CPUs)
In each process, we repeat the iterative checks for each salt, but for only 1 password
The UNIX time command reports 353% CPU usage, and a total of 4.328 seconds (wall time)
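The parallel version on the original slides was a screenshot. The sketch below shows the same shape under assumptions: `check_password` and `pool.apply_async()` are named on the slides, while the SHA-256 hashing, the `crack` helper, the argument lists, and the sample passwords are hypothetical.

```python
import hashlib
from multiprocessing import Pool


def hash_password(cleartext, salt):
    """Assumed hashing scheme: SHA-256 over salt + cleartext."""
    return hashlib.sha256((salt + cleartext).encode("utf-8")).hexdigest()


def check_password(cleartext, target_hash, salts):
    """Runs in a worker process: try every salt for ONE candidate password."""
    for salt in salts:
        if hash_password(cleartext, salt) == target_hash:
            return cleartext
    return None


def crack(candidates, target_hash, salts, processes=4):
    """Submit one task per candidate, then collect the first match."""
    with Pool(processes=processes) as pool:
        pending = [
            pool.apply_async(check_password, (candidate, target_hash, salts))
            for candidate in candidates
        ]
        for job in pending:
            found = job.get()
            if found is not None:
                return found
    return None


if __name__ == "__main__":
    salts = [str(number) for number in range(100)]
    target = hash_password("hunter2", "42")
    print(crack(["password", "letmein", "hunter2", "qwerty"], target, salts))
```

The `if __name__ == "__main__":` guard matters here: on platforms where worker processes re-import the main module, unguarded code would try to start a new pool in every worker.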
CPU USAGE
Single process
Parallel over 4 cores
Throwing more hardware at it
Effective, but often overlooked
BETTER SPECS
CPU speed depends on:
⬗ Pipeline architecture
⬗ Clock speed
⬗ L2 cache
Non-parallel problems only need faster CPU clocks
PARALLEL + MORE CPUs = WIN
For parallel problems:
⬗ Add CPUs
⬗ Add more computers with more CPUs
◇ Need to think about networking, queues, failover, …
http://www.celeryproject.org/
High-performance libraries
Not reinventing the wheel
UNDERSTANDING VECTORS
The iterative sum
⬗ Row after row
⬗ Each line can be different
The vectorized sum
⬗ Data is typed
⬗ Homogeneous dataset
⬗ Optimized operations on rows and columns
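The two approaches can be compared side by side. This is a minimal sketch with hypothetical function names, assuming NumPy is installed; it is meant to show the difference in shape, not to serve as a benchmark.

```python
import numpy as np


def iterative_sum(rows):
    """Row after row: each value may be any Python object that supports +."""
    total = 0
    for value in rows:
        total += value
    return total


def vectorized_sum(rows):
    """Typed, homogeneous data: one call, and the loop runs in compiled code."""
    return np.asarray(rows, dtype=np.float64).sum()


if __name__ == "__main__":
    data = [1, 2, 3.5]
    print(iterative_sum(data), vectorized_sum(data))
```

The vectorized version only works because every element can be converted to the same numeric type, which is the "homogeneous dataset" requirement from the slide.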
NUMPY
⬗ Centered around ndarray
⬗ Homogeneous type (if possible)
⬗ Non-sparse arrays (shape = rows * columns)
⬗ Close to C / Fortran API
⬗ Efficient numerical operations
⬗ Good integration with Cython
http://www.numpy.org/
PANDAS
⬗ Heavily based on NumPy
⬗ Series, DataFrame, Index
⬗ Batteries included:
◇ Integrations for reading/writing different formats
◇ Date/datetime/timezone handling
⬗ More user-friendly than NumPy
https://pandas.pydata.org/
Counting passwords containing the word “eric” in pure Python
Pure Python solution finds 16681 matches in 23 seconds
Pandas version - No explicit loop
Pandas finds 16625 matches in 19 seconds
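The code for both counts was shown as screenshots and is not available here, so the following is a hedged sketch of what the two versions might look like, assuming pandas is installed. The function names and the sample data are hypothetical, and the sketch does not attempt to reproduce or explain the difference between the two counts reported on the slides.

```python
import pandas as pd


def count_pure_python(passwords, word="eric"):
    """Explicit loop: test each password in turn."""
    count = 0
    for password in passwords:
        if word in password:
            count += 1
    return count


def count_pandas(passwords, word="eric"):
    """No explicit loop: a vectorized substring test over the whole Series."""
    series = pd.Series(passwords, dtype="object")
    return int(series.str.contains(word, regex=False).sum())


if __name__ == "__main__":
    sample = ["eric123", "Eric", "generic", "hello"]
    print(count_pure_python(sample), count_pandas(sample))
```

Passing `regex=False` asks for a plain substring match, which keeps the pandas version comparable to the `in` operator used in the loop.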
Cython
Reinventing the wheel
WHY NOT JUST WRITE C?
⬗ Write C code
⬗ Compile C code
⬗ Use CFFI or ctypes to load and call code
⬗ In “C land”
◇ Untangle PyObject yourself
◇ No exception mechanism
CYTHON
⬗ Compiles Python code to C ahead of time
⬗ Automatically links and wraps the code so it can be imported
⬗ Seamless transition between “C” and “Python” contexts
◇ Exceptions
◇ print()
◇ PyObject untangling
Regular Python code
C-typing variables
C-typing function
Cython annotate - White = C / Yellow = Python
PACKAGING & DISTRIBUTION
PyPy
Just in time to save the day
WHAT IS JIT OPTIMIZATION
The CPython compiler optimizes bytecode based on guesses about how the code will run
What if the compiler could optimize for how the code actually runs?
Just-in-time optimization monitors how the code is running and suggests bytecode optimizations on the fly
PYPY
⬗ Alternative Python implementation
◇ 100% compatible with Python 2.7 & 3.5
◇ not 100% compatible with (some) C libraries
⬗ Automatically rewrites internal logic for performance
⬗ Needs lots of data to make better decisions
http://pypy.org/
Create 5 million “messages”, count them and check the last one
CPython: 20.4 seconds vs PyPy: 6.6 seconds
JIT counter example - CPython is faster for 500 messages
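The benchmark code itself was a screenshot, so the sketch below is only a guess at its shape. What a "message" is in the talk is not shown; here it is assumed to be a small dict, and `make_messages` and `run` are hypothetical names. Running the same script under CPython and PyPy with a small and a large count is one way to observe the warm-up effect described on the slide.

```python
import sys
import time


def make_messages(count):
    """Hypothetical 'message': a small dict. The talk's actual type is not shown."""
    return [{"id": index, "body": "message %d" % index} for index in range(count)]


def run(count):
    """Create `count` messages, count them, and look at the last one."""
    start = time.perf_counter()
    messages = make_messages(count)
    total = len(messages)
    last = messages[-1]
    elapsed = time.perf_counter() - start
    return total, last, elapsed


if __name__ == "__main__":
    # The talk used 500 and 5 million; a smaller upper bound keeps memory modest.
    for count in (500, 500_000):
        total, last, elapsed = run(count)
        print(f"{sys.implementation.name}: {total} messages in {elapsed:.4f}s, "
              f"last id {last['id']}")
```

Timings from this sketch will not match the figures on the slides, since the message type, the counts, and the hardware all differ.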
JIT PROs & CONs
Pros:
⬗ Works on existing codebase
⬗ Ridiculously fast
⬗ Support for NumPy (not yet for pandas)
Cons:
⬗ No support for pandas
⬗ Another interpreter
⬗ Works best with pure-Python types
⬗ Needs “warm-up”
YOU CAN’T HAVE IT ALL
Optimization is always a trade-off with maintainability
Summary
⬗ Wide deployment
⬗ “Simple” codebase
1. Low-hanging fruit
2. Vectors
3. Better hardware
⬗ Sequential code
⬗ Limited deployment
1. Better hardware
2. PyPy
3. Cython
⬗ Embarrassingly parallel code
1. Worker threads
2. Worker processes
3. Throw more CPUs
Thanks!
Any questions?
You can find me at @ericgazoni & eric@adimian.com
Credits
Special thanks to all the people who made and released these awesome resources for free:
⬗ Presentation template by SlidesCarnival
⬗ Photographs by Unsplash
