Writing Fast Haskell
Elegance Is Not an Excuse for Bad Performance
Moritz Kiefer (@cocreature)
August 22, 2018
Introduction
• Haskellers often talk about elegant code
1
Introduction
• Haskellers often talk about elegant code
• But elegance is not an excuse for bad performance!
1
Introduction
• Haskellers often talk about elegant code
• But elegance is not an excuse for bad performance!
• Writing fast Haskell requires some understanding of GHC’s
internals
1
Introduction
• Haskellers often talk about elegant code
• But elegance is not an excuse for bad performance!
• Writing fast Haskell requires some understanding of GHC’s
internals
• GHC provides a surprising number of tools to influence
performance
1
Goals for Today
1. Learn to reason about performance
2. Look under the hood of GHC (specifically Core)
3. Learn about some rules of thumb for writing fast Haskell
4. Learn about primitives and libraries useful for writing fast
Haskell
2
Benchmarking/Profiling Disclaimer
• Benchmark before you optimize
• GHC supports options for time and space profiling
• Profiling can break optimizations
• Enable profiling selectively
3
Primitive, Unlifted and Boxed Types
Primitive Types
Correspond to “raw machine types”
E.g. Int#, Double#
4
Primitive, Unlifted and Boxed Types
Primitive Types
Correspond to “raw machine types”
E.g. Int#, Double#
Boxed Types
Represented by a pointer to a heap object
E.g. all user-defined types, Int
4
Primitive, Unlifted and Boxed Types
Primitive Types
Correspond to “raw machine types”
E.g. Int#, Double#
Boxed Types
Represented by a pointer to a heap object
E.g. all user-defined types, Int
Unlifted Types
Cannot be bottom
E.g. all primitive types but also Array# (which is not
primitive)
4
Definition of the Int type
data Int = I# Int#
Int
I# Int#
5
GHC Compilation Pipeline
Haskell
Core
STG
C--
Assembly LLVM IR
6
GHC Compilation Pipeline
Haskell Renaming, type checking, desugaring
Core
STG
C--
Assembly LLVM IR
6
GHC Compilation Pipeline
Haskell Renaming, type checking, desugaring
Core Simplifier
STG
C--
Assembly LLVM IR
6
GHC Compilation Pipeline
Haskell Renaming, type checking, desugaring
Core Simplifier
STG Canonicalized core without types
C--
Assembly LLVM IR
6
Core’s Expr Type
data Expr b
= Var Id
| Lit Literal
| App (Expr b) (Arg b)
| Lam b (Expr b)
| Let (Bind b) (Expr b)
| Case (Expr b) b Type [Alt b]
| Cast (Expr b) Coercion
| Tick (Tickish Id) (Expr b)
| Type Type
| Coercion Coercion
deriving Data
7
Viewing Core
• -ddump-simpl or -ddump-prep
• Suppress info that you don’t care about
• -dsuppress-idinfo
• -dsuppress-ticks
• -dsuppress-module-prefixes
• -dsuppress-all
• GHC plugin that outputs core as HTML
8
Mental Model for Core
let
Allocates a thunk on the heap
case
Forces evaluation to WHNF
9
Naive Sum
sum :: [Int] -> Int
sum [] = 0
sum (x : xs) = x + sum xs
10
Tail-Recursive Sum
sum :: [Int] -> Int
sum = go 0
where
go acc [] = acc
go acc (x : xs) = go (x + acc) xs
11
Force the Accumulator
sum :: [Int] -> Int
sum = go 0
where
go acc [] = acc
go acc (x : xs) =
let acc' = x + acc
in acc' `seq` go acc' xs
12
BangPatterns
{-# LANGUAGE BangPatterns #-}
sum :: [Int] -> Int
sum = go 0
where
go acc [] = acc
go acc (x : xs) =
let !acc' = x + acc
in go acc' xs
13
WHNF
• seq only evaluates to WHNF
• Be careful with tuples!
(x,y) `seq` … will not evaluate x and y
• Use the deepseq lib for evaluating to NF
14
Strictness Annotations in Data Types
data Point = Point !Int !Int
Whenever you evaluate Point to WHNF, you also evaluate the
two fields to WHNF.
Often easier to use than seq/BangPatterns
15
Avoiding Space Leaks
Rule of Thumb
Constant-size (e.g. Int) accumulators are often problematic
Detecting Space Leaks
• Limit the stack size +RTS -K${n}K
• Get a stacktrace with +RTS -xc -K${n}K
16
Specialization and Inlining
Specialization
• Specialize type parameters
• Remove type class dictionaries
Inlining
• Inline definition at call site
17
Cross-Module Specialization and Inlining
• Specialization/Inlining only possible if definition
(=unfolding) is available
• Unfoldings of small definitions are automatically exposed
• {-# INLINABLE f #-} forces GHC to expose f’s
unfolding
• You might also want to expose unfoldings of definitions
used by f
18
Specialization
• GHC will automatically try to specialize definitions at
use-sites
• Create specializations using
{-# SPECIALIZE f :: Int -> Int #-}
• Also creates specializations of functions called by f
19
Inlining
• {-# INLINE f #-} makes GHC very eager to inline f
• Use cautiously!
• Can blow up compile times significantly
• Can increase code size without bringing benefits
20
Inlining
• {-# INLINE f #-} makes GHC very eager to inline f
• Use cautiously!
• Can blow up compile times significantly
• Can increase code size without bringing benefits
• Note: {-# INLINABLE f #-} does not make GHC more
eager to inline f
20
Inlining
The following two definitions are equivalent.
f a b = …
f = \a b …
21
Inlining
Or are they?
f a b = …
f = \a b …
21
Call Arity
GHC will only inline
fully saturated function applications!
21
Controlling Memory Layout
data Point =
Point Int
Int
Point
Point Int Int
Int Int
I# Int# I# Int#
22
Controlling Memory Layout
data Point =
Point {-# UNPACK #-} !Int
{-# UNPACK #-} !Int
Point
Point Int# Int#
23
Automatic Unpacking
• GHC is quite good at automatic unpacking
• But only if it can detect that an argument is strict
• Sometimes you need to help it
f :: Vector Int -> …
f xs = …
where n = Vector.length xs
24
Automatic Unpacking
• GHC is quite good at automatic unpacking
• But only if it can detect that an argument is strict
• Sometimes you need to help it
f :: Vector Int -> …
f xs = …
where !n = Vector.length xs
24
Continuation Passing Style
main :: IO ()
main =
case loop2 100 (10, 10) of
(x, y) -> print (x - y)
loop2 :: Int -> (Int,Int) -> (Int,Int)
loop2 n (x, y)
| n > 0 = loop2 (n - 1) (x + 1, y - 1)
| otherwise = (x, y)
25
Continuation Passing Style
Convert
f :: a -> b
into
f :: a -> (b -> r) -> r
Can avoid allocations and unnecessary case distinctions
26
Continuation Passing Style
main :: IO ()
main =
loop2 100 (10, 10) $ \(x, y) -> print (x - y)
loop2 :: Int -> (Int,Int) -> ((Int,Int) -> r) -> r
loop2 n (x, y) cont
| n > 0 = loop2 (n - 1) (x + 1,y - 1) cont
| otherwise = cont (x, y)
27
Writing Your Own Optimizations: Rewrite Rules
map Fusion
map f . map g = map (f . g)
28
Writing Your Own Optimizations: Rewrite Rules
map Fusion
map f . map g = map (f . g)
build/foldr Fusion
build
:: (forall b. (a -> b -> b) -> b -> b)
-> [a]
foldr :: (a -> b -> b) -> b -> [a] -> b
foldr f a (build g) = g f a
28
Writing your Own Optimizations: Rewrite Rules
Example
{-# RULES
"map/map"
forall f g xs. map f (map g xs) =
map (f.g) xs
#-}
• GHC does not check correctness of rules
• GHC does not check termination of rules
• Use phases to control interaction of rules and inlining
29
Array Primitives in GHC
Array#
• Array of boxed values
• Card table to avoid having to scan unmodified entries in
GC
SmallArray#
• Array of boxed values
• No card table
ByteArray#
• Region of raw memory
• Pinned and unpinned
30
High-Level Array Libraries
• primitive provides PrimArray wrapper around
ByteArray#
• vector provides boxed, unboxed and Storable vectors
• Fusion
• Slicing
31
Basic Data Structures
• containers is mostly pretty good!
• Use the specialized data structures for Int: IntSet and
IntMap
• unordered-containers has a fast, persistent HashMap
• Mutable hashtables from the hashtables package are
often slower
32
Conclusion
• GHC is impressively good at optimizing high-level code
• Reasoning about performance isn’t trivial but definitely
possible
• GHC gives us the tools to control specific aspects of our
programs
33
Conclusion
• GHC is impressively good at optimizing high-level code
• Reasoning about performance isn’t trivial but definitely
possible
• GHC gives us the tools to control specific aspects of our
programs
• If all else fails, GHC has a great C FFI
33
More Information
• The Spineless Tagless G-machine
• Detecting Space Leaks
• Inlining and Specialisation
• GHC User’s Guide
• The GHC Commentary
34