Programming Languages For LHC
Jim Pivarski
May 8, 2019
But that just ended a few minutes ago.
Because, you know, it’s different water.
So why do we say it’s the same river?
Most of computer science is about abstracting details, too.
double bessel_j0(double x) {     // ← one value goes in
    double out;
    if (fabs(x) < 8.0) {
        double y = x*x;
        double ans1 = 57568490574.0 + y*(-13362590354.0 + y*(651619640.7
            + y*(-11214424.18 + y*(77392.33017 + y*(-184.9052456)))));
        double ans2 = 57568490411.0 + y*(1029532985.0 + y*(9494680.718
            + y*(59272.64853 + y*(267.8532712 + y*1.0))));
        out = ans1 / ans2;
    }
    else {
        double z = 8.0 / fabs(x);
        double y = z*z;
        double xx = fabs(x) - 0.785398164;
        double ans1 = 1.0 + y*(-0.1098628627e-2 + y*(0.2734510407e-4
            + y*(-0.2073370639e-5 + y*0.2093887211e-6)));
        double ans2 = -0.1562499995e-1 + y*(0.1430488765e-3
            + y*(-0.6911147651e-5 + y*(0.7621095161e-6
            - y*0.934935152e-7)));
        out = sqrt(0.636619772/fabs(x))*(cos(xx)*ans1 - z*sin(xx)*ans2);
    }
    return out;                  // ← one value comes out
}
The abstraction is cumulative:
every function/class/module has an interior and an interface;
minimizing (#external parameters) / (#internal parameters)
reduces the mental burden on programmers and users.
Science has layers of abstraction
(Figure, cartoon diagram, not to scale: #internal parameters versus #external parameters
for computer programming, machine learning, thermodynamics, and abstraction in science:
atom → proton → quark.)
Software interfaces can be exact, despite radical internal differences.
▶ Super Mario Bros. entirely rewritten in JavaScript by Josh Goldberg.
▶ Shares none of the original code, but behaves identically.
As a young programmer, I wasn’t satisfied
with high-level languages because I wanted
to get down to the “real” computer.
The objectively real part of a computer is a set
of physical states that we interpret as computations.
Programming languages are how we describe our interpretations.
XIX + IV = XXIII
19 + 4 = 23
Programming languages differ in their degree of abstraction,
but all programming languages are for humans, not computers.
Originally, programming languages didn’t push the abacus beads.
John McCarthy, creator of Lisp: “This EVAL was written and published in the paper and
Steve Russell said, ‘Look, why don’t I program this EVAL?’ and I said to him, ‘Ho, ho,
you’re confusing theory with practice—this EVAL is intended for reading, not for
computing!’ But he went ahead and did it.”
APL (ancestor of MATLAB, R, and Numpy) was also a notation for describing programs
years before it was executable. The book was named A Programming Language.
Programmers had to manually translate
these notations into instruction codes.
The Software Crisis
Now that our programming languages do push abacus beads, software engineering
has become an odd discipline: saying something is the same as making it.
We favor high-level languages because they have fewer concepts,
hopefully just the ones that are essential for a problem.
Except Python. Python is slow, right?
https://benchmarksgame-team.pages.debian.net/benchmarksgame
But it really isn’t the language; it’s the implementation.
(Code example lost in extraction: a fractal computed with Numpy, beginning
with `import numpy` and ending with `return fractal`.)
Here’s the catch
Pure Python is slower than Numba or C because it has more hurdles in the way:
dynamic typing, pointer-chasing, garbage collection, hashtables, string equality…
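The point can be demonstrated with nothing but the standard library (a minimal sketch, not from the talk): the same reduction written as a Python-level loop and as the built-in sum, whose loop runs inside CPython's C implementation.

```python
import timeit

data = list(range(1_000_000))

def python_loop(values):
    # Each iteration pays for bytecode dispatch, dynamic type checks,
    # and boxed integer arithmetic.
    total = 0
    for v in values:
        total += v
    return total

# Same language, same answer; only the implementation of the loop differs.
assert python_loop(data) == sum(data)

t_loop = timeit.timeit(lambda: python_loop(data), number=3)
t_sum = timeit.timeit(lambda: sum(data), number=3)
print(f"Python-level loop: {t_loop:.3f} s   built-in sum: {t_sum:.3f} s")
```

On a typical CPython, the built-in is several times faster, even though both are "Python."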
Greg Owen’s talk on Spark 2.0
So although it’s the implementation, not the language, that’s slow…
Domain-specific languages:
specialized languages for narrowly defined problems.
Domain-specific languages that you’re probably already using
Any guesses?
Domain-specific languages that you’re probably already using
Regular expressions
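For example (a stdlib sketch with invented text, not from the talk): a regular expression is a complete little language, embedded in a host-language string and compiled at run time.

```python
import re

# A program in the regex language: "digits, optional decimal part, then GeV".
pattern = re.compile(r"(\d+(?:\.\d+)?)\s*GeV")

text = "Selected muons with pT > 25 GeV and mass near 91.19 GeV."
print(pattern.findall(text))
```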
Domain-specific languages that you’re probably already using
TTree::Draw (TTreeFormula), including looping and reducing constructs:
    ttree->Draw("lep1_p4.X() + lep1_p4.Y()");
(Resulting histogram of lep1_p4.X() + lep1_p4.Y() lost in extraction.)
Domain-specific languages that you’re probably already using
Makefiles
Domain-specific languages that you’re probably already using
Format strings
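For example (a stdlib sketch, not from the talk): a format string is a mini-language for layout, interpreted by the host language.

```python
x = 3.14159265

# "%8.3f" and "{:8.3f}" are programs in a tiny layout language:
# field width 8, with 3 digits after the decimal point.
old_style = "%8.3f" % x
new_style = "{:8.3f}".format(x)

print(repr(old_style))
print(repr(new_style))
```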
External versus internal (embedded) domain-specific languages
External: SQL has a distinct syntax from Python; must be quoted in PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("""
    SELECT CONCAT(first, " ", last) AS fullname, AVG(age)
    FROM my_table WHERE age BETWEEN 18 AND 24
    GROUP BY fullname
""")
Objection: a collection of libraries and operator overloads isn’t a language!
(One might as well argue about the distinction between languages and dialects.)
Perhaps the most widespread domain-specific language in data analysis:
SQL
But we rarely use it in particle physics. Why?
Structure of a collider physics query: C++
“Momentum of the track with |η| < 2.4 that has the most hits.”
Track *best = NULL;
for (int i = 0; i < event->tracks.size(); i++) {
    // loop over event->tracks (middle of this snippet was lost in
    // extraction; names like hit_count are reconstructed from the query)
    Track *track = event->tracks[i];
    if (fabs(track->eta) < 2.4 &&
            (best == NULL || track->hit_count > best->hit_count))
        best = track;
}
if (best != NULL)
    return best->pt;
else
    return 0.0;
Structure of a collider physics query: SQL
“Momentum of the track with |η| < 2.4 that has the most hits.”
WITH hit_stats AS (
    SELECT hit.track_id, COUNT(*) AS hit_count FROM hit
    GROUP BY hit.track_id),
track_sorted AS (
    SELECT track.*,
        ROW_NUMBER() OVER (
            PARTITION BY track.event_id
            ORDER BY hit_stats.hit_count DESC)
        track_ordinal
    FROM track INNER JOIN hit_stats
        ON hit_stats.track_id = track.id
    WHERE ABS(track.eta) < 2.4)
SELECT * FROM event INNER JOIN track_sorted
    ON track_sorted.event_id = event.id
WHERE track_sorted.track_ordinal = 1
The problem is that collisions produce a variable number of particles per event:
the tables are “jagged.”
SQL makes particle physics problems harder, not easier, which defeats the point.
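A minimal sketch of what "jagged" means (hypothetical data, not from the talk): each event owns a variable-length list of particles, which maps naturally onto lists of lists but not onto a flat SQL table.

```python
# Hypothetical events: each holds a variable number of muon pT values.
events = [
    [31.2, 18.4, 9.7],   # event with 3 muons
    [],                  # event with none
    [42.0],              # event with 1
    [27.5, 11.1],
]

# A per-event reduction ("leading muon pT, or None") is a one-liner here...
leading = [max(ev) if ev else None for ev in events]

# ...but a flat SQL table must normalize the same structure into
# (event_id, pt) rows and recover it with GROUP BY / window functions.
rows = [(i, pt) for i, ev in enumerate(events) for pt in ev]

print(leading)
print(rows[:3])
```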
It seems like there’s an opportunity here
In fact, about that SQL…
Why hasn’t this been done before?
I think the answer is cultural, so I’ll take a historical perspective…
Starting in 1880.
The U.S. Census’s problem
The U.S. does a census every 10 years. The 1880 census took 8 years to process.
→ Big data problem!
Held a competition for a new method; winner was 10× faster than the rest:
Census records on punch cards, which filtered electrical contacts
Wired to a machine that opens a door for each matching pattern
It was an SQL machine: 3 basic clauses of most SQL queries
SELECT: pre-programmed (wired up) counters
WHERE: pins pass through punch card and template
Herman Hollerith (inventor) incorporated the Tabulating Machine Company, which
after a series of mergers became International Business Machines (IBM) in 1924.
Google’s problem
In the early 2000’s, Google was struggling to keep up with the growing web
(index 5 months out of date, routine hardware failures, scale sensitive to bit flips).
At that time, each programmer had to divide tasks, distribute, and combine
results manually and account for failures manually.
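The bookkeeping Google later automated can be sketched in miniature (stdlib only, not Google's actual API): the programmer supplies only a per-shard map function and a pairwise reduce; dividing, distributing, and combining become the framework's job.

```python
from collections import Counter
from functools import reduce

# Hypothetical text shards, standing in for pieces of the web index.
shards = [
    "the web grows the web wins",
    "index the web",
    "grows and grows",
]

def map_shard(text):
    # Runs independently on each shard (in reality, on each machine).
    return Counter(text.split())

def reduce_counts(a, b):
    # Merges partial results from two shards.
    return a + b

total = reduce(reduce_counts, map(map_shard, shards))
print(total["the"], total["web"], total["grows"])
```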
That’s how statisticians encountered computing.
Physicists got into computers when they became general-purpose
Eckert-Mauchly Computer Corporation → Remington Rand
Physicists drove programming language development in the 1940’s and 1950’s but
stuck with FORTRAN until the 21st century.
In fact, FORTRAN (pre-Fortran 90) wasn’t even a good fit to data analysis problems.
It didn’t handle jagged data well, much like SQL.
This gap was filled with a library: ZEBRA provided a graph of structures and dynamic
memory management, even though these were features of Ada, C, Pascal, and PL/I.
Ironically, a very similar talk was given almost 20 years ago today
Are we halfway through the second major language shift?
(Figure: counts per year, 2008–2019, vertical axis 0–800, by language:
C/C++, Python, Jupyter Notebook, TeX, Java, R, VHDL, FORTRAN, Julia, Go.)
The shift from Fortran to C++ was a decision made by collaboration leaders.
What we see here are individuals choosing a language for their own work.
Informal summary of the workshop at tomorrow’s
LPC Physics Forum at 1:30pm.