Add diesel to the rustc perf test suite by weiznich · Pull Request #807 · rust-lang/rustc-perf · GitHub

@weiznich
Contributor

@weiznich weiznich commented Dec 3, 2020

As far as I know, diesel is a rather strange workload for rustc.
According to some short measurements, most of the time compiling diesel is
spent type checking the crate and resolving trait bounds. I see multiple
reasons for this:

  • Diesel builds a complex abstract query DSL for SQL based on Rust
    generics. All of this needs to be type checked.
  • Diesel generates a ton of trait impls for tuples of various sizes.
    There are features to set the supported max size to 16, 32, 64 and 128
    tuple elements. As this is a benchmark, I've chosen to set it to 128 to
    maximize the number of impls.

As a consequence, diesel's compile times are quite sensitive to
changes touching the type system in general and trait resolution in
particular. Any change that introduces behavior which does not scale
well with the number of available trait impls will likely result in a huge
increase for this benchmark.
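
To illustrate the second point, here is a minimal sketch of how one macro call can stamp out a trait impl per tuple size. This is not diesel's actual code: the trait `ColumnCount` and the macro `impl_for_tuples!` are hypothetical names chosen for the example, and it stops at four elements where the benchmark configuration goes up to 128.

```rust
/// Hypothetical trait; stands in for the many traits implemented per tuple size.
trait ColumnCount {
    const COUNT: usize;
}

/// Hypothetical macro: emits one `impl` for every tuple shape it is given.
macro_rules! impl_for_tuples {
    ($( ($($T:ident),+) ),+ $(,)?) => {
        $(
            impl<$($T),+> ColumnCount for ($($T,)+) {
                // The number of type parameters equals the tuple arity.
                const COUNT: usize = [$(stringify!($T)),+].len();
            }
        )+
    };
}

// Each additional line adds another impl the compiler has to expand,
// type check and consider during trait resolution.
impl_for_tuples! {
    (A),
    (A, B),
    (A, B, C),
    (A, B, C, D),
}

/// Looks up the impl for a three-element tuple.
pub fn columns_in_triple() -> usize {
    <(u8, u16, u32) as ColumnCount>::COUNT
}
```

With several traits implemented this way, raising the maximum size increases both the number of impls and the size of each impl, which is presumably why the cost grows so quickly.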

As suggested by @jyn514 in rust-lang/rust#79599

@weiznich weiznich force-pushed the add_diesel_to_rustc_bench branch 2 times, most recently from 886a95e to 31fd220 Compare December 3, 2020 12:45
@jyn514
Member

jyn514 commented Dec 3, 2020

Hmm, I just realized this has a lot of overlap with #802. Maybe we should only add one or the other? I like #802 just because I know it stresses rustdoc - @weiznich how long does rustdoc take to run on diesel?

@weiznich
Contributor Author

weiznich commented Dec 3, 2020

@jyn514 I'm not sure if #802 is really comparable. Diesel stresses not only rustdoc, but all of rustc's trait-resolution-related code. We've hit quite a few performance-related issues there. (See some of the issues I opened in the rustc repo.) The submitted test configuration really stresses rustc/rustdoc/… as it implements a lot of traits for tuples of up to 128 values. (That happens via this macro.) As already mentioned here, that takes quite a lot of time. (On my laptop it takes ~17 minutes to run cargo doc --no-deps, while requiring >10GB RAM.)

Edit: I should probably add that I do not expect this to get much better any time soon. I assume that significant improvements would need substantial changes to diesel + maybe something like variadic generics.

@weiznich weiznich force-pushed the add_diesel_to_rustc_bench branch from 31fd220 to 5d6ade5 Compare December 3, 2020 13:32
@Mark-Simulacrum
Member

It looks like this adds roughly 1.5 hours of CI time to some of our builders here, and while the perf machine is more powerful we cannot afford that big a timesink for just one benchmark. I imagine lowering from 128 to e.g. 32 might help there, but as-is I cannot merge this.

@weiznich
Contributor Author

@Mark-Simulacrum I can just lower the supported tuple size to 32 if that helps, but I should probably add that in my experience the compile times do not scale linearly here, so going from supporting tuples of up to 32 elements to tuples of up to 64 elements more than doubles the compile time. It will definitely speed up the benchmark by a lot, but I'm not sure what exactly that means for compiler internals.

@weiznich
Contributor Author

I've pushed the 32-column variant. If that continues to take too much time, we can go down another step and use the implicit default 16-column feature.

@jyn514
Member

jyn514 commented Dec 15, 2020

@weiznich a way to check it's stressing the same things is by running the benchmark locally with --self-profile. If roughly the same queries are taking most of the time, that means it's still a good benchmark.

@weiznich
Contributor Author

The following comment contains the output of summarize summarize -p 1 profile for running cargo build using the different feature flags on diesel:

`default = []` (so only generate impls for up to 16 tuple elements)
| Item | Self time | % of total time | Time | Item count |
|---|---|---|---|---|
| typeck | 1.23s | 18.973 | 1.36s | 1817 |
| expand_crate | 729.72ms | 11.263 | 749.28ms | 1 |
| mir_borrowck | 578.55ms | 8.930 | 1.41s | 1817 |
| check_item_well_formed | 328.82ms | 5.075 | 412.05ms | 5520 |
| evaluate_obligation | 211.08ms | 3.258 | 216.04ms | 21291 |
| mir_built | 206.45ms | 3.187 | 308.50ms | 1817 |
| LLVM_module_codegen_emit_obj | 171.76ms | 2.651 | 171.76ms | 66 |
| optimized_mir | 171.70ms | 2.650 | 499.90ms | 1988 |
| type_op_prove_predicate | 162.80ms | 2.513 | 163.16ms | 13902 |
| mir_drops_elaborated_and_const_checked | 160.75ms | 2.481 | 262.09ms | 1817 |
| check_impl_item_well_formed | 157.86ms | 2.436 | 239.52ms | 2486 |
| check_mod_item_types | 152.42ms | 2.353 | 163.29ms | 135 |
| normalize_projection_ty | 126.15ms | 1.947 | 126.24ms | 4664 |
| type_op_ascribe_user_type | 111.36ms | 1.719 | 111.41ms | 1806 |
| resolve_crate | 109.78ms | 1.694 | 109.78ms | 1 |
| generate_crate_metadata | 97.10ms | 1.499 | 646.62ms | 1 |
| param_env | 89.74ms | 1.385 | 115.86ms | 8690 |
| hir_lowering | 88.33ms | 1.363 | 88.33ms | 1 |
| check_mod_privacy | 84.71ms | 1.308 | 85.61ms | 135 |
| LLVM_passes | 83.95ms | 1.296 | 83.95ms | 1 |
| specialization_graph_of | 66.81ms | 1.031 | 101.17ms | 145 |

Total cpu time: 6.478858514s
Filtered results account for 79.012% of total time.

`default = ["32-column-tables"]`
| Item | Self time | % of total time | Time | Item count |
|---|---|---|---|---|
| typeck | 5.37s | 23.476 | 5.75s | 2249 |
| expand_crate | 3.18s | 13.897 | 3.19s | 1 |
| mir_borrowck | 2.12s | 9.262 | 4.86s | 2249 |
| check_item_well_formed | 1.59s | 6.954 | 1.80s | 6224 |
| evaluate_obligation | 667.61ms | 2.920 | 675.63ms | 47627 |
| mir_built | 666.40ms | 2.915 | 969.31ms | 2249 |
| check_mod_item_types | 569.95ms | 2.493 | 593.27ms | 135 |
| check_impl_item_well_formed | 560.54ms | 2.451 | 792.59ms | 3174 |
| type_op_prove_predicate | 553.23ms | 2.420 | 554.12ms | 32054 |
| mir_drops_elaborated_and_const_checked | 538.08ms | 2.353 | 851.83ms | 2249 |
| check_mod_privacy | 523.02ms | 2.287 | 524.57ms | 135 |
| type_op_ascribe_user_type | 501.22ms | 2.192 | 501.32ms | 5446 |
| optimized_mir | 500.68ms | 2.190 | 1.59s | 2436 |
| normalize_projection_ty | 420.51ms | 1.839 | 420.59ms | 10000 |
| resolve_crate | 401.20ms | 1.755 | 401.20ms | 1 |
| specialization_graph_of | 324.67ms | 1.420 | 361.09ms | 145 |
| hir_lowering | 318.75ms | 1.394 | 318.75ms | 1 |
| param_env | 238.53ms | 1.043 | 275.20ms | 10098 |

Total cpu time: 22.865035499s
Filtered results account for 83.261% of total time.

`default = ["64-columns-table"]`
| Item | Self time | % of total time | Time | Item count |
|---|---|---|---|---|
| expand_crate | 26.75s | 25.100 | 26.90s | 1 |
| typeck | 22.37s | 20.987 | 23.25s | 3113 |
| mir_borrowck | 9.09s | 8.528 | 20.22s | 3113 |
| check_item_well_formed | 6.22s | 5.835 | 6.79s | 7632 |
| check_mod_privacy | 5.16s | 4.840 | 5.16s | 135 |
| type_op_ascribe_user_type | 2.59s | 2.427 | 2.59s | 19638 |
| specialization_graph_of | 2.44s | 2.289 | 2.51s | 145 |
| type_op_prove_predicate | 2.43s | 2.277 | 2.43s | 98310 |
| evaluate_obligation | 2.32s | 2.181 | 2.33s | 144843 |
| mir_drops_elaborated_and_const_checked | 2.22s | 2.080 | 3.59s | 3113 |
| mir_built | 2.22s | 2.080 | 3.33s | 3113 |
| check_mod_item_types | 1.98s | 1.860 | 2.01s | 135 |
| check_impl_item_well_formed | 1.97s | 1.846 | 2.47s | 4550 |
| optimized_mir | 1.89s | 1.773 | 6.55s | 3332 |
| resolve_crate | 1.87s | 1.759 | 1.87s | 1 |
| normalize_projection_ty | 1.79s | 1.682 | 1.79s | 29120 |
| hir_lowering | 1.28s | 1.203 | 1.28s | 1 |

Total cpu time: 106.57413693s
Filtered results account for 88.749% of total time.

`default = ["128-columns-table"]`
| Item | Self time | % of total time | Time | Item count |
|---|---|---|---|---|
| expand_crate | 262.36s | 34.810 | 262.37s | 1 |
| typeck | 120.41s | 15.976 | 123.73s | 4841 |
| check_mod_privacy | 74.11s | 9.833 | 74.12s | 135 |
| mir_borrowck | 50.40s | 6.687 | 112.08s | 4841 |
| check_item_well_formed | 44.92s | 5.960 | 49.00s | 10448 |
| specialization_graph_of | 21.17s | 2.809 | 21.21s | 145 |
| type_op_ascribe_user_type | 16.66s | 2.210 | 16.66s | 75670 |
| type_op_prove_predicate | 14.85s | 1.970 | 14.85s | 350630 |
| check_mod_item_types | 14.15s | 1.878 | 14.20s | 135 |
| check_impl_item_well_formed | 12.44s | 1.650 | 14.38s | 7302 |
| evaluate_obligation | 11.53s | 1.530 | 11.54s | 517451 |
| normalize_projection_ty | 10.78s | 1.431 | 10.78s | 101152 |
| mir_drops_elaborated_and_const_checked | 10.41s | 1.382 | 18.18s | 4841 |
| mir_built | 9.04s | 1.199 | 14.42s | 4841 |
| resolve_crate | 8.13s | 1.079 | 8.13s | 1 |
| optimized_mir | 7.97s | 1.058 | 32.16s | 5124 |

Total cpu time: 753.685691273s
Filtered results account for 91.461% of total time.
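
The "Total cpu time" lines of the four runs above are enough to quantify how the cost grows per doubling of the maximum tuple size. The following is a rough sketch only: the function names are made up for this example, and reading the result as an exponent k in `time ~ size^k` is an assumption that merely summarizes each step, not a model of the compiler.

```rust
/// Total CPU time in seconds, copied from the four `summarize` outputs,
/// keyed by the maximum supported tuple size.
const TOTALS: [(u32, f64); 4] = [
    (16, 6.478858514),
    (32, 22.865035499),
    (64, 106.57413693),
    (128, 753.685691273),
];

/// For each doubling of the tuple size, returns (from, to, growth factor, exponent).
/// Because the size exactly doubles, log2 of the factor is the exponent k
/// that `time ~ size^k` would need to explain that single step.
fn doubling_factors(totals: &[(u32, f64)]) -> Vec<(u32, u32, f64, f64)> {
    totals
        .windows(2)
        .map(|w| {
            let (from, t0) = w[0];
            let (to, t1) = w[1];
            let factor = t1 / t0;
            (from, to, factor, factor.log2())
        })
        .collect()
}

/// Prints one line per doubling step.
pub fn print_doubling_factors() {
    for (from, to, factor, k) in doubling_factors(&TOTALS) {
        println!("{from:>3} -> {to:>3}: x{factor:.2} (exponent ~{k:.2})");
    }
}
```

The factors come out at roughly 3.5x, 4.7x and 7.1x, so every step costs more than the previous one, and the growth is faster than linear throughout.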

As the raw numbers are quite hard to compare, here is a comparison of all passes that use more than 5% of the compilation time in at least one of the runs:

| Pass | 16 | 32 | 64 | 128 |
|---|---|---|---|---|
| typeck | 18.973% | 23.476% | 20.987% | 15.976% |
| expand_crate | 11.263% | 13.897% | 25.1% | 34.81% |
| mir_borrowck | 8.93% | 9.262% | 8.528% | 6.687% |
| check_item_well_formed | 5.075% | 6.954% | 5.835% | 5.96% |
| check_mod_privacy | 1.308% | 2.287% | 4.84% | 9.833% |

I would have expected that, if the number of tuple elements only influenced the amount of work put into each query, those relative numbers would stay at the same level. Instead, it seems like a few passes become much more important for large tuple sizes, namely expand_crate (expanding the large __diesel_for_each_tuple!(tuple_impls); macro call?) and check_mod_privacy (unsure why this happens). Given those points, I'm not sure how useful it would be to just use the faster, smaller variant as the benchmark.
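
The shares can be recomputed from the self times and totals in the `summarize` output, which makes it easy to check which passes keep gaining weight. This is a small sketch with made-up helper names. The self times are the rounded values printed above, so the results can differ from the reported percentages in the second decimal place.

```rust
/// Total CPU time (seconds) for the 16, 32, 64 and 128 element configurations.
const TOTAL: [f64; 4] = [6.478858514, 22.865035499, 106.57413693, 753.685691273];

/// Self time (seconds) of selected queries, in the same order as `TOTAL`.
const PASSES: [(&str, [f64; 4]); 3] = [
    ("typeck", [1.23, 5.37, 22.37, 120.41]),
    ("expand_crate", [0.72972, 3.18, 26.75, 262.36]),
    ("check_mod_privacy", [0.08471, 0.52302, 5.16, 74.11]),
];

/// Share of the total time, in percent, for each configuration.
fn shares(self_times: &[f64; 4]) -> [f64; 4] {
    let mut out = [0.0; 4];
    for i in 0..4 {
        out[i] = 100.0 * self_times[i] / TOTAL[i];
    }
    out
}

/// True if the share increases with every doubling of the tuple size,
/// i.e. the pass grows faster than the compilation as a whole.
fn share_keeps_growing(self_times: &[f64; 4]) -> bool {
    shares(self_times).windows(2).all(|w| w[1] > w[0])
}

/// Prints the shares and whether they grow monotonically.
pub fn print_shares() {
    for (name, times) in PASSES {
        let s = shares(&times);
        println!(
            "{name:<18} {:6.3}% {:6.3}% {:6.3}% {:6.3}%  growing: {}",
            s[0], s[1], s[2], s[3],
            share_keeps_growing(&times)
        );
    }
}
```

Of the three queries listed, only expand_crate and check_mod_privacy gain share at every step. The typeck share rises once and then falls, even though its absolute time still grows.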

@Mark-Simulacrum
Member

Yeah, I'm unsure too. I am still not comfortable merging this PR with the current variant: it just adds too much time, and your results seem to indicate unexpectedly non-linear scaling in the time allocated to various passes. We might just not be able to pull off diesel without first speeding up the compiler or doing more work on the collection infra.

@weiznich
Contributor Author

I totally understand that this comes with a large time cost, but on the other hand it seems like those two passes only matter above a certain threshold of generated code size. That seems to be something that cannot be reproduced in a smaller test case, as far as I can see.

@weiznich
Contributor Author

weiznich commented Jan 7, 2021

@Mark-Simulacrum Any new insights or suggestions on how to proceed with this problem?

@Mark-Simulacrum
Member

I've recently added metrics tracking for the queue length on perf.rust-lang.org, and I hope that in 1-2 weeks when we have some data there we can merge this and see if it has any significant impact (and then make a more informed decision).

@Mark-Simulacrum
Member

Ok, we have some initial data from the queue length metrics, and it looks like we have, relatively speaking, some downtime and aren't constantly fighting to keep up. I'm going to go ahead and merge this PR, but may back it out if the queue times end up being too long or similar.

These are real programs that are known to stress the compiler in interesting
ways.

- **diesel**: A type save SQL query builder. Utilizes the type system to

@SunnyWar SunnyWar Mar 10, 2021

should say, "A type safe SQL query builder."
