KEMBAR78
Add stm327xx-hal benchmark by jyn514 · Pull Request #802 · rust-lang/rustc-perf · GitHub
Skip to content

Conversation

@jyn514
Copy link
Member

@jyn514 jyn514 commented Nov 30, 2020

See rust-lang/rust#77700 (comment), rust-lang/rust#78761. This is mostly a rustdoc benchmark but I wouldn't be surprised if it stresses rustc too.

@jyn514
Copy link
Member Author

jyn514 commented Dec 1, 2020

🎉
image

@jyn514
Copy link
Member Author

jyn514 commented Dec 1, 2020

2020-12-01T01:59:36.6299955Z    Compiling stm32h7xx-hal v0.8.0 (/tmp/.tmpJWqWGn)
2020-12-01T01:59:39.2718951Z thread 'rustc' panicked at 'forcing query with already existing `DepNode`
2020-12-01T01:59:39.2720828Z - query-key: (DefId(0:580 ~ stm32h7xx_hal[bad2]::pwm::Pins), Some(Channel#294))
2020-12-01T01:59:39.2722764Z - dep-node: super_predicates_that_define_assoc_type(26b07274ba21784b-b3d920d48bce2e7b)', /rustc/b7ebc6b0c1ba3c27ebb17c0b496ece778ef11e18/compiler/rustc_query_system

lol, now rust-lang/rust#79560 is affecting rustc-perf too 😆

@camelid
Copy link
Member

camelid commented Dec 1, 2020

🎉
image

Wow, does that mean you bugadani got a 63% (!) improvement in rustdoc time in rust-lang/rust#77700. (Also, wow that's a cool PR number :)

@jyn514
Copy link
Member Author

jyn514 commented Dec 1, 2020

In instruction counts, on a specific benchmark that stresses that part of rustdoc, but yes :)

@camelid
Copy link
Member

camelid commented Dec 1, 2020

Impressive! Congratulations and thanks @bugadani :)

@bugadani
Copy link

bugadani commented Dec 1, 2020

Well, the idea and basically a review comment for every character belongs to @jyn514 :)

@jyn514
Copy link
Member Author

jyn514 commented Dec 5, 2020

This should hopefully be working now that rustdoc -Z self-profile prints the crate name.

@est31
Copy link
Member

est31 commented Dec 6, 2020

I wouldn't be surprised if it stresses rustc too.

It does stress the impl block overlap pass of rustc, which is usually not very relevant, but in this instance it is. In the past, I used this crate to locally assess the improvements of my PR: rust-lang/rust#78323 (comment)

It would be amazing to have it be part of the benchmark suite.

@bugadani
Copy link

bugadani commented Dec 6, 2020

I wouldn't be surprised if it stresses rustc too.

It does stress the impl block overlap pass of rustc, which is usually not very relevant, but in this instance it is. In the past, I used this crate to locally assess the improvements of my PR: rust-lang/rust#78323 (comment)

It would be amazing to have it be part of the benchmark suite.

AFAIK that's a different kine of STM devices. F4 and H7 series are significantly different. Probably comparably horrific, though.

@est31
Copy link
Member

est31 commented Dec 6, 2020

@bugadani yeah they are different devices but I think their crates are structured more or less the same, so cause the compiler to run into the equal situation. IDK why I've taken stm32f4 instead of one of the other devices but I don't think it matters much.

@bugadani
Copy link

bugadani commented Dec 6, 2020

In the past, I used this crate
My sleepy brain misinterpreted this sentence, sorry.

It probably matters little, which is used. But the embedded world will be happy for any improvements that affect these (and a lot more device support crates, actually)!

@est31
Copy link
Member

est31 commented Dec 6, 2020

It probably matters little, which is used

Indeed, I don't really care which crate is added, as long as it has tons of inherent impl blocks on the same type :).

@jyn514
Copy link
Member Author

jyn514 commented Dec 12, 2020

@Mark-Simulacrum I think this is ready to go, is it waiting on anything from me? Or did you want to add diesel instead (#807) instead of adding both or just stm32?

@Mark-Simulacrum
Copy link
Member

This is adding 10 minutes to the check CI builder - from 20 minutes to 30 with this PR. Some of that is likely noise, but this is too large a delta for the perf server; I don't know how easy it would be to cut down the benchmark here but as-is I cannot justify spending that much time on just one crate on the perf server.

@jyn514
Copy link
Member Author

jyn514 commented Dec 14, 2020

@adamgreig do you have suggestions for making this smaller while still stressing a large number of blanket impls? Right now I'm only passing --features=stm32h742 and that still takes as long as half the other benchmarks combined (as long as script-servo if that means anything to you).

@adamgreig
Copy link

adamgreig commented Dec 15, 2020

Do you mean --features=stm32h743 (not 2)? I don't fully understand all the files in the PR but it still seems to have checks for each feature. The stm32h743 feature is the smallest, but if it's still too big you could just delete large chunks of it, there are no inter-dependencies. The mdma, otg1_hs_host, and dfsdm modules are the largest. You'd have to delete the module file and directory, and then delete relevant lines from mod.rs until it builds, but if this is just hosting a copy of the code that hopefully shouldn't be too bad?

By the way, it's possible to generate the crate all in a single mod.rs file instead of one file per module, if that would be useful for this PR.

Edit: another option is to use a smaller crate in the same family, e.g. stm32f4 or stm32f0, which use the same code generator but have fewer items. I expect they'd still stress the large number of blanket impls, though, unless that's coming from something especially pathological in stm32h7.

@jyn514
Copy link
Member Author

jyn514 commented Dec 15, 2020

Thanks! I was passing 742, I'll try 743 and see how much of a difference it makes. From some naive testing locally though it looks like most of the time is spent compiling stm32h7 itself, which I don't think can be cut down much :/ I'll try stm32f4 next and see if it helps.

@adamgreig
Copy link

Ah, sorry, I've misread and thought this was building stm32h7 rather than stm32h7xx-hal and my suggestions were in that direction. Not sure what you could do to cut down stm32h7xx-hal build times, but maybe @richardeoin will have ideas? I assume the slow documentation is in stm32h7xx-hal rather than stm32h7 itself?

@jyn514
Copy link
Member Author

jyn514 commented Dec 15, 2020

On my fastest machine stm32h7 takes 35 seconds in --release mode and stm32h743 takes 15.6 (for comparison, syn takes 5 seconds). The difference between stm32h743 and 742 is negligible. Rustdoc on the other hand takes 40 seconds for stm32h7 and a whopping 96.4 for stm32h743.

That makes me hopeful that we can fix this just by merging rust-lang/rust#77700.

@Mark-Simulacrum
Copy link
Member

Those times are still likely not tenable for perf, FWIW - syn takes over a minute, and we're not really in a position to add another 3-4 minutes to our duration right now.

@est31
Copy link
Member

est31 commented Dec 15, 2020

do you have suggestions for making this smaller while still stressing a large number of blanket impls? Right now I'm only passing --features=stm32h742 and that still takes as long as half the other benchmarks combined

The stm32f4 crate gives you finer control over the number of blanket impls. See the table here. The comment also contains instructions on how to create snythetic test cases which might be faster overall.

@jyn514
Copy link
Member Author

jyn514 commented Dec 18, 2020

@est31 it sounds like you've already done a lot of work in this area, are you interested in taking the PR over?

@est31
Copy link
Member

est31 commented Dec 18, 2020

@jyn514 sure can do that

@est31
Copy link
Member

est31 commented Dec 18, 2020

@jyn514 as simulacrum doesn't want to merge things that add a minute, maybe instead of this PR we add a simple stress test?

@Mark-Simulacrum
Copy link
Member

One minute is fine. Adding 3+ minutes is not currently tenable :)

@jyn514
Copy link
Member Author

jyn514 commented Dec 18, 2020

A stress test sounds fine to me :) I just want to make sure this is measured somehow.

@therealprof
Copy link

@est31 Any progress? If we wanted to add a full compile test instead of a stress-test we could also use a slightly simpler HAL impl than F7, I'm already using https://github.com/stm32-rs/stm32f0xx-hal to do compile time and code size (which would be a cool feature to add to rustc-perf, too) comparisons.

jyn514 added a commit to jyn514/rust that referenced this pull request Mar 7, 2021
On crates with many blanket impls, such as `stm32`, this inner closure
is executed many hundreds of thousands of times. This makes it very
slightly faster.

Unfortunately, I don't have a good test case for showing that it's
faster until rust-lang/rustc-perf#802 is merged.
jyn514 added 4 commits March 7, 2021 12:24
Before:

```
collector error: Failed to benchmark 'stm32f4', recorded: failed to obtain pkgid in '"/tmp/.tmpLF63db"'
```

After:
```
collector error: Failed to benchmark 'stm32f4', recorded: failed to obtain pkgid in '"/tmp/.tmpAmktlm"': expected success, got exit code: 101

stderr=error: a Cargo.lock must exist for this command

 stdout=
```
This confuses rustc-fake, which expects to only be run when profiled (on
the code being timed, not on dependencies).
@jyn514
Copy link
Member Author

jyn514 commented Mar 7, 2021

One minute is fine. Adding 3+ minutes is not currently tenable :)

@Mark-Simulacrum I changed this to use stm32f4, which takes 18 seconds on a doc run and 106 seconds overall. Is that fast enough?

@Mark-Simulacrum
Copy link
Member

Oh, forgot to copy/paste #807 (comment) here - I'm hoping to land diesel first most likely, but largely land these two benchmarks and evaluate based on the metrics we're now collecting where things stand (and either back things out or not, depending on the effects on queue depth).

@Mark-Simulacrum Mark-Simulacrum merged commit c52ee62 into rust-lang:master Mar 11, 2021
@Mark-Simulacrum
Copy link
Member

Merging this for now, and we can evaluate whether to keep it in depending on how the queue does. So far the addition of diesel looks to have been a minor increase in overlap, leading to slightly longer queues, but my recollection of queueing theory suggests that there's going to be a steep cliff rather than gradually getting worse.

@jyn514 jyn514 deleted the stm32 branch March 11, 2021 14:19
bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 29, 2021
rustdoc: Don't enter an infer_ctxt in get_blanket_impls for impls that aren't blanket impls

Less broken version of rust-lang#82856.

get_blanket_impls is a *very* hot region of rustdoc, so even small changes like this should help. Unfortunately I don't have benchmarks for this until rust-lang/rustc-perf#802 is merged.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants