Hide 'align' instruction behind jmp #60787
Conversation
Tagging subscribers to this area: @JulieLeeMSFT
@dotnet/jit-contrib
Some additional questions/comments:
- Is there a measured perf difference? Or is this speculative, and we'll see what the lab shows?
- What if there are many "jmp" before the loop to align? Maybe the first one is a poor choice because it is in a hot loop. Or maybe the only one prior is in a hot loop. Maybe we shouldn't move it then, due to impacting the I-cache? Should we look at block weights to decide?
- It seems like there should be an "alignment planning" pass that decides where to put the alignment instructions, and annotates the BasicBlocks with those decisions. It could both set a flag that an alignment instruction is needed, and include a pointer to the BasicBlock for the loop we are aligning. Then, the codegen loop would just act on those decisions, and be very simple. Interleaving the planning and codegen loops seems complicated. It seems like this might remove the need for the alignBBList list.
- Is there an end-to-end design written down in comments in one place? It seems like there should be.
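To illustrate the suggestion, here is a rough, self-contained sketch of what such a planning pass could look like. The types and field names (`endsWithUncondJmp`, `isAlignedLoopHead`, `emitAlignAfterMe`, `alignTarget`) are hypothetical simplifications for illustration, not the real JIT structures:

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-in for the JIT's BasicBlock.
struct BasicBlock
{
    bool        endsWithUncondJmp = false;   // block ends in an unconditional jmp
    bool        isAlignedLoopHead = false;   // loop head marked for alignment upstream
    bool        emitAlignAfterMe  = false;   // planning decision: emit 'align' after this block
    BasicBlock* alignTarget       = nullptr; // the loop head this alignment is for
};

// Sketch of an "alignment planning" pass: walk the block list once and, for
// each loop head needing alignment, record where its 'align' should be
// emitted -- behind the closest preceding unconditional jmp if one exists,
// otherwise on the block immediately before the loop head. Codegen would then
// simply act on these annotations.
void PlaceLoopAlignInstructions(std::vector<BasicBlock>& blocks)
{
    BasicBlock* lastJmp = nullptr;
    BasicBlock* prev    = nullptr;
    for (BasicBlock& bb : blocks)
    {
        if (bb.isAlignedLoopHead && (prev != nullptr))
        {
            BasicBlock* host       = (lastJmp != nullptr) ? lastJmp : prev;
            host->emitAlignAfterMe = true;
            host->alignTarget      = &bb;
            lastJmp                = nullptr; // don't reuse the same jmp for a later loop
        }
        if (bb.endsWithUncondJmp)
        {
            lastJmp = &bb;
        }
        prev = &bb;
    }
}
```

Block weights could additionally be consulted at the `host` selection point to skip hot hosts, per the question above.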
src/coreclr/jit/lclvars.cpp
Outdated
It seems very strange to put this code in this function, since it has nothing to do with ref counts. Why is it here?
I didn't want to do another iteration over the BasicBlocks because, for long methods, it can be costly. With that said, I have created a separate phase for align placement where I now iterate over the BasicBlocks. However, there is still room for improvement there. In my latest code (which I will push shortly), bbPlaceLoopAlignInstructions(), I want to skip the pass if there are no loops to align. It turns out that the only reliable way to check that after lowering is to check for BB->isLoopAlign(). I would really like to piggy-back on this ref counting method because it is the last thing executed before codegen, the state of BBF_LOOP_ALIGN is accurate at that point, and we would save iterating over the BasicBlocks again.
A simple piece of logic in this method, like the line below, will help avoid iterating over the basic block list in scenarios where we add loop alignment initially but then remove it during flow graph analysis because of loops with calls, loop unrolling, compacting, etc. Currently, I have added a needsLoopAlignment flag that gets set whenever we mark a loop as BBF_LOOP_ALIGN, but it is never unset if we unmark a loop for alignment.
needsLoopAlignment |= block->isLoopAlign();
Could you keep a count of aligned loops and decrement the count when you remove a BBF_LOOP_ALIGN bit, then just check for "loopAlignCount > 0" before running the PlaceLoopAlignment phase?
It's really unpleasant to have unrelated phases tied together in a somewhat implicit contract.
Could you keep a count of aligned loops and decrement the count when you remove a BBF_LOOP_ALIGN bit, then just check for "loopAlignCount > 0" before running the PlaceLoopAlignment phase?
Yes, that's exactly what I tried doing, but it turns out that sometimes a block/loop is marked as not needing alignment multiple times, especially in AddContainsCallAllContainingLoops(), which miscalculates the count.
in AddContainsCallAllContainingLoops() that miscalculates the count.
Why would it do that? Once you've cleared the bit, you wouldn't decrement again. e.g.,
void Compiler::AddContainsCallAllContainingLoops(unsigned lnum)
{
#if FEATURE_LOOP_ALIGN
    // If this is the innermost loop, reset the LOOP_ALIGN flag
    // because a loop containing a call is unlikely to benefit from
    // alignment.
    if (optLoopTable[lnum].lpChild == BasicBlock::NOT_IN_LOOP)
    {
        BasicBlock* first = optLoopTable[lnum].lpFirst;
        if (first->isLoopAlign())
        {
            assert(compAlignedLoopCount > 0);
            --compAlignedLoopCount;
            first->bbFlags &= ~BBF_LOOP_ALIGN;
            JITDUMP("Removing LOOP_ALIGN flag for " FMT_LP " that starts at " FMT_BB " because loop has a call.\n",
                    lnum, first->bbNum);
        }
    }
#endif
    ...
src/coreclr/jit/emit.cpp
Outdated
Is it possible to implement this goal without creating an emitPrevIG "global"?
src/coreclr/jit/emitxarch.cpp
Outdated
Remove comment? Or uncomment?
I intentionally left it commented out so that during debugging we can uncomment it and see the instruction. Let me know if you think otherwise.
Force-pushed 75eb833 to 0aefa51
src/coreclr/jit/codegen.h
Outdated
Need to delete this.
@BruceForstall - This should be ready for review. Here are the changes:
CodeSize diffs: https://dev.azure.com/dnceng/public/_build/results?buildId=1464807&view=ms.vss-build-web.run-extensions-tab
I analyzed some of the PerfScore regressions on windows/arm64 for coreclr_tests, and they all come from moving the align instructions behind jmp.
Should PerfScore not count align instructions in an IG following an unconditional branch?
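One way to model this question: charge `align` padding only when the IG containing it is reachable by fall-through. This is a hypothetical, self-contained model — the `InsGroup` fields and the 0.25 per-byte cost are made up for illustration, not the emitter's real accounting:

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of an instruction group for the PerfScore question.
struct InsGroup
{
    bool   prevEndsWithUncondBranch; // previous IG cannot fall through into this one
    int    alignBytes;               // bytes of 'align' padding in this IG
    double otherInsScore;            // score of the remaining instructions
};

// Sketch: skip the padding cost when the align can never be fetched/executed,
// i.e. when the IG sits behind an unconditional branch.
double PerfScore(const std::vector<InsGroup>& igs)
{
    double score = 0.0;
    for (const InsGroup& ig : igs)
    {
        score += ig.otherInsScore;
        if (!ig.prevEndsWithUncondBranch)
        {
            score += ig.alignBytes * 0.25; // assumed per-byte decode cost for nops
        }
    }
    return score;
}
```

Under this model, an align hidden behind an unconditional branch contributes nothing, so the arm64 regressions above would disappear from the score.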
A few questions
src/coreclr/jit/emitxarch.cpp
Outdated
Remove?
I left it commented intentionally so we can quickly uncomment and check it in disassembly. If you still prefer, I can delete it.
Force-pushed 0aefa51 to 12c592d
Let me see if it can be done easily.
Thanks for the suggestion. This gives real data about PerfScore, which is much nicer.
Commits in this force-push:
- fix the alignBytesRemoved
- Some fixes and working model
- Some fixes and redesign
- Some more fixes
- more fixes
- fix
- Add the check for fgFirstBB
- misc changes
- code cleanup + JitHideAlignBehindJmp switch
- validatePadding only if align are before the loop IG
- More cleanup, remove commented code
- jit format
- …st the targetIG to prevIG
- Add IGF_REMOVED_ALIGN flag for special scenarios
Force-pushed d965663 to 57759d0
@BruceForstall - Can you review it again? I think I have addressed all the feedback.
LGTM. One possible follow-up.
Co-authored-by: Bruce Forstall <brucefo@microsoft.com>

Overview
With current loop alignment, `align` instructions are placed just before the loop start. This can sometimes affect performance adversely if the block in which the `align` instruction is placed is hot, or is part of a nested loop, in which case the processor would perform fetching and decoding of the `nop`s. This PR places the `align` instructions behind unconditional `jmp` instructions if any exist before the loop it is trying to align. If no such `jmp` is present, then `align` is placed right before the loop start, the way it is done today.
Here is the sample diff where the `align` instruction was moved from `IG31` and placed into `IG29`, after the `jmp` instruction.
I have also added `COMPlus_JitHideAlignBehindJmp` to turn off this feature for debugging purposes. In Release, it is always ON. I have also added a stress mode where, 50% of the time, we emit `INS_BREAKPOINT` instead of `align` in situations where the `align` is placed behind a `jmp`, to make sure that we never execute it.
Design
A new data structure `alignBlocksList` is created, which is a linked list of all the `BasicBlock`s that are heads of loops needing alignment. This is done during the final ref counting phase. During codegen, we pull the basic block from `alignBlocksList` and monitor whether there is any unconditional `jmp`. If we find one, we emit the `align` instruction and set the BB flag `BBF_LOOP_ALIGN_ADDED`. This makes sure that if we see more `jmp`s before the actual loop start, we do not add `align` instructions again. When we reach a point in the flow where the next block is the loop start, we update the `targetIG` (see below) of the `alignInstr`. At this time, if we have not seen any `jmp` so far (tracked using `BBF_LOOP_ALIGN_ADDED`), we emit the `align` instruction here instead. Finally, we move to the next `BasicBlock` of `alignBlocksList`.
The `instrDescAlign` data structure has been updated. The `idaIG` field in it now points to the IG that contains the `align` instruction; this can be the IG just before the loop, or a previous IG that ends with a `jmp`. `idaTargetIG` is what `idaIG` used to be: it points to the IG before the IG that has the loop, and is used when we want to calculate the `loopSize`. Some of the changes are around getting the right field wherever necessary.
The `IGF_LOOP_ALIGN` flag, which previously was on the IG just before the loop IG, has been replaced by `IGF_HAS_ALIGN` and is now on the IG that contains the `align` instruction. Again, this may or may not be the one just before the loop IG. Finally, to handle special scenarios where an IG that is part of a loop might hold an `align` instruction for a different IG, the flag `IGF_REMOVED_ALIGN` has been added; it tells whether the `align` instructions present in that IG have been removed.
Impact
Ideally, this change should have produced no code size diffs. The reason behind the diffs on x64 is that, because of moving the `align` instruction, we tend to shorten some jumps, and that changes the alignment heuristics calculation. However, as seen below, the impact is minimal, and the number of unchanged methods far exceeds the number of affected methods.
As expected, there is no code size difference on arm64, because we just moved the `align` instruction around.
Detail diffs: https://gist.github.com/kunalspathak/9cc028b60a2e7aba82308fa1e94951ba
Contributes to #43227
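The codegen walk described in the Design section can be modeled with a small, self-contained toy. `EmitWithHiddenAlign` and its single-character block encoding (`J` = block ending in an unconditional jmp, `B` = ordinary block, `L` = loop head needing alignment) are invented for illustration; the `alignAdded` flag plays the role of `BBF_LOOP_ALIGN_ADDED`:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of the codegen walk: emit blocks in order, hiding the 'align'
// for an upcoming loop behind the first unconditional jmp seen, and falling
// back to placing it right before the loop head when no jmp precedes it.
std::vector<std::string> EmitWithHiddenAlign(const std::string& blocks)
{
    std::vector<std::string> out;
    bool alignAdded = false; // models BBF_LOOP_ALIGN_ADDED
    for (size_t i = 0; i < blocks.size(); ++i)
    {
        char b = blocks[i];
        // Is there still a loop head ahead that needs alignment?
        bool loopAhead = blocks.find('L', i) != std::string::npos;
        if (b == 'L')
        {
            if (!alignAdded)
            {
                out.push_back("align"); // no jmp before the loop: old placement
            }
            alignAdded = false;          // done with this alignBlocksList entry
            out.push_back("L");
            continue;
        }
        out.push_back(std::string(1, b));
        if (b == 'J' && loopAhead && !alignAdded)
        {
            out.push_back("align"); // hide the padding behind the jmp
            alignAdded = true;       // don't emit again for later jmps
        }
    }
    return out;
}
```

For example, "JBL" emits the align right after the jmp, "BL" falls back to placing it just before the loop, and the second jmp in "JJL" does not get a duplicate align.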