Hide 'align' instruction behind jmp #60787
Conversation
Tagging subscribers to this area: @JulieLeeMSFT
@dotnet/jit-contrib
Some additional questions/comments:
- Is there a measured perf difference? Or is this speculative, and we'll see what the lab shows?
- What if there are many "jmp" before the loop to align? Maybe the first one is a poor choice because it is in a hot loop. Or maybe the only one prior is in a hot loop. Maybe we shouldn't move it then, due to impacting the I-cache? Should we look at block weights to decide?
- It seems like there should be an "alignment planning" pass that decides where to put the alignment instructions, and annotates the BasicBlocks with those decisions. It could both set a flag that an alignment instruction is needed, and include a pointer to the BasicBlock for the loop we are aligning. Then, the codegen loop would just act on those decisions, and be very simple. Interleaving the planning and codegen loops seems complicated. It seems like this might remove the need for the alignBBList list.
- Is there an end-to-end design written down in comments in one place? It seems like there should be.
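To illustrate the suggestion, here is a rough, self-contained sketch of what such a planning pass could look like. The types and field names (`endsWithUncondJmp`, `isAlignedLoopHead`, `emitAlignAfterMe`, `alignTarget`) are hypothetical simplifications for illustration, not the real JIT structures:

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-in for the JIT's BasicBlock.
struct BasicBlock
{
    bool        endsWithUncondJmp = false;   // block ends in an unconditional jmp
    bool        isAlignedLoopHead = false;   // loop head marked for alignment upstream
    bool        emitAlignAfterMe  = false;   // planning decision: emit 'align' after this block
    BasicBlock* alignTarget       = nullptr; // the loop head this alignment is for
};

// Sketch of an "alignment planning" pass: walk the block list once and, for
// each loop head needing alignment, record where its 'align' should be
// emitted -- behind the closest preceding unconditional jmp if one exists,
// otherwise on the block immediately before the loop head. Codegen would then
// simply act on these annotations.
void PlaceLoopAlignInstructions(std::vector<BasicBlock>& blocks)
{
    BasicBlock* lastJmp = nullptr;
    BasicBlock* prev    = nullptr;
    for (BasicBlock& bb : blocks)
    {
        if (bb.isAlignedLoopHead && (prev != nullptr))
        {
            BasicBlock* host       = (lastJmp != nullptr) ? lastJmp : prev;
            host->emitAlignAfterMe = true;
            host->alignTarget      = &bb;
            lastJmp                = nullptr; // don't reuse the same jmp for a later loop
        }
        if (bb.endsWithUncondJmp)
        {
            lastJmp = &bb;
        }
        prev = &bb;
    }
}
```

Block weights could additionally be consulted at the `host` selection point to skip hot hosts, per the question above.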
src/coreclr/jit/lclvars.cpp
Outdated
It seems very strange to put this code in this function, since it has nothing to do with ref counts. Why is it here?
I didn't want to do another iteration over the BasicBlocks because, for long methods, it can be costly. With that said, I have created a separate phase for align placement where I now iterate over the BasicBlocks. However, there is still room for improvement there. In my latest code (which I will push shortly), bbPlaceLoopAlignInstructions(), I want to skip the pass if there are no loops to align. It turns out that the only reliable way to check that after lowering is to check for BB->isLoopAlign(). I would really like to piggy-back on this ref counting method because it is the last thing executed before codegen, the state of BBF_LOOP_ALIGN is accurate at that point, and we would save iterating over the BasicBlocks again.
A simple piece of logic in this method, like the line below, will help avoid iterating over the basic block list in scenarios where we add loop alignment initially but then remove it during flow graph analysis because of loops with calls, loop unrolling, compacting, etc. Currently, I have added a needsLoopAlignment flag that gets set whenever we mark a loop as BBF_LOOP_ALIGN, but it is never unset if we unmark a loop for alignment.
needsLoopAlignment |= block->isLoopAlign();
Could you keep a count of aligned loops and decrement the count when you remove a BBF_LOOP_ALIGN bit, then just check for "loopAlignCount > 0" before running the PlaceLoopAlignment phase?
It's really unpleasant to have unrelated phases tied together in a somewhat implicit contract.
Could you keep a count of aligned loops and decrement the count when you remove a BBF_LOOP_ALIGN bit, then just check for "loopAlignCount > 0" before running the PlaceLoopAlignment phase?
Yes, that's exactly what I tried doing, but it turns out that sometimes a block/loop is marked as not needing alignment multiple times, especially in AddContainsCallAllContainingLoops(), which miscalculates the count.
in AddContainsCallAllContainingLoops() that miscalculates the count.
Why would it do that? Once you've cleared the bit, you wouldn't decrement again. e.g.,
void Compiler::AddContainsCallAllContainingLoops(unsigned lnum)
{
#if FEATURE_LOOP_ALIGN
    // If this is the innermost loop, reset the LOOP_ALIGN flag
    // because a loop containing a call is unlikely to benefit from
    // alignment.
    if (optLoopTable[lnum].lpChild == BasicBlock::NOT_IN_LOOP)
    {
        BasicBlock* first = optLoopTable[lnum].lpFirst;
        if (first->isLoopAlign())
        {
            assert(compAlignedLoopCount > 0);
            --compAlignedLoopCount;
            first->bbFlags &= ~BBF_LOOP_ALIGN;
            JITDUMP("Removing LOOP_ALIGN flag for " FMT_LP " that starts at " FMT_BB " because loop has a call.\n",
                    lnum, first->bbNum);
        }
    }
#endif
    ...
src/coreclr/jit/emit.cpp
Outdated
Is it possible to implement this goal without creating an emitPrevIG "global"?
src/coreclr/jit/emitxarch.cpp
Outdated
Remove comment? Or uncomment?
I intentionally left it commented out so that during debugging we can uncomment it and see the instruction. Let me know if you think otherwise.
Force-pushed 75eb833 to 0aefa51
src/coreclr/jit/codegen.h
Outdated
Need to delete this.
@BruceForstall - This should be ready for review. Here are the changes:
CodeSize diffs: https://dev.azure.com/dnceng/public/_build/results?buildId=1464807&view=ms.vss-build-web.run-extensions-tab
I analyzed some of the PerfScore regressions on windows/arm64 for coreclr_tests, and they all come from moving the align instructions behind jmp.
Should PerfScore not count align instructions in an IG following an unconditional branch?
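One way to model this question: charge `align` padding only when the IG containing it is reachable by fall-through. This is a hypothetical, self-contained model — the `InsGroup` fields and the 0.25 per-byte cost are made up for illustration, not the emitter's real accounting:

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of an instruction group for the PerfScore question.
struct InsGroup
{
    bool   prevEndsWithUncondBranch; // previous IG cannot fall through into this one
    int    alignBytes;               // bytes of 'align' padding in this IG
    double otherInsScore;            // score of the remaining instructions
};

// Sketch: skip the padding cost when the align can never be fetched/executed,
// i.e. when the IG sits behind an unconditional branch.
double PerfScore(const std::vector<InsGroup>& igs)
{
    double score = 0.0;
    for (const InsGroup& ig : igs)
    {
        score += ig.otherInsScore;
        if (!ig.prevEndsWithUncondBranch)
        {
            score += ig.alignBytes * 0.25; // assumed per-byte decode cost for nops
        }
    }
    return score;
}
```

Under this model, an align hidden behind an unconditional branch contributes nothing, so the arm64 regressions above would disappear from the score.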
A few questions
src/coreclr/jit/emitxarch.cpp
Outdated
Remove?
I left it commented intentionally so we can quickly uncomment and check it in disassembly. If you still prefer, I can delete it.
Force-pushed 0aefa51 to 12c592d
Let me see if it can be done easily.
Thanks for the suggestion. This gives real data about PerfScore, which is much nicer.
Commits in this force-push:
- fix the alignBytesRemoved
- Some fixes and working model
- Some fixes and redesign
- Some more fixes
- more fixes
- fix
- Add the check for fgFirstBB
- misc changes
- code cleanup + JitHideAlignBehindJmp switch
- validatePadding only if align are before the loop IG
- More cleanup, remove commented code
- jit format
- …st the targetIG to prevIG
- Add IGF_REMOVED_ALIGN flag for special scenarios
Force-pushed d965663 to 57759d0
@BruceForstall - Can you review it again? I think I have addressed all the feedback.
LGTM. One possible follow-up.
Co-authored-by: Bruce Forstall <brucefo@microsoft.com>

Overview
With current loop alignment, `align` instructions are placed just before the loop start. This can sometimes affect performance adversely if the block in which the `align` instruction is placed is hot, or is part of a nested loop, in which case the processor would perform fetching and decoding of the `nop`s. This PR places the `align` instructions behind unconditional `jmp` instructions if any exist before the loop it is trying to align. If no such `jmp` is present, then `align` is placed right before the loop start, the way it is done today.
Here is the sample diff where the `align` instruction was moved from `IG31` and placed into `IG29`, after the `jmp` instruction.
I have also added `COMPlus_JitHideAlignBehindJmp` to turn off this feature for debugging purposes. In Release, it is always ON. I have also added a stress mode where, 50% of the time, we emit `INS_BREAKPOINT` instead of `align` in situations where the `align` is placed behind a `jmp`, to make sure that we never execute it.
Design
A new data structure `alignBlocksList` is created, which is a linked list of all the `BasicBlock`s that are heads of loops needing alignment. This is done during the final ref counting phase. During codegen, we pull the basic block from `alignBlocksList` and monitor whether there is any unconditional `jmp`. If we find one, we emit the `align` instruction and set the BB flag `BBF_LOOP_ALIGN_ADDED`. This makes sure that if we see more `jmp`s before the actual loop start, we do not add `align` instructions again. When we reach a point in the flow where the next block is the loop start, we update the `targetIG` (see below) of the `alignInstr`. At this time, if we have not seen any `jmp` so far (tracked using `BBF_LOOP_ALIGN_ADDED`), we emit the `align` instruction here instead. Finally, we move to the next `BasicBlock` of `alignBlocksList`.
The `instrDescAlign` data structure has been updated. The `idaIG` field in it now points to the IG that contains the `align` instruction; this can be the IG just before the loop, or a previous IG that ends with a `jmp`. `idaTargetIG` is what `idaIG` used to be: it points to the IG before the IG that has the loop, and is used when we want to calculate the `loopSize`. Some of the changes are around getting the right field wherever necessary.
The `IGF_LOOP_ALIGN` flag, which previously was on the IG just before the loop IG, has been replaced by `IGF_HAS_ALIGN` and is now on the IG that contains the `align` instruction. Again, this may or may not be the one just before the loop IG. Finally, to handle special scenarios where an IG that is part of a loop might hold an `align` instruction for a different IG, the flag `IGF_REMOVED_ALIGN` has been added; it tells whether the `align` instructions present in that IG have been removed.
Impact
Ideally, this change should have produced no code size diffs. The reason behind the diffs on x64 is that, because of moving the `align` instruction, we tend to shorten some jumps, and that changes the alignment heuristics calculation. However, as seen below, the impact is minimal, and the number of unchanged methods far exceeds the number of affected methods.
As expected, there is no code size difference on arm64, because we just moved the `align` instruction around.
Detail diffs: https://gist.github.com/kunalspathak/9cc028b60a2e7aba82308fa1e94951ba
Contributes to #43227
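The codegen walk described in the Design section can be modeled with a small, self-contained toy. `EmitWithHiddenAlign` and its single-character block encoding (`J` = block ending in an unconditional jmp, `B` = ordinary block, `L` = loop head needing alignment) are invented for illustration; the `alignAdded` flag plays the role of `BBF_LOOP_ALIGN_ADDED`:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of the codegen walk: emit blocks in order, hiding the 'align'
// for an upcoming loop behind the first unconditional jmp seen, and falling
// back to placing it right before the loop head when no jmp precedes it.
std::vector<std::string> EmitWithHiddenAlign(const std::string& blocks)
{
    std::vector<std::string> out;
    bool alignAdded = false; // models BBF_LOOP_ALIGN_ADDED
    for (size_t i = 0; i < blocks.size(); ++i)
    {
        char b = blocks[i];
        // Is there still a loop head ahead that needs alignment?
        bool loopAhead = blocks.find('L', i) != std::string::npos;
        if (b == 'L')
        {
            if (!alignAdded)
            {
                out.push_back("align"); // no jmp before the loop: old placement
            }
            alignAdded = false;          // done with this alignBlocksList entry
            out.push_back("L");
            continue;
        }
        out.push_back(std::string(1, b));
        if (b == 'J' && loopAhead && !alignAdded)
        {
            out.push_back("align"); // hide the padding behind the jmp
            alignAdded = true;       // don't emit again for later jmps
        }
    }
    return out;
}
```

For example, "JBL" emits the align right after the jmp, "BL" falls back to placing it just before the loop, and the second jmp in "JJL" does not get a duplicate align.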