Update readme by gaoteng-git · Pull Request #306 · pytorch/kineto · GitHub

@gaoteng-git
Contributor

No description provided.

For example, a kernel with only one thread per block can’t fully utilize each SM.

* Est. Achieved Occupancy: The bigger, the better. The definition of occupancy is [here](https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedoccupancy.htm).
* Est. Achieved Occupancy: For most cases such as memory bandwidth bounded kernels, the higher the better. The definition of occupancy is [here](https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedoccupancy.htm).
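The claim that a kernel with one thread per block cannot fully utilize an SM can be made concrete with a small calculation. What follows is a minimal Python sketch of a *theoretical* occupancy estimate under stated assumptions. It is not the TensorBoard plugin's "Est. Achieved Occupancy" formula. The per-SM limits are assumed values that match NVIDIA compute capability 7.0 (for example V100), and register and shared-memory limits are ignored. The name `occupancy_sketch` is hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    # Assumed per-SM limits. Real values depend on the GPU architecture;
    # these match NVIDIA compute capability 7.0 (e.g. V100).
    max_warps: int = 64
    max_blocks: int = 32
    warp_size: int = 32


def occupancy_sketch(threads_per_block: int, limits: SMLimits = SMLimits()) -> tuple[float, float]:
    """Return (theoretical occupancy, fraction of thread slots in use).

    Hypothetical helper: only the resident-block and resident-warp limits
    are considered; register and shared-memory limits are ignored.
    """
    if threads_per_block <= 0:
        raise ValueError("threads_per_block must be positive")
    # A block always occupies whole warps, even with fewer than 32 threads.
    warps_per_block = math.ceil(threads_per_block / limits.warp_size)
    if warps_per_block > limits.max_warps:
        return 0.0, 0.0  # block too large to be resident at all
    # Resident blocks are capped by both the block limit and the warp limit.
    blocks_per_sm = min(limits.max_blocks, limits.max_warps // warps_per_block)
    occupancy = blocks_per_sm * warps_per_block / limits.max_warps
    thread_slots = limits.max_warps * limits.warp_size
    thread_fraction = blocks_per_sm * threads_per_block / thread_slots
    return occupancy, thread_fraction


for tpb in (1, 32, 256):
    occ, used = occupancy_sketch(tpb)
    print(f"{tpb:>4} threads/block: occupancy={occ:.3f}, thread slots used={used:.3f}")
```

Under these assumed limits, one thread per block hits the 32-block cap. That gives 50% warp occupancy, while only 32 of 2048 thread slots do any work. With 256 threads per block, all 64 warp slots can be filled. Achieved occupancy measured at runtime can only be lower than this theoretical bound.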
Contributor

Let's simply call it memory bounded instead of memory bandwidth bounded.

Contributor

I am actually not quite sure how to make this clear for occupancy. The core idea behind occupancy is to over-subscribe the GPU to hide latency, since switching between warps is very cheap. Once the target occupancy is reached, increasing it further does not make things better.

See page 13 of "GPU Performance Analysis and Optimization".

I think we should remove the whole sentence here.

Contributor Author

"Memory bounded" could be interpreted as either size-bounded or bandwidth-bounded, so I prefer to add "bandwidth".

gaoteng-git marked this pull request as ready for review on June 17, 2021 09:42
@gaoteng-git
Contributor Author

@cloudhan Nice reference. I've added it as a reference after this sentence; giving users more information to judge for themselves is good.

gaoteng-git merged commit 29c3c46 into pytorch:plugin/0.2 on Jun 17, 2021

3 participants