Run inductor micro benchmark on x86 metal runner by huydhn · Pull Request #135042 · pytorch/pytorch · GitHub

Conversation

@huydhn
Contributor

@huydhn huydhn commented Sep 3, 2024

This enables inductor micro benchmark on CPU (x86):

  • Running on AWS metal runner for more accurate benchmark
  • I added a new arch column, which will be either x86_64 or arm64 for CPU, or the GPU name for GPU. We can use this later to differentiate between different setups, e.g. cuda (a100) vs cuda (a10g), or cpu (x86_64) vs cpu (arm64) (see the sketch below)

The next step would be to run this on CPU arm64 and CUDA (a10g).
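
Below is a minimal, hypothetical sketch (not the actual code in this PR; get_arch and write_row are illustrative helpers) of how the new arch value could be derived on the runner and written alongside each benchmark row:

```python
import csv
import platform

import torch

def get_arch() -> str:
    # GPU runs report the device name (e.g. "NVIDIA A100-SXM4-40GB");
    # CPU runs report the machine architecture (x86_64 or arm64).
    if torch.cuda.is_available():
        return torch.cuda.get_device_name(0)
    return platform.machine()

def write_row(path: str, name: str, metric: str, target: float, actual: float,
              dtype: str, device: str, is_model: bool) -> None:
    # Append one result row in the same column order as the CSV shown below.
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(
            [name, metric, target, actual, dtype, device, get_arch(), is_model]
        )
```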

Testing

Here are the CSV results from my test run: https://github.com/pytorch/pytorch/actions/runs/10709344180

```
name,metric,target,actual,dtype,device,arch,is_model
mlp_layer_norm_gelu,flops_utilization,0.8,17.36,bfloat16,cpu,x86_64,False
gather_gemv,memory_bandwidth(GB/s),990,170.80,int8,cpu,x86_64,False
gather_gemv,memory_bandwidth(GB/s),1060,204.78,bfloat16,cpu,x86_64,False
Mixtral-8x7B-v0.1,token_per_sec,175,26.68,int8,cpu,x86_64,True
Mixtral-8x7B-v0.1,memory_bandwidth(GB/s),1130,171.91,int8,cpu,x86_64,True
Mixtral-8x7B-v0.1,compilation_time(s),162,47.36,int8,cpu,x86_64,True
gemv,memory_bandwidth(GB/s),870,236.36,int8,cpu,x86_64,False
gemv,memory_bandwidth(GB/s),990,305.71,bfloat16,cpu,x86_64,False
Llama-2-7b-chat-hf,token_per_sec,94,14.01,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),1253,185.18,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,compilation_time(s),162,74.99,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,token_per_sec,144,25.09,int8,cpu,x86_64,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),957,165.83,int8,cpu,x86_64,True
Llama-2-7b-chat-hf,compilation_time(s),172,70.69,int8,cpu,x86_64,True
layer_norm,memory_bandwidth(GB/s),950,172.03,bfloat16,cpu,x86_64,False
```
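
As a quick sanity check, here is a small hypothetical snippet (assuming the CSV above is saved locally as benchmark_results.csv, an illustrative file name) that compares each actual value against its target:

```python
import csv

with open("benchmark_results.csv") as f:
    for row in csv.DictReader(f):
        target, actual = float(row["target"]), float(row["actual"])
        ratio = actual / target if target else float("nan")
        # e.g. "gemv  memory_bandwidth(GB/s)  27.2% of target (int8, cpu/x86_64)"
        print(f'{row["name"]:<22} {row["metric"]:<28} '
              f'{ratio:6.1%} of target ({row["dtype"]}, {row["device"]}/{row["arch"]})')
```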

@pytorch-bot

pytorch-bot bot commented Sep 3, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135042

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b531007 with merge base 6c37674:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: releng release notes category label Sep 3, 2024
@huydhn huydhn requested a review from yanboliang September 3, 2024 23:39
@huydhn huydhn marked this pull request as ready for review September 3, 2024 23:39
@huydhn huydhn requested a review from a team as a code owner September 3, 2024 23:39
@yanboliang
Contributor

We have to update the pre-defined numbers for CPU as well. What does the dashboard look like after this change? Are CUDA and CPU in the same tab or in separate ones? I'd prefer them to be separate.

@yanboliang
Contributor

We have to update the pre-defined numbers for CPU as well.

I'm ok to do this in a follow-up PR.

@huydhn
Contributor Author

huydhn commented Sep 4, 2024

We have to update the pre-defined numbers for CPU as well. What does the dashboard look like after this change? Are CUDA and CPU in the same tab or in separate ones? I'd prefer them to be separate.

I expect the results to be on the same page at https://hud.pytorch.org/benchmark/llms, but listed under different device types, e.g.:

[Screenshot: the hud.pytorch.org/benchmark/llms dashboard with results listed under different device types]


```
test_inductor_micro_benchmark() {
  TEST_REPORTS_DIR=$(pwd)/test/test-reports
  if [[ "${TEST_CONFIG}" == *cpu* ]]; then
```
Contributor Author

@huydhn huydhn Sep 4, 2024


We have to update the pre-defined numbers for CPU as well.

I was about to add this part but accidentally removed it because of a merge conflict. Is there anything else you have in mind that we need?

(Just curious to learn more about how the CPU benchmark is set up; if the remaining part is complex, let's do that in a separate PR.)

Contributor


Actually, the perf target in each experiment is for CUDA only. We should extend it to support multiple targets on different devices, but I think we can do that in a separate PR.
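
One possible direction (a sketch only; the class, field names, and numbers below are illustrative, not the repository's actual code) is to replace the single CUDA-only target with a per-device mapping:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Experiment:
    name: str
    metric: str
    # Map a device/arch label to its expected target instead of one CUDA number.
    targets: Dict[str, float] = field(default_factory=dict)

    def target_for(self, device: str, arch: str) -> float:
        # Fall back to the CUDA target when no device-specific one is defined.
        return self.targets.get(f"{device} ({arch})", self.targets.get("cuda", 0.0))

exp = Experiment(
    name="gemv",
    metric="memory_bandwidth(GB/s)",
    targets={"cuda": 870, "cpu (x86_64)": 250},  # illustrative numbers
)
print(exp.target_for("cpu", "x86_64"))  # 250
```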

@huydhn
Contributor Author

huydhn commented Sep 5, 2024

@pytorchbot merge -f 'This should just require lint I guess'

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pytorchmergebot pushed a commit that referenced this pull request Sep 12, 2024
The upload stats workflow currently skips this job (https://github.com/pytorch/pytorch/actions/runs/10807251335/job/29977650639); this is a miss from #135042. So the workflow is running, but nothing has been uploaded yet.

Pull Request resolved: #135780
Approved by: https://github.com/atalman
huydhn added a commit to pytorch/test-infra that referenced this pull request Sep 18, 2024
With pytorch/pytorch#135042, the benchmark now includes information about the device arch, which lets us separate different CUDA or CPU types. Instead of showing just the device, like CUDA, we need to be more specific, for example:

* cpu (x86_64)
* cpu (arm64)
* cuda (NVIDIA A100-SXM4-40GB)
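
A minimal sketch (assuming a simple helper; this is not the torchci code) of how such a label could be composed from the new arch field:

```python
def format_device_label(device: str, arch: str = "") -> str:
    # e.g. "cpu (x86_64)" or "cuda (NVIDIA A100-SXM4-40GB)"; fall back to the
    # bare device name when no arch is reported.
    return f"{device} ({arch})" if arch else device

print(format_device_label("cpu", "x86_64"))                  # cpu (x86_64)
print(format_device_label("cuda", "NVIDIA A100-SXM4-40GB"))  # cuda (NVIDIA A100-SXM4-40GB)
```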

### Testing


https://torchci-git-fork-huydhn-add-cpu-device-llm-a5f029-fbopensource.vercel.app/benchmark/llms
shows different devices and their archs.
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024
Pull Request resolved: pytorch#135042
Approved by: https://github.com/yanboliang
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024

Pull Request resolved: pytorch#135780
Approved by: https://github.com/atalman