Run inductor micro benchmark on x86 metal runner #135042
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135042
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit b531007 with merge base 6c37674.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
We have to update the pre-defined numbers for CPU as well. What does the dashboard look like after this change? Are CUDA and CPU in the same tab or in separate ones? I'd prefer separate tabs.
I'm OK with doing this in a follow-up PR.
I expect the results to be on the same page at https://hud.pytorch.org/benchmark/llms, but listed under different device types, i.e.
test_inductor_micro_benchmark() {
  TEST_REPORTS_DIR=$(pwd)/test/test-reports
  if [[ "${TEST_CONFIG}" == *cpu* ]]; then
We have to update the pre-defined numbers for CPU as well.
I was about to add this part but accidentally removed it because of a merge conflict. Is there anything else you have in mind that we need?
(Just curious to learn more about how the CPU benchmark is set up. If the remaining part is complex, let's do that in a separate PR.)
Actually, the perf target in each experiment is for CUDA only. We should extend it to support multiple targets on different devices, but I think we can do that in a separate PR.
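To make the "multiple targets on different devices" idea concrete, here is a minimal sketch of what a per-device target could look like. Everything in it is hypothetical: `ExperimentTargets`, its `targets` mapping, and `target_for` are illustrative names, not the benchmark's existing API. The only number used is the existing CUDA target for `gather_gemv` (int8) from the CSV in this PR; no CPU target is defined because none has been agreed on.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ExperimentTargets:
    """Hypothetical container: one perf target per device type."""

    name: str
    metric: str
    # Assumption: targets are keyed by device string ("cuda", "cpu").
    targets: Dict[str, float] = field(default_factory=dict)

    def target_for(self, device: str) -> Optional[float]:
        # Return None when no target has been defined for this device yet.
        return self.targets.get(device)


# Only the existing CUDA target is filled in; CPU is intentionally left unset.
gather_gemv_int8 = ExperimentTargets(
    name="gather_gemv",
    metric="memory_bandwidth(GB/s)",
    targets={"cuda": 990.0},
)
```

With this shape, a caller can skip the target comparison when `target_for("cpu")` returns `None` instead of comparing CPU results against a CUDA number.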
@pytorchbot merge -f 'This should just require lint I guess'
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team
The upload stats workflow currently skips this job (https://github.com/pytorch/pytorch/actions/runs/10807251335/job/29977650639); this was missed in #135042. So the workflow is running, but nothing has been uploaded yet.

Pull Request resolved: #135780
Approved by: https://github.com/atalman
With pytorch/pytorch#135042, there is now information about the device arch from the benchmark to separate different CUDA or CPU types. Instead of showing a device like CUDA, we need to be more specific, for example:

* cpu (x86_64)
* cpu (arm64)
* cuda (NVIDIA A100-SXM4-40GB)

### Testing

https://torchci-git-fork-huydhn-add-cpu-device-llm-a5f029-fbopensource.vercel.app/benchmark/llms shows different devices and their archs.
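The label format above can be sketched in a few lines. This is an illustration only, not the dashboard's actual code: `device_label` and `cpu_arch` are hypothetical helper names. For CPUs, the standard library's `platform.machine()` is one plausible source of the arch string. For GPUs, a name such as the one in the example could come from `torch.cuda.get_device_name()`, which is not called here so that the sketch stays stdlib-only.

```python
import platform


def device_label(device: str, arch: str) -> str:
    """Combine device type and arch into a label such as 'cpu (x86_64)'."""
    return f"{device} ({arch})"


def cpu_arch() -> str:
    # platform.machine() reports the host architecture. The exact string
    # depends on the OS: ARM hosts may report "arm64" or "aarch64".
    return platform.machine()


labels = [
    device_label("cpu", "x86_64"),
    device_label("cpu", "arm64"),
    device_label("cuda", "NVIDIA A100-SXM4-40GB"),
]
```

Keeping the device and arch as separate fields and joining them only for display means the same records can still be filtered by device alone.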
This enables inductor micro benchmark on CPU (x86):

* Running on AWS metal runner for more accurate benchmark
* Add a new `arch` column, which will be either x86_64 or arm64 for CPU, or the GPU name for GPU. We can use this later to differentiate between different setups, e.g. cuda (a100) vs cuda (a10g) or cpu (x86_64) vs cpu (arm64)

The next step would be to run this on CPU arm64 and cuda (a10g).

### Testing

Here are the CSV results from my test run https://github.com/pytorch/pytorch/actions/runs/10709344180

```
name,metric,target,actual,dtype,device,arch,is_model
mlp_layer_norm_gelu,flops_utilization,0.8,17.36,bfloat16,cpu,x86_64,False
gather_gemv,memory_bandwidth(GB/s),990,170.80,int8,cpu,x86_64,False
gather_gemv,memory_bandwidth(GB/s),1060,204.78,bfloat16,cpu,x86_64,False
Mixtral-8x7B-v0.1,token_per_sec,175,26.68,int8,cpu,x86_64,True
Mixtral-8x7B-v0.1,memory_bandwidth(GB/s),1130,171.91,int8,cpu,x86_64,True
Mixtral-8x7B-v0.1,compilation_time(s),162,47.36,int8,cpu,x86_64,True
gemv,memory_bandwidth(GB/s),870,236.36,int8,cpu,x86_64,False
gemv,memory_bandwidth(GB/s),990,305.71,bfloat16,cpu,x86_64,False
Llama-2-7b-chat-hf,token_per_sec,94,14.01,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),1253,185.18,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,compilation_time(s),162,74.99,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,token_per_sec,144,25.09,int8,cpu,x86_64,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),957,165.83,int8,cpu,x86_64,True
Llama-2-7b-chat-hf,compilation_time(s),172,70.69,int8,cpu,x86_64,True
layer_norm,memory_bandwidth(GB/s),950,172.03,bfloat16,cpu,x86_64,False
```

Pull Request resolved: pytorch#135042
Approved by: https://github.com/yanboliang
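For anyone consuming this CSV, here is a rough sketch of how the `device` and `arch` columns might be used to group results. It assumes only the column layout shown in this PR. `load_results` and `group_by_device` are hypothetical helper names, and the sample embeds three rows copied from the test run.

```python
import csv
import io

# Three rows copied from the test run in this PR.
SAMPLE = """name,metric,target,actual,dtype,device,arch,is_model
gather_gemv,memory_bandwidth(GB/s),990,170.80,int8,cpu,x86_64,False
gemv,memory_bandwidth(GB/s),990,305.71,bfloat16,cpu,x86_64,False
Llama-2-7b-chat-hf,token_per_sec,94,14.01,bfloat16,cpu,x86_64,True
"""


def load_results(text):
    """Parse the benchmark CSV and convert numeric and boolean columns."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["target"] = float(row["target"])
        row["actual"] = float(row["actual"])
        row["is_model"] = row["is_model"] == "True"
        rows.append(row)
    return rows


def group_by_device(rows):
    """Group rows under a 'device (arch)' key, e.g. 'cpu (x86_64)'."""
    groups = {}
    for row in rows:
        key = f'{row["device"]} ({row["arch"]})'
        groups.setdefault(key, []).append(row)
    return groups


rows = load_results(SAMPLE)
groups = group_by_device(rows)
```

Note that the `target` values in these rows are the existing CUDA targets, so an `actual / target` ratio computed for the CPU rows is not a meaningful pass/fail signal yet.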