-
Notifications
You must be signed in to change notification settings - Fork 25.7k
Add CUDA_KERNEL_ASSERT_PRINTF
, a more flexible CUDA_KERNEL_ASSERT_MSG
#160129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This appears to be a diff that was exported from phabricator, but the PR author does not have sufficient permissions to run CI. @mjkatmeta, please do step 2 of internal wiki to get write access so you do not need to get CI approvals in the future. If you think this is a mistake, please contact the Pytorch Dev Infra team. |
|
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160129
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 5814ce2 with merge base 53b8bdb ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
This pull request was exported from Phabricator. Differential Revision: D79310684 |
I'll look at the lint errors. |
Using printf causes a perf regression during runtime because, as I understand it, its arguments need to be evaluated at runtime even if printf is never triggered, regardless of architecture. Therefore, it is not good to introduce printfs everywhere in the codebase. The tradeoff with perf vs a helper macro may not be worth it. |
OK, I was worried that might be the case. I'd like to learn more about this - I'll ping you separately. Thanks! |
cc @ngimel @eqy @syed-ahmed who would know about the impact of printf on CUDA kernels more than I would. |
torch/headeronly/macros/Macros.h
Outdated
} | ||
#define CUDA_KERNEL_ASSERT_PRINTF(cond, msg, ...) \ | ||
if (C10_UNLIKELY(!(cond))) { \ | ||
(void)(printf( \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we not use fmtlib::printf here? It's faster, safer, and has type checking. On C++20, it can even compile time check formatting literal. It's already a dependency too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think cuda supportss fmtlib::printf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding here is pretty shallow so I double-checked this -- yes, fmt::printf appears to be host-only.
I tried importing fmt/printf.h
and using fmt::printf
, I get the following error:
fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu(20): error: calling a __host__ function("int fmt::v9::printf<char [319], unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, long, long , (int)0> (const T1 &, const T2 &...)") from a __global__ function("compute_cuda_kernel<long> ") is not allowed
#if defined(__ANDROID__) || defined(__APPLE__) || defined(__FreeBSD__) | ||
// Those platforms do not support assert() | ||
#define CUDA_KERNEL_ASSERT(cond) | ||
#define CUDA_KERNEL_ASSERT_MSG(cond, msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I recall, this doesn't do anything with the message due to some old breakage. Fixing it would be helpful.
aten/src/ATen/native/cuda/Repeat.cu
Outdated
threadIdx.z, | ||
result_size, | ||
cumsum_ptr[size - 1 ] | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this meant to be an illustrative example or cover some existing failure modes?
Naively I would guess that this check would be doable on the host?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cumsum_ptr
is not available on the host, this check cannot be done on the host without h2d sync. Same thing is happening in all indexing ops - the check cannot be done on host without a sync
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To add a little background: I picked this location because this is where @drdarshan instrumented the kernel with a printf manually. This extra output helped debugging a production training issues that was ultimately caused by bad data (the exact sizes of the values being compared were useful to folks who knew the model).
…MSG` (pytorch#160129) Summary: This new assertion helper bundles a printf call with the assertion. The goal is to make changes like D77904753 easier to copy and more intuitive. Parametrized error messages are a substantial improvement in debuggability and let us avoid a whole cycle of recompiling + re-testing failing training workflows. We include file, line number, and failing condition in the printf (along with the message provided by the user), so the logged line is as self-contained as possible. You can find logs by greppng output for `CUDA_KERNEL_ASSERT` (`lg -k CUDA_KERNEL_ASSERT mast:<mast-job>` I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). I'm one callsite as a demonstration in this diff. # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though - and from the comments in Macros.h that feels risky to me. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to D77904753. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * There's a comment in this file that mentions a performance impact of having a printf call in a kernel on ROCm architectures, even if that code path isn't triggered. It's not clear to me if that performance impact exists on other architectures as well. * This still goes to stdout. We can try to redirect users via a mechanism like D78943237 (but with more appropriate hint text). Test Plan: `CUDA_KERNEL_ASSERT` messages go to stdout: https://fburl.com/mlhub/5sahdjqf Example message: ``` [trainers2]:[CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:31: Assertion failed: `result_size == cumsum_ptr[size - 1]`: block: [100,0,0], thread: [383,0,0] Invalid input! In `repeat_interleave`, the `output_size` argument (904397) must be the same as the sum of the elements in the `repeats` tensor (904593). ``` (This is a test of the version that calls `__assert_fail`, specifically.) You can also use: `lg -k CUDA_KERNEL_ASSERT --mast-job-version 11 mast:f757736379-27849973610-TrainingApplication_MXP4V_1LBDZ` Rollback Plan: Reviewed By: mradmila Differential Revision: D79310684
9329372
to
ccd35df
Compare
This pull request was exported from Phabricator. Differential Revision: D79310684 |
…MSG` (pytorch#160129) Summary: This new assertion helper bundles a printf call with the assertion. The goal is to make changes like D77904753 easier to copy and more intuitive. Parametrized error messages are a substantial improvement in debuggability and let us avoid a whole cycle of recompiling + re-testing failing training workflows. We include file, line number, and failing condition in the printf (along with the message provided by the user), so the logged line is as self-contained as possible. You can find logs by greppng output for `CUDA_KERNEL_ASSERT` (`lg -k CUDA_KERNEL_ASSERT mast:<mast-job>` I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). I'm one callsite as a demonstration in this diff. # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though - and from the comments in Macros.h that feels risky to me. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to D77904753. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * There's a comment in this file that mentions a performance impact of having a printf call in a kernel on ROCm architectures, even if that code path isn't triggered. It's not clear to me if that performance impact exists on other architectures as well. * This still goes to stdout. We can try to redirect users via a mechanism like D78943237 (but with more appropriate hint text). Test Plan: `CUDA_KERNEL_ASSERT` messages go to stdout: https://fburl.com/mlhub/5sahdjqf Example message: ``` [trainers2]:[CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:31: Assertion failed: `result_size == cumsum_ptr[size - 1]`: block: [100,0,0], thread: [383,0,0] Invalid input! In `repeat_interleave`, the `output_size` argument (904397) must be the same as the sum of the elements in the `repeats` tensor (904593). ``` (This is a test of the version that calls `__assert_fail`, specifically.) You can also use: `lg -k CUDA_KERNEL_ASSERT --mast-job-version 11 mast:f757736379-27849973610-TrainingApplication_MXP4V_1LBDZ` Rollback Plan: Reviewed By: mradmila Differential Revision: D79310684
4ed84d8
to
76b551c
Compare
This pull request was exported from Phabricator. Differential Revision: D79310684 |
…MSG` (pytorch#160129) Summary: This new assertion helper bundles a printf call with the assertion. The goal is to make changes like D77904753 easier to copy and more intuitive. Parametrized error messages are a substantial improvement in debuggability and let us avoid a whole cycle of recompiling + re-testing failing training workflows. We include file, line number, and failing condition in the printf (along with the message provided by the user), so the logged line is as self-contained as possible. You can find logs by greppng output for `CUDA_KERNEL_ASSERT` (`lg -k CUDA_KERNEL_ASSERT mast:<mast-job>` I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). I'm one callsite as a demonstration in this diff. # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though - and from the comments in Macros.h that feels risky to me. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to D77904753. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * There's a comment in this file that mentions a performance impact of having a printf call in a kernel on ROCm architectures, even if that code path isn't triggered. It's not clear to me if that performance impact exists on other architectures as well. * This still goes to stdout. We can try to redirect users via a mechanism like D78943237 (but with more appropriate hint text). Test Plan: `CUDA_KERNEL_ASSERT` messages go to stdout, with file, line number, block and thread IDs: https://fburl.com/mlhub/6iuz3c58 Example message: ``` [trainers6]:[CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25, block: [308,0,0], thread: [166,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the ``` (This is a test of the version that calls `__assert_fail`, specifically.) You can also use: `lg -k CUDA_KERNEL_ASSERT --mast-job-version 11 mast:f757736379-27849973610-TrainingApplication_MXP4V_1LBDZ` Rollback Plan: Reviewed By: cnphil, mradmila Differential Revision: D79310684
Update this to include |
76b551c
to
796efcb
Compare
…MSG` (pytorch#160129) Summary: This new assertion helper bundles a printf call with the assertion. The goal is to make changes like D77904753 easier to copy and more intuitive. Parametrized error messages are a substantial improvement in debuggability and let us avoid a whole cycle of recompiling + re-testing failing training workflows. We include file, line number, and failing condition in the printf (along with the message provided by the user), so the logged line is as self-contained as possible. You can find logs by greppng output for `CUDA_KERNEL_ASSERT` (`lg -k CUDA_KERNEL_ASSERT mast:<mast-job>` I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). I'm one callsite as a demonstration in this diff. # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though - and from the comments in Macros.h that feels risky to me. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to D77904753. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * There's a comment in this file that mentions a performance impact of having a printf call in a kernel on ROCm architectures, even if that code path isn't triggered. It's not clear to me if that performance impact exists on other architectures as well. * This still goes to stdout. We can try to redirect users via a mechanism like D78943237 (but with more appropriate hint text). Test Plan: `CUDA_KERNEL_ASSERT` messages go to stdout, with file, line number, block and thread IDs: https://fburl.com/mlhub/6iuz3c58 Example message: ``` [trainers6]:[CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25, block: [308,0,0], thread: [166,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the ``` (This is a test of the version that calls `__assert_fail`, specifically.) You can also use: `lg -k CUDA_KERNEL_ASSERT --mast-job-version 11 mast:f757736379-27849973610-TrainingApplication_MXP4V_1LBDZ` Rollback Plan: Reviewed By: cnphil, mradmila Differential Revision: D79310684
This pull request was exported from Phabricator. Differential Revision: D79310684 |
…MSG` (pytorch#160129) Summary: This new assertion helper bundles a printf call with the assertion. The goal is to make changes like D77904753 easier to copy and more intuitive. Parametrized error messages are a substantial improvement in debuggability and let us avoid a whole cycle of recompiling + re-testing failing training workflows. We include file, line number, and failing condition in the printf (along with the message provided by the user), so the logged line is as self-contained as possible. You can find logs by greppng output for `CUDA_KERNEL_ASSERT` (`lg -k CUDA_KERNEL_ASSERT mast:<mast-job>` I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). I'm one callsite as a demonstration in this diff. # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though - and from the comments in Macros.h that feels risky to me. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to D77904753. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * There's a comment in this file that mentions a performance impact of having a printf call in a kernel on ROCm architectures, even if that code path isn't triggered. It's not clear to me if that performance impact exists on other architectures as well. * This still goes to stdout. We can try to redirect users via a mechanism like D78943237 (but with more appropriate hint text). Test Plan: `CUDA_KERNEL_ASSERT` messages go to stdout, with file, line number, block and thread IDs: https://fburl.com/mlhub/6iuz3c58 Example message: ``` [trainers6]:[CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25, block: [308,0,0], thread: [166,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the ``` (This is a test of the version that calls `__assert_fail`, specifically.) You can also use: `lg -k CUDA_KERNEL_ASSERT --mast-job-version 11 mast:f757736379-27849973610-TrainingApplication_MXP4V_1LBDZ` Rollback Plan: Reviewed By: cnphil, mradmila Differential Revision: D79310684
796efcb
to
4046382
Compare
This pull request was exported from Phabricator. Differential Revision: D79310684 |
…MSG` (pytorch#160129) Summary: Pull Request resolved: pytorch#160129 This new assertion helper bundles a printf call with the assertion. The goal is to make changes like D77904753 easier to copy and more intuitive. Parametrized error messages are a substantial improvement in debuggability and let us avoid a whole cycle of recompiling + re-testing failing training workflows. We include file, line number, and failing condition in the printf (along with the message provided by the user), so the logged line is as self-contained as possible. You can find logs by greppng output for `CUDA_KERNEL_ASSERT` (`lg -k CUDA_KERNEL_ASSERT mast:<mast-job>` I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). I'm one callsite as a demonstration in this diff. # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though - and from the comments in Macros.h that feels risky to me. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to D77904753. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * There's a comment in this file that mentions a performance impact of having a printf call in a kernel on ROCm architectures, even if that code path isn't triggered. It's not clear to me if that performance impact exists on other architectures as well. * This still goes to stdout. We can try to redirect users via a mechanism like D78943237 (but with more appropriate hint text). Test Plan: `CUDA_KERNEL_ASSERT` messages go to stdout, with file, line number, block and thread IDs: https://fburl.com/mlhub/6iuz3c58 Example message: ``` [trainers6]:[CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25, block: [308,0,0], thread: [166,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the ``` (This is a test of the version that calls `__assert_fail`, specifically.) You can also use: `lg -k CUDA_KERNEL_ASSERT --mast-job-version 11 mast:f757736379-27849973610-TrainingApplication_MXP4V_1LBDZ` Rollback Plan: Reviewed By: cnphil, mradmila Differential Revision: D79310684
4f53ea7
to
e27bf15
Compare
This pull request was exported from Phabricator. Differential Revision: D79310684 |
e27bf15
to
b08cc82
Compare
This pull request was exported from Phabricator. Differential Revision: D79310684 |
…MSG` (pytorch#160129) Summary: This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows. We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message. I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to D77904753. But the main downside here is the performance hit, so let's have an organized way of doing it first. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice, however (see "benchmarks" section). # Benchmarks * I ran the following benchmarks a several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f * Results are here -- I couldn't find a significant difference before or after https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431 Test Plan: This is a minimal test case where I instrumented Repeat.cu and ran the following: ``` import torch def main(): x = torch.ones(10, dtype=torch.int64, device="cuda:0") torch.repeat_interleave(x, x, output_size=0) ``` The updated check in Repeat.cu is as follows (D81807360 for Meta folks): ``` CUDA_KERNEL_ASSERT_PRINTF( result_size == cumsum_ptr[size - 1], "Invalid input! In `repeat_interleave`, the `output_size` argument (%ld) must be the same as the sum of the elements in the `repeats` tensor (%ld).\n", result_size, cumsum_ptr[size - 1 ] ); ``` Now we see the new message (from printf) alongside the assert failure: ``` $ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors [...] [CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10). fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed. [...[ ``` The printf message includes block and thread IDs, plus the exact values that were compared in the assertion. Rollback Plan: Reviewed By: cnphil, mradmila Differential Revision: D79310684
I re-ran benchmarks using what I believe is a more trustworthy method, and against more kernels -- I'm using
I am not sure how to resolve the conflicting opinions on how we should proceed:
Are there different performance tests I can run that would definitively answer the performance question? What is the impact if we modify one of the kernels and then, say, we see a real regression? I hope we collectively agree on a path forward. Thanks! |
My position is that adding even potential perf regressions (and increasing register count is definitely a potential performance regression) for low signal debug prints is never worth it. From current asserts we are already getting which kernel asserted, which assert fired, we don't know the exact values (e.g. we had index 32 and maximum allowable size is 31) but that rarely helps - some previous kernel, not even the one throwing assertion, produced invalid data. |
b08cc82
to
75a4d31
Compare
…MSG` (pytorch#160129) Summary: This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows. We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message. I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to D77904753. But the main downside here is the performance hit, so let's have an organized way of doing it first. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice, however (see "benchmarks" section). # Benchmarks * I ran the following benchmarks a several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f * Results are here -- I couldn't find a significant difference before or after https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431 Test Plan: This is a minimal test case where I instrumented Repeat.cu and ran the following: ``` import torch def main(): x = torch.ones(10, dtype=torch.int64, device="cuda:0") torch.repeat_interleave(x, x, output_size=0) ``` The updated check in Repeat.cu is as follows (D81807360 for Meta folks): ``` CUDA_KERNEL_ASSERT_PRINTF( result_size == cumsum_ptr[size - 1], "Invalid input! In `repeat_interleave`, the `output_size` argument (%ld) must be the same as the sum of the elements in the `repeats` tensor (%ld).\n", result_size, cumsum_ptr[size - 1 ] ); ``` Now we see the new message (from printf) alongside the assert failure: ``` $ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors [...] [CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10). fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed. [...[ ``` The printf message includes block and thread IDs, plus the exact values that were compared in the assertion. Rollback Plan: Reviewed By: cnphil, mradmila Differential Revision: D79310684
This pull request was exported from Phabricator. Differential Revision: D79310684 |
…MSG` (pytorch#160129) Summary: Pull Request resolved: pytorch#160129 This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows. We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message. I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to D77904753. But the main downside here is the performance hit, so let's have an organized way of doing it first. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice, however (see "benchmarks" section). # Benchmarks * I ran the following benchmarks a several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f * Results are here -- I couldn't find a significant difference before or after https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431 Test Plan: This is a minimal test case where I instrumented Repeat.cu and ran the following: ``` import torch def main(): x = torch.ones(10, dtype=torch.int64, device="cuda:0") torch.repeat_interleave(x, x, output_size=0) ``` The updated check in Repeat.cu is as follows (D81807360 for Meta folks): ``` CUDA_KERNEL_ASSERT_PRINTF( result_size == cumsum_ptr[size - 1], "Invalid input! In `repeat_interleave`, the `output_size` argument (%ld) must be the same as the sum of the elements in the `repeats` tensor (%ld).\n", result_size, cumsum_ptr[size - 1 ] ); ``` Now we see the new message (from printf) alongside the assert failure: ``` $ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors [...] [CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10). fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed. [...[ ``` The printf message includes block and thread IDs, plus the exact values that were compared in the assertion. Rollback Plan: Reviewed By: cnphil, mradmila Differential Revision: D79310684
75a4d31
to
e727e3b
Compare
…MSG` (pytorch#160129) Summary: This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows. We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message. I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to pytorch#157996. But the main downside here is the performance hit, so let's have an organized way of doing it first. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "benchmarks" below). However, there are changes to the generated PTX that could result in performance problems later (see "changes in generated PTX" below). # Benchmarks * I ran the following benchmarks a several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f * Results are here -- I couldn't find a significant difference before or after https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431 # Change in generated PTX This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu): ``` buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda # then use the printed .so file like this: ~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so ``` ## with printf This is the version of the code that appears in this diff: https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a ## without printf I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with: ``` CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]); ``` https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d (Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.) Test Plan: Running this minimal test case: ``` import torch def main(): x = torch.ones(10, dtype=torch.int64, device="cuda:0") torch.repeat_interleave(x, x, output_size=0) ``` Now we see the new message (from printf) alongside the assert failure: ``` $ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors [...] [CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10). fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed. [...[ ``` Rollback Plan: Reviewed By: cnphil, mradmila Differential Revision: D79310684
e727e3b
to
5814ce2
Compare
@mjkatmeta has exported this pull request. If you are a Meta employee, you can view the originating diff in D79310684. |
@ngimel, @drdarshan, @janeyx99 - based on the side discussions we've had, we've agreed to instrument only non-performance-critical CUDA kernels, such as repeat_interleave (in Repeat.cu). This PR is updated to:
Let me know if there are any other changes you'd like to see. |
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
…MSG` (pytorch#160129) This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows. We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message. I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to pytorch#157996. But the main downside here is the performance hit, so let's have an organized way of doing it first. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "benchmarks" below). However, there are changes to the generated PTX that could result in performance problems later (see "changes in generated PTX" below). # Benchmarks * I ran the following benchmarks a several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f * Results are here -- I couldn't find a significant difference before or after https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431 # Change in generated PTX This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu): ``` buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda # then use the printed .so file like this: ~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so ``` ## with printf This is the version of the code that appears in this diff: https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a ## without printf I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with: ``` CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]); ``` https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d (Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.) Test Plan: Running this minimal test case: ``` import torch def main(): x = torch.ones(10, dtype=torch.int64, device="cuda:0") torch.repeat_interleave(x, x, output_size=0) ``` Now we see the new message (from printf) alongside the assert failure: ``` $ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors [...] [CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10). fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed. [...[ ``` Rollback Plan: Reviewed By: mradmila Differential Revision: D79310684 Pull Request resolved: pytorch#160129 Approved by: https://github.com/ngimel
…MSG` (pytorch#160129) This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows. We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message. I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to pytorch#157996. But the main downside here is the performance hit, so let's have an organized way of doing it first. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "benchmarks" below). However, there are changes to the generated PTX that could result in performance problems later (see "changes in generated PTX" below). # Benchmarks * I ran the following benchmarks a several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f * Results are here -- I couldn't find a significant difference before or after https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431 # Change in generated PTX This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu): ``` buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda # then use the printed .so file like this: ~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so ``` ## with printf This is the version of the code that appears in this diff: https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a ## without printf I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with: ``` CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]); ``` https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d (Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.) Test Plan: Running this minimal test case: ``` import torch def main(): x = torch.ones(10, dtype=torch.int64, device="cuda:0") torch.repeat_interleave(x, x, output_size=0) ``` Now we see the new message (from printf) alongside the assert failure: ``` $ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors [...] [CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10). fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed. [...[ ``` Rollback Plan: Reviewed By: mradmila Differential Revision: D79310684 Pull Request resolved: pytorch#160129 Approved by: https://github.com/ngimel
…MSG` (pytorch#160129) This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows. We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message. I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to pytorch#157996. But the main downside here is the performance hit, so let's have an organized way of doing it first. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "benchmarks" below). However, there are changes to the generated PTX that could result in performance problems later (see "changes in generated PTX" below). # Benchmarks * I ran the following benchmarks a several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f * Results are here -- I couldn't find a significant difference before or after https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431 # Change in generated PTX This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu): ``` buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda # then use the printed .so file like this: ~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so ``` ## with printf This is the version of the code that appears in this diff: https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a ## without printf I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with: ``` CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]); ``` https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d (Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.) Test Plan: Running this minimal test case: ``` import torch def main(): x = torch.ones(10, dtype=torch.int64, device="cuda:0") torch.repeat_interleave(x, x, output_size=0) ``` Now we see the new message (from printf) alongside the assert failure: ``` $ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors [...] [CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10). fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed. [...[ ``` Rollback Plan: Reviewed By: mradmila Differential Revision: D79310684 Pull Request resolved: pytorch#160129 Approved by: https://github.com/ngimel
…MSG` (pytorch#160129) This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows. We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message. I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to pytorch#157996. But the main downside here is the performance hit, so let's have an organized way of doing it first. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "benchmarks" below). However, there are changes to the generated PTX that could result in performance problems later (see "changes in generated PTX" below). # Benchmarks * I ran the following benchmarks a several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f * Results are here -- I couldn't find a significant difference before or after https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431 # Change in generated PTX This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu): ``` buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda # then use the printed .so file like this: ~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so ``` ## with printf This is the version of the code that appears in this diff: https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a ## without printf I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with: ``` CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]); ``` https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d (Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.) Test Plan: Running this minimal test case: ``` import torch def main(): x = torch.ones(10, dtype=torch.int64, device="cuda:0") torch.repeat_interleave(x, x, output_size=0) ``` Now we see the new message (from printf) alongside the assert failure: ``` $ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors [...] [CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10). fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed. [...[ ``` Rollback Plan: Reviewed By: mradmila Differential Revision: D79310684 Pull Request resolved: pytorch#160129 Approved by: https://github.com/ngimel
This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows.
We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by
__assert_fail
. There's also an easy-to-grep-for keywordCUDA_KERNEL_ASSERT
in the message.I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other
CUDA_KERNEL_ASSERT*
variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side).Alternatives
CUDA_KERNEL_ASSERT_MSG
. That would mean introducingprintf
calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity.CUDA_KERNEL_ASSERT
callsites without a macro, similar to Slightly improve error message from repeat_interleave kernel #157996. But the main downside here is the performance hit, so let's have an organized way of doing it first.Risks/Problems
%s
, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU.Benchmarks
Change in generated PTX
This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu):
with printf
This is the version of the code that appears in this diff:
https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a
without printf
I recompiled, replacing
CUDA_KERNEL_ASSERT_PRINTF(...)
in Repeat.cu with:https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d
(Both of these are annotated with
// CHAR ARRAY:
comments to make the string constants easier to read.)Test Plan:
Running this minimal test case:
Now we see the new message (from printf) alongside the assert failure:
Rollback Plan:
Reviewed By: mradmila
Differential Revision: D79310684