Add CUDA_KERNEL_ASSERT_PRINTF, a more flexible CUDA_KERNEL_ASSERT_MSG
#160129
Conversation
|
This appears to be a diff that was exported from Phabricator, but the PR author does not have sufficient permissions to run CI. @mjkatmeta, please follow step 2 of the internal wiki to get write access so you do not need CI approvals in the future. If you think this is a mistake, please contact the PyTorch Dev Infra team. |
|
|
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160129
Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 5814ce2 with merge base 53b8bdb. This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
This pull request was exported from Phabricator. Differential Revision: D79310684 |
|
I'll look at the lint errors. |
|
Using printf causes a runtime perf regression because, as I understand it, its arguments need to be evaluated at runtime even if the printf is never triggered, regardless of architecture. So it is not a good idea to introduce printfs throughout the codebase; the perf tradeoff of a helper macro may not be worth it. |
OK, I was worried that might be the case. I'd like to learn more about this - I'll ping you separately. Thanks! |
|
cc @ngimel @eqy @syed-ahmed who would know about the impact of printf on CUDA kernels more than I would. |
torch/headeronly/macros/Macros.h
Outdated
| } | ||
| #define CUDA_KERNEL_ASSERT_PRINTF(cond, msg, ...) \ | ||
| if (C10_UNLIKELY(!(cond))) { \ | ||
| (void)(printf( \ |
Can't we use fmtlib::printf here? It's faster, safer, and has type checking. On C++20, it can even compile-time check the format string. It's already a dependency, too.
I don't think CUDA supports fmtlib::printf
My understanding here is pretty shallow so I double-checked this -- yes, fmt::printf appears to be host-only.
When I tried including fmt/printf.h and using fmt::printf, I got the following error:
fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu(20): error: calling a __host__ function("int fmt::v9::printf<char [319], unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, long, long , (int)0> (const T1 &, const T2 &...)") from a __global__ function("compute_cuda_kernel<long> ") is not allowed
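For contrast, the device-side printf provided by the CUDA runtime is callable from a `__global__` function, which is why the macro in this PR can use it where fmt cannot. A minimal, self-contained sketch for illustration only (the kernel name and values are made up, not taken from this PR):

```
// Illustrative only: device printf works inside a __global__ function,
// while host-only fmt::printf does not.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void demo_kernel(const long* cumsum, long expected, int n) {
  if (threadIdx.x == 0 && cumsum[n - 1] != expected) {
    // Device printf is compiled by nvcc for the GPU; output is buffered and
    // flushed when the device synchronizes.
    printf("mismatch: expected %ld, got %ld\n", expected, cumsum[n - 1]);
  }
}

int main() {
  long host_vals[4] = {1, 3, 6, 10};
  long* dev_vals = nullptr;
  cudaMalloc(&dev_vals, sizeof(host_vals));
  cudaMemcpy(dev_vals, host_vals, sizeof(host_vals), cudaMemcpyHostToDevice);
  demo_kernel<<<1, 32>>>(dev_vals, /*expected=*/42, /*n=*/4);  // 10 != 42, so it prints
  cudaDeviceSynchronize();
  cudaFree(dev_vals);
  return 0;
}
```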
| #if defined(__ANDROID__) || defined(__APPLE__) || defined(__FreeBSD__) | ||
| // Those platforms do not support assert() | ||
| #define CUDA_KERNEL_ASSERT(cond) | ||
| #define CUDA_KERNEL_ASSERT_MSG(cond, msg) |
If I recall, this doesn't do anything with the message due to some old breakage. Fixing it would be helpful.
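To make the suggestion concrete, here is a heavily hedged sketch of what "doing something with the message" could mean in device code; it deliberately ignores the host-side and platform constraints that led to the no-op definitions above, so treat it as an illustration rather than a proposal for Macros.h (the `_SKETCH` name marks it as hypothetical):

```
// Hypothetical, not the PyTorch definition: even where assert() is
// unavailable, the condition and message could still be surfaced via the
// device-side printf before trapping.
#define CUDA_KERNEL_ASSERT_MSG_SKETCH(cond, msg)                 \
  do {                                                           \
    if (!(cond)) {                                               \
      printf("Assertion failed: `%s`: %s\n", #cond, msg);        \
      __trap(); /* CUDA device intrinsic; aborts the kernel */   \
    }                                                            \
  } while (0)
```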
aten/src/ATen/native/cuda/Repeat.cu
Outdated
| threadIdx.z, | ||
| result_size, | ||
| cumsum_ptr[size - 1 ] | ||
| ); |
Is this meant to be an illustrative example or to cover some existing failure modes?
Naively I would guess that this check would be doable on the host?
cumsum_ptr is not available on the host, so this check cannot be done on the host without an h2d sync. The same thing happens in all indexing ops - the check cannot be done on the host without a sync
To add a little background: I picked this location because this is where @drdarshan instrumented the kernel with a printf manually. This extra output helped debug a production training issue that was ultimately caused by bad data (the exact sizes of the values being compared were useful to folks who knew the model).
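To make the constraint described above concrete: checking this value on the host would require copying it out of device memory first, which synchronizes the host with the kernel that produced it. A hedged sketch of what that host-side alternative would look like; the function and error text are hypothetical, though `C10_CUDA_CHECK` and `TORCH_CHECK` are existing PyTorch macros:

```
#include <cuda_runtime.h>
#include <c10/cuda/CUDAException.h>
#include <c10/util/Exception.h>

// Hypothetical host-side version of the Repeat.cu check (not part of this PR):
// reading the last cumsum element forces a device-to-host copy, so the host
// must wait for the kernel that filled cumsum_ptr before it can compare.
void check_output_size_on_host(
    const int64_t* cumsum_ptr, // device pointer
    int64_t size,
    int64_t result_size) {
  int64_t last_cumsum = 0;
  C10_CUDA_CHECK(cudaMemcpy(
      &last_cumsum,
      cumsum_ptr + size - 1,
      sizeof(int64_t),
      cudaMemcpyDeviceToHost));
  TORCH_CHECK(
      last_cumsum == result_size,
      "repeat_interleave: output_size (", result_size,
      ") must equal the sum of the repeats tensor (", last_cumsum, ")");
}
```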
…MSG` (pytorch#160129) Summary: This new assertion helper bundles a printf call with the assertion. The goal is to make changes like D77904753 easier to copy and more intuitive. Parametrized error messages are a substantial improvement in debuggability and let us avoid a whole cycle of recompiling + re-testing failing training workflows. We include file, line number, and failing condition in the printf (along with the message provided by the user), so the logged line is as self-contained as possible. You can find logs by grepping output for `CUDA_KERNEL_ASSERT` (`lg -k CUDA_KERNEL_ASSERT mast:<mast-job>`). I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). I'm adding one callsite as a demonstration in this diff. # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though - and from the comments in Macros.h that feels risky to me. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to D77904753. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * There's a comment in this file that mentions a performance impact of having a printf call in a kernel on ROCm architectures, even if that code path isn't triggered. It's not clear to me if that performance impact exists on other architectures as well. * This still goes to stdout. We can try to redirect users via a mechanism like D78943237 (but with more appropriate hint text). Test Plan: `CUDA_KERNEL_ASSERT` messages go to stdout: https://fburl.com/mlhub/5sahdjqf Example message: ``` [trainers2]:[CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:31: Assertion failed: `result_size == cumsum_ptr[size - 1]`: block: [100,0,0], thread: [383,0,0] Invalid input! In `repeat_interleave`, the `output_size` argument (904397) must be the same as the sum of the elements in the `repeats` tensor (904593). ``` (This is a test of the version that calls `__assert_fail`, specifically.) You can also use: `lg -k CUDA_KERNEL_ASSERT --mast-job-version 11 mast:f757736379-27849973610-TrainingApplication_MXP4V_1LBDZ` Rollback Plan: Reviewed By: mradmila Differential Revision: D79310684
9329372 to
ccd35df
Compare
|
This pull request was exported from Phabricator. Differential Revision: D79310684 |
4ed84d8 to
76b551c
Compare
|
This pull request was exported from Phabricator. Differential Revision: D79310684 |
…MSG` (pytorch#160129) Summary: This new assertion helper bundles a printf call with the assertion. The goal is to make changes like D77904753 easier to copy and more intuitive. Parametrized error messages are a substantial improvement in debuggability and let us avoid a whole cycle of recompiling + re-testing failing training workflows. We include file, line number, and failing condition in the printf (along with the message provided by the user), so the logged line is as self-contained as possible. You can find logs by grepping output for `CUDA_KERNEL_ASSERT` (`lg -k CUDA_KERNEL_ASSERT mast:<mast-job>`). I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). I'm adding one callsite as a demonstration in this diff. # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though - and from the comments in Macros.h that feels risky to me. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to D77904753. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * There's a comment in this file that mentions a performance impact of having a printf call in a kernel on ROCm architectures, even if that code path isn't triggered. It's not clear to me if that performance impact exists on other architectures as well. * This still goes to stdout. We can try to redirect users via a mechanism like D78943237 (but with more appropriate hint text). Test Plan: `CUDA_KERNEL_ASSERT` messages go to stdout, with file, line number, block and thread IDs: https://fburl.com/mlhub/6iuz3c58 Example message: ``` [trainers6]:[CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25, block: [308,0,0], thread: [166,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the ``` (This is a test of the version that calls `__assert_fail`, specifically.) You can also use: `lg -k CUDA_KERNEL_ASSERT --mast-job-version 11 mast:f757736379-27849973610-TrainingApplication_MXP4V_1LBDZ` Rollback Plan: Reviewed By: cnphil, mradmila Differential Revision: D79310684
|
Update this to include |
76b551c to
796efcb
Compare
|
This pull request was exported from Phabricator. Differential Revision: D79310684 |
796efcb to
4046382
Compare
|
This pull request was exported from Phabricator. Differential Revision: D79310684 |
4f53ea7 to
e27bf15
Compare
|
This pull request was exported from Phabricator. Differential Revision: D79310684 |
e27bf15 to
b08cc82
Compare
|
This pull request was exported from Phabricator. Differential Revision: D79310684 |
…MSG` (pytorch#160129) Summary: This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows. We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message. I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to D77904753. But the main downside here is the performance hit, so let's have an organized way of doing it first. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice, however (see "benchmarks" section). # Benchmarks * I ran the following benchmarks several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f * Results are here -- I couldn't find a significant difference before or after: https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431 Test Plan: This is a minimal test case where I instrumented Repeat.cu and ran the following: ``` import torch def main(): x = torch.ones(10, dtype=torch.int64, device="cuda:0") torch.repeat_interleave(x, x, output_size=0) ``` The updated check in Repeat.cu is as follows (D81807360 for Meta folks): ``` CUDA_KERNEL_ASSERT_PRINTF( result_size == cumsum_ptr[size - 1], "Invalid input! In `repeat_interleave`, the `output_size` argument (%ld) must be the same as the sum of the elements in the `repeats` tensor (%ld).\n", result_size, cumsum_ptr[size - 1 ] ); ``` Now we see the new message (from printf) alongside the assert failure: ``` $ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors [...] [CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10). fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed. [...] ``` The printf message includes block and thread IDs, plus the exact values that were compared in the assertion.
Rollback Plan: Reviewed By: cnphil, mradmila Differential Revision: D79310684
|
I re-ran benchmarks using what I believe is a more trustworthy method, and against more kernels -- I'm using
I am not sure how to resolve the conflicting opinions on how we should proceed:
Are there different performance tests I can run that would definitively answer the performance question? What is the impact if we modify one of the kernels and then, say, we see a real regression? I hope we can collectively agree on a path forward. Thanks! |
|
My position is that adding even potential perf regressions (and increasing register count is definitely a potential performance regression) for low-signal debug prints is never worth it. From the current asserts we already get which kernel asserted and which assert fired. We don't know the exact values (e.g., we had index 32 and the maximum allowable size is 31), but that rarely helps - often some previous kernel, not even the one throwing the assertion, produced the invalid data. |
b08cc82 to
75a4d31
Compare
|
This pull request was exported from Phabricator. Differential Revision: D79310684 |
75a4d31 to
e727e3b
Compare
…MSG` (pytorch#160129) Summary: This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows. We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message. I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to pytorch#157996. But the main downside here is the performance hit, so let's have an organized way of doing it first. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "benchmarks" below). However, there are changes to the generated PTX that could result in performance problems later (see "changes in generated PTX" below). # Benchmarks * I ran the following benchmarks several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f * Results are here -- I couldn't find a significant difference before or after: https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431 # Change in generated PTX This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu): ``` buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda # then use the printed .so file like this: ~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so ``` ## with printf This is the version of the code that appears in this diff: https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a ## without printf I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with: ``` CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]); ``` https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d (Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.)
Test Plan: Running this minimal test case: ``` import torch def main(): x = torch.ones(10, dtype=torch.int64, device="cuda:0") torch.repeat_interleave(x, x, output_size=0) ``` Now we see the new message (from printf) alongside the assert failure: ``` $ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors [...] [CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10). fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed. [...] ``` Rollback Plan: Reviewed By: cnphil, mradmila Differential Revision: D79310684
e727e3b to
5814ce2
Compare
|
@mjkatmeta has exported this pull request. If you are a Meta employee, you can view the originating diff in D79310684. |
|
@ngimel, @drdarshan, @janeyx99 - based on the side discussions we've had, we've agreed to instrument only non-performance-critical CUDA kernels, such as repeat_interleave (in Repeat.cu). This PR has been updated accordingly.
Let me know if there are any other changes you'd like to see. |
|
@pytorchbot merge |
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team. |
…MSG` (pytorch#160129) This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows. We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message. I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to pytorch#157996. But the main downside here is the performance hit, so let's have an organized way of doing it first. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "benchmarks" below). However, there are changes to the generated PTX that could result in performance problems later (see "changes in generated PTX" below). # Benchmarks * I ran the following benchmarks several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f * Results are here -- I couldn't find a significant difference before or after: https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431 # Change in generated PTX This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu): ``` buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda # then use the printed .so file like this: ~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so ``` ## with printf This is the version of the code that appears in this diff: https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a ## without printf I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with: ``` CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]); ``` https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d (Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.)
Test Plan: Running this minimal test case: ``` import torch def main(): x = torch.ones(10, dtype=torch.int64, device="cuda:0") torch.repeat_interleave(x, x, output_size=0) ``` Now we see the new message (from printf) alongside the assert failure: ``` $ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors [...] [CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10). fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed. [...] ``` Rollback Plan: Reviewed By: mradmila Differential Revision: D79310684 Pull Request resolved: pytorch#160129 Approved by: https://github.com/ngimel
This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows.
We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by
`__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message. I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other
`CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side).
Alternatives
* We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity.
* If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to Slightly improve error message from repeat_interleave kernel #157996. But the main downside here is the performance hit, so let's have an organized way of doing it first.
Risks/Problems
* We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU.
* Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "benchmarks" below). However, there are changes to the generated PTX that could result in performance problems later (see "changes in generated PTX" below).
Benchmarks
* I ran the following benchmarks several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f
* Results are here -- I couldn't find a significant difference before or after: https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431
Change in generated PTX
This is the easiest way I found to run nvcc over just Repeat.cu (a buck2 target that includes just a copy of Repeat.cu); the buck2 build and cuobjdump commands are quoted in the commit message above.
with printf
This is the version of the code that appears in this diff:
https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a
without printf
I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with a plain `CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]);` call: https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d
(Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.)
Test Plan:
Running the minimal test case quoted in the test plan above (torch.repeat_interleave with a deliberately mismatched output_size).
Now we see the new message (from printf) alongside the assert failure; the full output is quoted in the commit message above.
Rollback Plan:
Reviewed By: mradmila
Differential Revision: D79310684
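For readers who want the shape of the new macro without opening the diff, here is a minimal sketch consistent with the description and example output above. The actual definition in torch/headeronly/macros/Macros.h (and its ROCm and no-assert-platform fallbacks) differs in details, so the `_SKETCH` name, the exact formatting, and the use of the `##__VA_ARGS__` extension are approximations, not the shipped code:

```
// Approximate reconstruction of the idea behind CUDA_KERNEL_ASSERT_PRINTF:
// print file, line, function, block/thread IDs, the failing condition, and a
// caller-supplied printf-style format string plus arguments, then fail the
// normal kernel assert. Macros.h is the source of truth.
#define CUDA_KERNEL_ASSERT_PRINTF_SKETCH(cond, msg, ...)                \
  if (C10_UNLIKELY(!(cond))) {                                          \
    printf(                                                             \
        "[CUDA_KERNEL_ASSERT] %s:%d: %s: block: [%u,%u,%u], "           \
        "thread: [%u,%u,%u]: Assertion failed: `%s`: " msg,             \
        __FILE__,                                                       \
        __LINE__,                                                       \
        __func__,                                                       \
        blockIdx.x,                                                     \
        blockIdx.y,                                                     \
        blockIdx.z,                                                     \
        threadIdx.x,                                                    \
        threadIdx.y,                                                    \
        threadIdx.z,                                                    \
        #cond,                                                          \
        ##__VA_ARGS__);                                                 \
    CUDA_KERNEL_ASSERT(cond);                                           \
  }
```

Usage then mirrors the Repeat.cu call quoted in the test plan: the condition, a printf-style format string, and the device-side values to interpolate.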