Slightly improve error message from repeat_interleave kernel #157996
Conversation
Summary: In many investigations relating to invalid feature values, the three-argument form of `repeat_interleave` currently prints the following message if there is an inconsistency between `sum(repeats)` and `output_size`:

```
Assertion `result_size == cumsum_ptr[size - 1]` failed.
```

This is hard for model authors to understand, so I made the error slightly more comprehensible. After the fix, stdout contains the actual values of these parameters: https://fburl.com/mlhub/cfyyhh3q

```
Invalid input! In `repeat_interleave`, the `output_size` argument (949487) must be the same as the sum of the elements in the `repeats` tensor (949687).
```

In many cases this is useful information: for example, the difference between the two values above (949687 - 949487 = 200) happens to be the length of one of the features.

## What are my concerns with this change?

1. Output from `__assert_fail` goes to `stderr`, whereas `printf` writes to `stdout`. This differs from the usual debugging flow, where all logs can be found in `stderr`. I could not find a way to redirect `printf` to stderr or `__assert_fail` to stdout.
2. Two checks happen instead of one in the error path. I wanted to preserve the semantics of what happens inside `__assert_fail`.
3. I have not seen this pattern in other PyTorch kernels, but `repeat_interleave` with three arguments seems special in other ways too.

Test Plan:
* Built an ephemeral package with my changes: https://www.internalfb.com/intern/servicelab/build/736441058/
* Verified that a job with these changes indeed prints the expected message to stdout: https://fburl.com/mlhub/jgbqk8eg
* I will export to GH and run CI/CD tests.

Rollback Plan: Just reverting this diff should be sufficient. Since this change is in CUDA kernels, I do not believe there is a way to change the error message via a JK.

Reviewed By: mradmila

Differential Revision: D77904753
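The check this message describes can be illustrated with a small host-side sketch (hypothetical Python, not the kernel code): `output_size` plays the role of `result_size`, and the last element of the running sum of `repeats` plays the role of `cumsum_ptr[size - 1]`.

```python
from itertools import accumulate

def check_output_size(repeats, output_size):
    """Return the kernel-style error message, or None when consistent."""
    cumsum = list(accumulate(repeats))  # analogue of cumsum_ptr in the kernel
    if output_size != cumsum[-1]:       # result_size != cumsum_ptr[size - 1]
        return (
            f"Invalid input! In `repeat_interleave`, the `output_size` "
            f"argument ({output_size}) must be the same as the sum of the "
            f"elements in the `repeats` tensor ({cumsum[-1]})."
        )
    return None

print(check_output_size([100, 100], 200))  # None (consistent)
print(check_output_size([100, 50], 200))   # error message showing both values
```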
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157996

Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 815e7de with merge base 3404c1f. This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D77904753
```cpp
printf("%s:%d:%s: block: [%d,%d,%d], thread: [%d,%d,%d] "
       "Invalid input! In `repeat_interleave`, the `output_size` argument (%ld) must be the same as the sum of the elements in the `repeats` tensor (%ld).\n",
       __FILE__, __LINE__, __func__, blockIdx.x, blockIdx.y, blockIdx.z, threadIdx.x, threadIdx.y, threadIdx.z, result_size, cumsum_ptr[size - 1]);
CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1])
```
Suggested change:

```cpp
CUDA_KERNEL_ASSERT_MSG(result_size == cumsum_ptr[size - 1]), fmt::printf("%s:%d:%s: block: [%d,%d,%d], thread: [%d,%d,%d] "
    "Invalid input! In `repeat_interleave`, the `output_size` argument (%ld) must be the same as the sum of the elements in the `repeats` tensor (%ld).\n",
    __FILE__, __LINE__, __func__, blockIdx.x, blockIdx.y, blockIdx.z, threadIdx.x, threadIdx.y, threadIdx.z, result_size, cumsum_ptr[size - 1]);
```
A bit cleaner using fmtlib here
Much nicer, let me try this out, thank you!
@Skylion007 - I just tried this and it looks like `fmt` cannot be used inside device code:

```
error: calling a __host__ function("int fmt::v9::printf<char [148], long, long , (int)0> (const T1 &, const T2 &...)") from a __global__ function("compute_cuda_kernel<int> ") is not allowed
```
Would you mind providing some examples of how messages are formatted inside kernels in PyTorch? I looked for some and it seems we generally only print static strings. Thank you.
Ah... :(
Line 372 in 7caf6c8:

```cpp
// TODO: This doesn't assert the message because I (chilli) couldn't figure out
```
I believe that CUDA_KERNEL_ASSERT_MSG does work on some platforms but needs to have the message pre-allocated. Doing this would have required me to reimplement something like sprintf in device code, which seemed like more effort. The vanilla printf does seem acceptable for some use cases. Thank you.
```cpp
    int64_t result_size) {
  CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]);
  if (C10_UNLIKELY((result_size != cumsum_ptr[size - 1]))) {
    printf("%s:%d:%s: block: [%d,%d,%d], thread: [%d,%d,%d] "
```
nit: this kernel is launched with a 1d grid and block, so there's no point in printing the .y and .z fields
@eqy, @syed-ahmed - would you have any feedback on this PR? And if not, would you mind approving it? Looks like it needs your approval to ship. Thank you!
FYI, approval from ngimel should be enough to merge
@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…filter in torchrun

Summary: Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that were `tee`'ed to `stdout` and `stderr`. The general idea is to:

- Duplicate the `tee`s on `stdout` and `stderr` to separate files, `filtered_stdout.log` and `filtered_stderr.log`, respectively.
- In these files, as their names suggest, only log lines matching a customizable filter.
- Later on, in another PR, append the contents of these files to the reply file.

Outline of changes in this PR:

- Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter.
- Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them.
- In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created.

Test Plan:

```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test
```

```
Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688
Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a)
Analyzing targets. Remaining 0/200
Executing actions. Remaining 0/12856 0.1s exec time total
Command: test. Finished 1 local
Time elapsed: 17:37.9s
Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0
```

```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test
```

```
Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262
Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922)
Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared
Executing actions. Remaining 0/186 1:05.5s exec time total
Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%)
Time elapsed: 1:11.5s
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Rollback Plan:

Differential Revision: D80188995
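The filtering duplication described above can be sketched as follows (a hypothetical minimal sketch, not the actual `TailLog` implementation): every line goes to the primary stream, and also to the filtered file only when it matches one of the passed filters.

```python
import re

class FilteredTee:
    """Hypothetical sketch: write every line to `primary`, and also to
    `filtered` when it matches any of the regex filters."""
    def __init__(self, primary, filtered, filters):
        self.primary = primary      # stand-in for the tee'd stdout/stderr log
        self.filtered = filtered    # stand-in for filtered_stdout.log etc.
        self.patterns = [re.compile(f) for f in filters]

    def write_line(self, line):
        self.primary.append(line)
        if any(p.search(line) for p in self.patterns):
            self.filtered.append(line)

primary, filtered = [], []
tee = FilteredTee(primary, filtered, [r"CUDA_KERNEL_ASSERT", r"Invalid input"])
tee.write_line("step 100 loss=0.25")
tee.write_line("[CUDA_KERNEL_ASSERT] Repeat.cu:25: Assertion failed")
print(len(primary), len(filtered))  # 2 1
```

When no filters are passed, no `FilteredTee` would be constructed at all, matching the "no duplicated streams are created" behavior above.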
…MSG` (pytorch#160129)

Summary: This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.)

Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding and re-running failing training workflows.

We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message.

I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side).

# Alternatives

* We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity.
* If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to pytorch#157996. But the main downside there is the performance hit, so let's have an organized way of doing it first.

# Risks/Problems

* We're shoving a lot of stuff into this printf. If a filename (at compile time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU.
* Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "Benchmarks" below). However, there are changes to the generated PTX that could result in performance problems later (see "Change in generated PTX" below).

# Benchmarks

* I ran the following benchmarks several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f
* Results are here -- I couldn't find a significant difference before or after: https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431

# Change in generated PTX

This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu):

```
buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda
# then use the printed .so file like this:
~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so
```

## with printf

This is the version of the code that appears in this diff: https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a

## without printf

I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with:

```
CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]);
```

https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d

(Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.)

Test Plan: Running this minimal test case:

```
import torch

def main():
    x = torch.ones(10, dtype=torch.int64, device="cuda:0")
    torch.repeat_interleave(x, x, output_size=0)
```

Now we see the new message (from printf) alongside the assert failure:

```
$ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors
[...]
[CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10).
fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed.
[...]
```

Rollback Plan:

Reviewed By: cnphil, mradmila

Differential Revision: D79310684
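Why `result_size` must equal `cumsum_ptr[size - 1]` can be seen from a host-side reference sketch of the cumsum-based mapping from output slots to input elements (hypothetical Python, not the CUDA source): the last cumulative-sum entry is exactly the number of output elements the `repeats` imply, so any other `output_size` leaves slots that map nowhere.

```python
from itertools import accumulate
import bisect

def repeat_interleave_ref(values, repeats, output_size):
    # cumsum[-1] is the total output length implied by `repeats`;
    # this is the host-side analogue of the kernel's consistency check.
    cumsum = list(accumulate(repeats))
    if output_size != cumsum[-1]:
        raise ValueError(
            f"output_size ({output_size}) must equal sum(repeats) ({cumsum[-1]})"
        )
    out = []
    for i in range(output_size):
        # Index of the first cumsum entry strictly greater than i, i.e. the
        # input element that owns output slot i (a binary search, analogous
        # to the per-thread lookup in the kernel).
        src = bisect.bisect_right(cumsum, i)
        out.append(values[src])
    return out

print(repeat_interleave_ref([7, 8], [2, 3], 5))  # [7, 7, 8, 8, 8]
```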
…MSG` (#160129) This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows. We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message. I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to #157996. But the main downside here is the performance hit, so let's have an organized way of doing it first. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "benchmarks" below). 
However, there are changes to the generated PTX that could result in performance problems later (see "changes in generated PTX" below). # Benchmarks * I ran the following benchmarks a several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f * Results are here -- I couldn't find a significant difference before or after https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431 # Change in generated PTX This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu): ``` buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda # then use the printed .so file like this: ~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so ``` ## with printf This is the version of the code that appears in this diff: https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a ## without printf I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with: ``` CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]); ``` https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d (Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.) Test Plan: Running this minimal test case: ``` import torch def main(): x = torch.ones(10, dtype=torch.int64, device="cuda:0") torch.repeat_interleave(x, x, output_size=0) ``` Now we see the new message (from printf) alongside the assert failure: ``` $ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors [...] [CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! 
In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10). fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed. [...[ ``` Rollback Plan: Reviewed By: mradmila Differential Revision: D79310684 Pull Request resolved: #160129 Approved by: https://github.com/ngimel
…MSG` (pytorch#160129) This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows. We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message. I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to pytorch#157996. But the main downside here is the performance hit, so let's have an organized way of doing it first. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "benchmarks" below). 
However, there are changes to the generated PTX that could result in performance problems later (see "changes in generated PTX" below). # Benchmarks * I ran the following benchmarks a several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f * Results are here -- I couldn't find a significant difference before or after https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431 # Change in generated PTX This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu): ``` buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda # then use the printed .so file like this: ~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so ``` ## with printf This is the version of the code that appears in this diff: https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a ## without printf I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with: ``` CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]); ``` https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d (Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.) Test Plan: Running this minimal test case: ``` import torch def main(): x = torch.ones(10, dtype=torch.int64, device="cuda:0") torch.repeat_interleave(x, x, output_size=0) ``` Now we see the new message (from printf) alongside the assert failure: ``` $ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors [...] [CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! 
In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10). fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed. [...[ ``` Rollback Plan: Reviewed By: mradmila Differential Revision: D79310684 Pull Request resolved: pytorch#160129 Approved by: https://github.com/ngimel
…MSG` (pytorch#160129) This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows. We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message. I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to pytorch#157996. But the main downside here is the performance hit, so let's have an organized way of doing it first. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "benchmarks" below). 
However, there are changes to the generated PTX that could result in performance problems later (see "changes in generated PTX" below). # Benchmarks * I ran the following benchmarks a several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f * Results are here -- I couldn't find a significant difference before or after https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431 # Change in generated PTX This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu): ``` buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda # then use the printed .so file like this: ~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so ``` ## with printf This is the version of the code that appears in this diff: https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a ## without printf I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with: ``` CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]); ``` https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d (Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.) Test Plan: Running this minimal test case: ``` import torch def main(): x = torch.ones(10, dtype=torch.int64, device="cuda:0") torch.repeat_interleave(x, x, output_size=0) ``` Now we see the new message (from printf) alongside the assert failure: ``` $ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors [...] [CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! 
In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10). fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed. [...[ ``` Rollback Plan: Reviewed By: mradmila Differential Revision: D79310684 Pull Request resolved: pytorch#160129 Approved by: https://github.com/ngimel
…MSG` (pytorch#160129) This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows. We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message. I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to pytorch#157996. But the main downside here is the performance hit, so let's have an organized way of doing it first. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "benchmarks" below). 
However, there are changes to the generated PTX that could result in performance problems later (see "changes in generated PTX" below). # Benchmarks * I ran the following benchmarks a several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f * Results are here -- I couldn't find a significant difference before or after https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431 # Change in generated PTX This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu): ``` buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda # then use the printed .so file like this: ~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so ``` ## with printf This is the version of the code that appears in this diff: https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a ## without printf I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with: ``` CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]); ``` https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d (Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.) Test Plan: Running this minimal test case: ``` import torch def main(): x = torch.ones(10, dtype=torch.int64, device="cuda:0") torch.repeat_interleave(x, x, output_size=0) ``` Now we see the new message (from printf) alongside the assert failure: ``` $ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors [...] [CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! 
In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10). fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed. [...[ ``` Rollback Plan: Reviewed By: mradmila Differential Revision: D79310684 Pull Request resolved: pytorch#160129 Approved by: https://github.com/ngimel
…MSG` (pytorch#160129)

This new assertion helper bundles a printf call with the assertion. The goal is to make it more intuitive and less error-prone to instrument asserts with device-side information. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values, letting us avoid a whole cycle of rebuilding and re-running failing training workflows.

We include the file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword, `CUDA_KERNEL_ASSERT`, in the message. I'm following the existing patterns of arch-specific macros; e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side).

# Alternatives

* We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from kernels where there weren't any before, though, which seems like a bad idea given the performance sensitivity.
* If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to pytorch#157996. But the main downside here is the performance hit, so let's have an organized way of doing it first.

# Risks/Problems

* We're shoving a lot of stuff into this printf. If a filename (at compile time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault; I don't know how it behaves on a GPU.
* Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "Benchmarks" below). However, there are changes to the generated PTX that could result in performance problems later (see "Change in generated PTX" below).

# Benchmarks

* I ran the following benchmarks several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f
* Results are here; I couldn't find a significant difference before or after: https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431

# Change in generated PTX

This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu):

```
buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda
# then use the printed .so file like this:
~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so
```

## With printf

This is the version of the code that appears in this diff: https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a

## Without printf

I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with:

```
CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]);
```

https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d

(Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.)

Test Plan:

Running this minimal test case:

```
import torch

def main():
    x = torch.ones(10, dtype=torch.int64, device="cuda:0")
    torch.repeat_interleave(x, x, output_size=0)
```

Now we see the new message (from printf) alongside the assert failure:

```
$ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors
[...]
[CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10).
fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed.
[...]
```

Rollback Plan:

Reviewed By: mradmila

Differential Revision: D79310684

Pull Request resolved: pytorch#160129
Approved by: https://github.com/ngimel
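The invariant that the kernel assert enforces can be illustrated with a pure-Python model of `repeat_interleave` (a sketch only, not PyTorch's CUDA implementation): the output length is, by construction, the sum of the `repeats` tensor, so any other `output_size` is inconsistent.

```python
def repeat_interleave(values, repeats):
    # Pure-Python model (illustration only): repeat each element of
    # `values` the corresponding number of times from `repeats`.
    out = []
    for v, r in zip(values, repeats):
        out.extend([v] * r)
    return out

out = repeat_interleave([10, 20, 30], [1, 2, 3])
assert out == [10, 20, 20, 30, 30, 30]
# This is the quantity `output_size` must match: sum(repeats).
assert len(out) == sum([1, 2, 3])
```

Passing `output_size=0` with `repeats` summing to 10, as in the test case above, violates exactly this length invariant.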
…filter in torchrun (pytorch#160712)

Summary:
Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that were `tee`'d to `stdout` and `stderr`. The general idea is to:
- Duplicate the `tee`s on `stdout` and `stderr` to separate files, `filtered_stdout.log` and `filtered_stderr.log`, respectively.
- In these files, as their names suggest, log only the lines matching a customizable filter.
- Later on, in another PR, append the contents of these files to the reply file.

Outline of changes in this PR:
- Enhance `TailLog` so it can 1) stream to a file, and 2) write only lines that match the passed filter.
- Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them.
- In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated streams to the files above. When no filters are passed in, no duplicated streams are created.

Test Plan:
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test
```
```
Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688
Network: Up: 398B  Down: 44MiB  (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a)
Analyzing targets. Remaining 0/200
Executing actions. Remaining 0/12856  0.1s exec time total
Command: test.  Finished 1 local
Time elapsed: 17:37.9s
Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test
```
```
Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262
Network: Up: 94KiB  Down: 417MiB  (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922)
Analyzing targets. Remaining 0/3  536 actions, 555 artifacts declared
Executing actions. Remaining 0/186  1:05.5s exec time total
Command: test.  Finished 7 local, 1 remote, 115 cache (93% hit)  37.0s exec time cached (56%)
Time elapsed: 1:11.5s
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test
```
```
Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e
Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118
Network: Up: 1.0MiB  Down: 2.9GiB  (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72)
Analyzing targets. Remaining 0/5  533 actions, 552 artifacts declared
Executing actions. Remaining 0/176  1:22.7s exec time total
Command: test.  Finished 51 local, 13 remote, 50 cache (44% hit)  19.8s exec time cached (23%)
Time elapsed: 1:45.2s
Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test [DISABLED]
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test [DISABLED]
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test
```

Reviewed By: mradmila

Differential Revision: D80188995
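The duplicate-stream filtering described above can be sketched as a minimal line filter that copies only matching lines to a side file. This is a hypothetical illustration of the idea, not the actual `TailLog` API; the name `filtered_tee` and its signature are made up for this sketch.

```python
import io
import re

def filtered_tee(lines, dest, patterns):
    # Hypothetical sketch: for each line in the stream, write it to `dest`
    # only if it matches at least one of the given regex patterns.
    # The main (unfiltered) stream is assumed to be handled elsewhere.
    compiled = [re.compile(p) for p in patterns]
    matched = []
    for line in lines:
        if any(c.search(line) for c in compiled):
            dest.write(line)
            matched.append(line)
    return matched

side_log = io.StringIO()
stream = [
    "step 1 ok\n",
    "CUDA_KERNEL_ASSERT: Assertion failed\n",
    "step 2 ok\n",
]
filtered_tee(stream, side_log, [r"CUDA_KERNEL_ASSERT"])
# side_log now holds only the assertion line
```

With an easy-to-grep keyword such as `CUDA_KERNEL_ASSERT` in the kernel error message, a filter like this keeps the side file small while preserving exactly the lines worth surfacing in a reply file.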
…filter in torchrun (pytorch#160712) Summary: Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that was `tee`'ed to `stdout` and `stderr`. The general idea is to: - Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively. - In these files, as its name suggests, only log lines matching a customizable filter. - Later on in another PR, append the contents of these files to the reply file. Outline of changes in this PR: - Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter. - Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them. - In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created. Test Plan: ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a) Analyzing targets. Remaining 0/200 Executing actions. Remaining 0/12856 0.1s exec time total Command: test. Finished 1 local Time elapsed: 17:37.9s Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test ``` ``` Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922) Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared Executing actions. 
Remaining 0/186 1:05.5s exec time total Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%) Time elapsed: 1:11.5s Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118 Network: Up: 1.0MiB Down: 2.9GiB (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72) Analyzing targets. Remaining 0/5 533 actions, 552 artifacts declared Executing actions. Remaining 0/176 1:22.7s exec time total Command: test. Finished 51 local, 13 remote, 50 cache (44% hit) 19.8s exec time cached (23%) Time elapsed: 1:45.2s Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test ``` Reviewed By: mradmila Differential Revision: D80188995
…filter in torchrun (pytorch#160712) Summary: Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that was `tee`'ed to `stdout` and `stderr`. The general idea is to: - Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively. - In these files, as its name suggests, only log lines matching a customizable filter. - Later on in another PR, append the contents of these files to the reply file. Outline of changes in this PR: - Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter. - Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them. - In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created. Test Plan: ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a) Analyzing targets. Remaining 0/200 Executing actions. Remaining 0/12856 0.1s exec time total Command: test. Finished 1 local Time elapsed: 17:37.9s Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test ``` ``` Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922) Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared Executing actions. 
Remaining 0/186 1:05.5s exec time total Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%) Time elapsed: 1:11.5s Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118 Network: Up: 1.0MiB Down: 2.9GiB (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72) Analyzing targets. Remaining 0/5 533 actions, 552 artifacts declared Executing actions. Remaining 0/176 1:22.7s exec time total Command: test. Finished 51 local, 13 remote, 50 cache (44% hit) 19.8s exec time cached (23%) Time elapsed: 1:45.2s Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test ``` Reviewed By: mradmila Differential Revision: D80188995
…filter in torchrun (pytorch#160712) Summary: Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that was `tee`'ed to `stdout` and `stderr`. The general idea is to: - Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively. - In these files, as its name suggests, only log lines matching a customizable filter. - Later on in another PR, append the contents of these files to the reply file. Outline of changes in this PR: - Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter. - Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them. - In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created. Test Plan: ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a) Analyzing targets. Remaining 0/200 Executing actions. Remaining 0/12856 0.1s exec time total Command: test. Finished 1 local Time elapsed: 17:37.9s Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test ``` ``` Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922) Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared Executing actions. 
Remaining 0/186 1:05.5s exec time total Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%) Time elapsed: 1:11.5s Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118 Network: Up: 1.0MiB Down: 2.9GiB (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72) Analyzing targets. Remaining 0/5 533 actions, 552 artifacts declared Executing actions. Remaining 0/176 1:22.7s exec time total Command: test. Finished 51 local, 13 remote, 50 cache (44% hit) 19.8s exec time cached (23%) Time elapsed: 1:45.2s Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test ``` Reviewed By: mradmila Differential Revision: D80188995
…filter in torchrun (pytorch#160712) Summary: Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that was `tee`'ed to `stdout` and `stderr`. The general idea is to: - Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively. - In these files, as its name suggests, only log lines matching a customizable filter. - Later on in another PR, append the contents of these files to the reply file. Outline of changes in this PR: - Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter. - Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them. - In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created. Test Plan: ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a) Analyzing targets. Remaining 0/200 Executing actions. Remaining 0/12856 0.1s exec time total Command: test. Finished 1 local Time elapsed: 17:37.9s Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test ``` ``` Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922) Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared Executing actions. 
Remaining 0/186 1:05.5s exec time total Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%) Time elapsed: 1:11.5s Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118 Network: Up: 1.0MiB Down: 2.9GiB (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72) Analyzing targets. Remaining 0/5 533 actions, 552 artifacts declared Executing actions. Remaining 0/176 1:22.7s exec time total Command: test. Finished 51 local, 13 remote, 50 cache (44% hit) 19.8s exec time cached (23%) Time elapsed: 1:45.2s Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test ``` Reviewed By: mradmila Differential Revision: D80188995
…filter in torchrun (pytorch#160712) Summary: Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that was `tee`'ed to `stdout` and `stderr`. The general idea is to: - Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively. - In these files, as its name suggests, only log lines matching a customizable filter. - Later on in another PR, append the contents of these files to the reply file. Outline of changes in this PR: - Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter. - Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them. - In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created. Test Plan: ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a) Analyzing targets. Remaining 0/200 Executing actions. Remaining 0/12856 0.1s exec time total Command: test. Finished 1 local Time elapsed: 17:37.9s Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test ``` ``` Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922) Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared Executing actions. 
Remaining 0/186 1:05.5s exec time total Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%) Time elapsed: 1:11.5s Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118 Network: Up: 1.0MiB Down: 2.9GiB (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72) Analyzing targets. Remaining 0/5 533 actions, 552 artifacts declared Executing actions. Remaining 0/176 1:22.7s exec time total Command: test. Finished 51 local, 13 remote, 50 cache (44% hit) 19.8s exec time cached (23%) Time elapsed: 1:45.2s Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test ``` Reviewed By: mradmila Differential Revision: D80188995
…filter in torchrun (pytorch#160712) Summary: Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that was `tee`'ed to `stdout` and `stderr`. The general idea is to: - Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively. - In these files, as its name suggests, only log lines matching a customizable filter. - Later on in another PR, append the contents of these files to the reply file. Outline of changes in this PR: - Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter. - Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them. - In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created. Test Plan: ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a) Analyzing targets. Remaining 0/200 Executing actions. Remaining 0/12856 0.1s exec time total Command: test. Finished 1 local Time elapsed: 17:37.9s Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test ``` ``` Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922) Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared Executing actions. 
Remaining 0/186 1:05.5s exec time total Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%) Time elapsed: 1:11.5s Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118 Network: Up: 1.0MiB Down: 2.9GiB (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72) Analyzing targets. Remaining 0/5 533 actions, 552 artifacts declared Executing actions. Remaining 0/176 1:22.7s exec time total Command: test. Finished 51 local, 13 remote, 50 cache (44% hit) 19.8s exec time cached (23%) Time elapsed: 1:45.2s Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test ``` Reviewed By: fduwjj, mradmila Differential Revision: D80188995
…filter in torchrun (pytorch#160712) Summary: Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that was `tee`'ed to `stdout` and `stderr`. The general idea is to: - Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively. - In these files, as its name suggests, only log lines matching a customizable filter. - Later on in another PR, append the contents of these files to the reply file. Outline of changes in this PR: - Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter. - Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them. - In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created. Test Plan: ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a) Analyzing targets. Remaining 0/200 Executing actions. Remaining 0/12856 0.1s exec time total Command: test. Finished 1 local Time elapsed: 17:37.9s Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test ``` ``` Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922) Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared Executing actions. 
Remaining 0/186 1:05.5s exec time total Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%) Time elapsed: 1:11.5s Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118 Network: Up: 1.0MiB Down: 2.9GiB (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72) Analyzing targets. Remaining 0/5 533 actions, 552 artifacts declared Executing actions. Remaining 0/176 1:22.7s exec time total Command: test. Finished 51 local, 13 remote, 50 cache (44% hit) 19.8s exec time cached (23%) Time elapsed: 1:45.2s Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test ``` Reviewed By: fduwjj, mradmila Differential Revision: D80188995
…filter in torchrun (pytorch#160712) Summary: Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that was `tee`'ed to `stdout` and `stderr`. The general idea is to: - Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively. - In these files, as its name suggests, only log lines matching a customizable filter. - Later on in another PR, append the contents of these files to the reply file. Outline of changes in this PR: - Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter. - Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them. - In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created. Test Plan: ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a) Analyzing targets. Remaining 0/200 Executing actions. Remaining 0/12856 0.1s exec time total Command: test. Finished 1 local Time elapsed: 17:37.9s Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test ``` ``` Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922) Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared Executing actions. 
Rollback Plan: Differential Revision: D80188995 Pull Request resolved: #160712 Approved by: https://github.com/fduwjj
Summary:
In many investigations relating to invalid feature values, the three-argument form of `repeat_interleave` currently prints the following message if there is an inconsistency between `sum(repeats)` and `output_size`:

```
Assertion `result_size == cumsum_ptr[size - 1]` failed.
```

This is a bit hard for model authors to understand, so I made the error slightly more comprehensible. After the fix, stdout contains the actual values of these parameters: https://fburl.com/mlhub/cfyyhh3q

```
Invalid input! In `repeat_interleave`, the `output_size` argument (949487) must be the same as the sum of the elements in the `repeats` tensor (949687).
```
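The error path described here — report the offending values in a readable message first, then fail the assertion exactly as before — can be sketched as follows. This is a Python illustration of the pattern, not the actual CUDA device code:

```python
def check_output_size(output_size, repeats_sum):
    # First check: emit a readable message with the actual values (stdout),
    # mirroring the device-side printf added by this change.
    if output_size != repeats_sum:
        print(
            f"Invalid input! In `repeat_interleave`, the `output_size` "
            f"argument ({output_size}) must be the same as the sum of "
            f"the elements in the `repeats` tensor ({repeats_sum})."
        )
    # Second check: preserve the original abort semantics of the assertion.
    assert output_size == repeats_sum

check_output_size(949687, 949687)  # consistent inputs pass silently
```

Note the deliberate duplication: the condition is evaluated once for the message and once for the assertion, which is concern 2 below.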
In many cases, this is potentially useful information: we know, for example, that the difference between the two values above (949687 - 949487 = 200) happens to be the length of one of the features.
What are my concerns with this change?
1. Outputs from `__assert_fail` go to `stderr`, whereas `printf` writes to `stdout`. This is not the usual debugging flow where all logs can be found in `stderr`. I could not find a way to redirect `printf` to stderr or `__assert_fail` to stdout.
2. Two checks happen instead of one in the error path. I wanted to preserve the semantics of what happens inside `__assert_fail`.
3. I have not seen this pattern in other PyTorch kernels, but `repeat_interleave` with three arguments seems special in other ways too.

Test Plan:
* Built an ephemeral package with my changes: https://www.internalfb.com/intern/servicelab/build/736441058/
* Verified that a job with these changes indeed prints out the expected message to stdout: https://fburl.com/mlhub/jgbqk8eg
* I will export to GH and run CI/CD tests.
Rollback Plan:
steps:
  - manual.note:
      content: >-
        Just reverting this diff should be sufficient. Since this change is in
        CUDA kernels, I do not believe there is a way to change the error
        message via a JK.
Reviewed By: mradmila
Differential Revision: D77904753