Slightly improve error message from repeat_interleave kernel by drdarshan · Pull Request #157996 · pytorch/pytorch · GitHub
@drdarshan
Contributor

Summary:
In many investigations relating to invalid feature values, the three-argument form of `repeat_interleave` currently prints the following message if there is an inconsistency between `sum(repeats)` and `output_size`:
```
Assertion `result_size == cumsum_ptr[size - 1]` failed.
```

This is a bit hard for model authors to understand, so I made the error slightly more comprehensible. After the fix, stdout contains the actual values of these parameters: https://fburl.com/mlhub/cfyyhh3q

```
Invalid input! In `repeat_interleave`, the `output_size` argument (949487) must be the same as the sum of the elements in the `repeats` tensor (949687).
```

In many cases this is useful information: for example, we know that the difference between the two values above (949687 - 949487 = 200) happens to match the lengths of one of the features.
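The invariant being asserted can also be illustrated on the host. The sketch below is plain Python, not PyTorch code: `check_output_size` is a hypothetical helper, and the `repeats` values are made up so that their sum reproduces the numbers in the message above.

```python
def check_output_size(repeats, output_size):
    """Return None if consistent, else a message mirroring the new kernel output.

    Hypothetical host-side helper for illustration; not part of PyTorch.
    """
    total = sum(repeats)
    if total == output_size:
        return None
    return (
        "Invalid input! In `repeat_interleave`, the `output_size` argument "
        f"({output_size}) must be the same as the sum of the elements in the "
        f"`repeats` tensor ({total})."
    )


# Made-up values: an extra 200 is present in repeats but not in output_size.
repeats = [949487, 200]
message = check_output_size(repeats, output_size=949487)
print(message)
print("difference:", sum(repeats) - 949487)
```

Seeing both numbers side by side is what makes the 200-element gap recognizable, which the bare assertion text cannot show.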

## What are my concerns with this change?
1. Outputs from `__assert_fail` go to `stderr` whereas `printf` writes to `stdout`. This is not the usual debugging flow where all logs can be found in `stderr`. I could not find a way to redirect `printf` to `stderr` or `__assert_fail` to `stdout`.
2. Two checks happen instead of one in the error path. I wanted to preserve the semantics of what happens inside `__assert_fail`.
3. I have not seen this pattern in other PyTorch kernels but `repeat_interleave` with three arguments seems special in other ways too.
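
On the first concern, one possible host-side workaround, assuming the job is started by a wrapper that controls the child process, is to merge the two streams at launch time rather than inside the kernel. The sketch below uses Python's `subprocess` with `stderr=subprocess.STDOUT`; the child command is a stand-in that writes one line to each stream, not a real CUDA job.

```python
import subprocess
import sys

# Stand-in child: one line on stdout (like device printf), one on stderr
# (like __assert_fail). A real job would be the training command instead.
child_code = (
    "import sys\n"
    "print('printf-style message on stdout')\n"
    "print('assert-style message on stderr', file=sys.stderr)\n"
)

result = subprocess.run(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,  # send stderr into the same pipe as stdout
    text=True,
)
merged_lines = result.stdout.splitlines()
print(merged_lines)
```

The relative order of the two lines is not guaranteed, because each stream is buffered separately in the child.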

Test Plan:
* Built an ephemeral package with my changes: 
https://www.internalfb.com/intern/servicelab/build/736441058/

* Verified that a job with these changes indeed prints out the expected message to stdout: https://fburl.com/mlhub/jgbqk8eg

* I will export to GH and run CI/CD tests.

Rollback Plan:
steps:
  - manual.note:
      content: >-
        Just reverting this diff should be sufficient. Since this change is in
        CUDA kernels, I do not believe there is a way to change the error
        message via a JK.

Reviewed By: mradmila

Differential Revision: D77904753
@drdarshan drdarshan requested review from eqy and syed-ahmed as code owners July 10, 2025 02:47
@pytorch-bot pytorch-bot bot added the release notes: cuda release notes category label Jul 10, 2025
@pytorch-bot
pytorch-bot bot commented Jul 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157996

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 815e7de with merge base 3404c1f (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D77904753

Comment on lines +21 to +24
printf("%s:%d:%s: block: [%d,%d,%d], thread: [%d,%d,%d] "
"Invalid input! In `repeat_interleave`, the `output_size` argument (%ld) must be the same as the sum of the elements in the `repeats` tensor (%ld).\n",
__FILE__, __LINE__, __func__,blockIdx.x, blockIdx.y, blockIdx.z, threadIdx.x, threadIdx.y, threadIdx.z, result_size, cumsum_ptr[size - 1 ]);
CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1])
Collaborator

Suggested change
printf("%s:%d:%s: block: [%d,%d,%d], thread: [%d,%d,%d] "
"Invalid input! In `repeat_interleave`, the `output_size` argument (%ld) must be the same as the sum of the elements in the `repeats` tensor (%ld).\n",
__FILE__, __LINE__, __func__,blockIdx.x, blockIdx.y, blockIdx.z, threadIdx.x, threadIdx.y, threadIdx.z, result_size, cumsum_ptr[size - 1 ]);
CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1])
CUDA_KERNEL_ASSERT_MSG(result_size == cumsum_ptr[size - 1]), fmt::printf("%s:%d:%s: block: [%d,%d,%d], thread: [%d,%d,%d] "
"Invalid input! In `repeat_interleave`, the `output_size` argument (%ld) must be the same as the sum of the elements in the `repeats` tensor (%ld).\n",
__FILE__, __LINE__, __func__,blockIdx.x, blockIdx.y, blockIdx.z, threadIdx.x, threadIdx.y, threadIdx.z, result_size, cumsum_ptr[size - 1 ]);

A bit cleaner using fmtlib here

Contributor Author

Much nicer, let me try this out, thank you!

Contributor Author

@Skylion007 - I just tried this, and it looks like fmt cannot be used inside device code:

error: calling a __host__ function("int fmt::v9::printf<char [148], long, long , (int)0> (const T1 &, const T2 &...)") from a __global__ function("compute_cuda_kernel<int> ") is not allowed

Would you mind providing some examples of how messages are formatted inside kernels in PyTorch? I looked for some and it seems we generally only print static strings. Thank you.

Collaborator

Ah... :(

// TODO: This doesn't assert the message because I (chilli) couldn't figure out

Contributor Author

I believe that CUDA_KERNEL_ASSERT_MSG does work on some platforms but needs to have the message pre-allocated. Doing that would require me to reimplement something like sprintf in device code, which seemed like more effort. Plain printf does seem to be acceptable for some use cases. Thank you.

int64_t result_size) {
CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]);
if (C10_UNLIKELY((result_size != cumsum_ptr[size - 1]))) {
printf("%s:%d:%s: block: [%d,%d,%d], thread: [%d,%d,%d] "
Collaborator

nit: this kernel is launched with 1d grid and block, so no point in printing .y and .z fields

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 11, 2025
@drdarshan
Contributor Author

@eqy , @syed-ahmed - would you have any feedback on this PR? And if not, would you mind approving it? Looks like it needs your approval to ship. Thank you!

@eqy
Collaborator

eqy commented Jul 14, 2025

FYI, approval from ngimel should be enough to merge

@eqy
Collaborator

eqy commented Jul 14, 2025

@pytorchmergebot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

cnphil added a commit to cnphil/pytorch that referenced this pull request Aug 15, 2025
…filter in torchrun

Summary:
Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that were `tee`'ed to `stdout` and `stderr`.

The general idea is to:

- Duplicate the `tee`s on `stdout` and `stderr` to separate files, `filtered_stdout.log` and `filtered_stderr.log`, respectively.
- In these files, as their names suggest, only log lines matching a customizable filter.
- Later on in another PR, append the contents of these files to the reply file.

Outline of changes in this PR:

- Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter.
- Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them.
- In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created.
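
The filtering idea in the outline can be sketched in a few lines. This is not the actual `TailLog` code: `copy_matching_lines` and the sample log lines are hypothetical, and the filter is assumed to be a regular expression.

```python
import io
import re


def copy_matching_lines(source, dest, pattern):
    """Copy lines from `source` to `dest` only when they match `pattern`.

    Hypothetical stand-in for a filtered, duplicated log stream.
    Returns the number of lines written.
    """
    regex = re.compile(pattern)
    written = 0
    for line in source:
        if regex.search(line):
            dest.write(line)
            written += 1
    return written


# In-memory streams stand in for stdout and filtered_stdout.log.
stdout_stream = io.StringIO(
    "step 1 ok\n"
    "Invalid input! (sample line)\n"
    "step 2 ok\n"
)
filtered = io.StringIO()
count = copy_matching_lines(stdout_stream, filtered, r"Invalid input!")
print(count, filtered.getvalue(), end="")
```

Keeping the filter as a plain pattern argument mirrors the "customizable filter" described above without assuming anything about how the real parameters are wired.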

Test Plan:
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test
```
```
Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688
Network: Up: 398B  Down: 44MiB  (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a)
Analyzing targets. Remaining     0/200
Executing actions. Remaining     0/12856                                                                                                                                        0.1s exec time total
Command: test.     Finished 1 local
Time elapsed: 17:37.9s
Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test
```
```
Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262
Network: Up: 94KiB  Down: 417MiB  (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922)
Analyzing targets. Remaining     0/3                                                                                                                                            536 actions, 555 artifacts declared
Executing actions. Remaining     0/186                                                                                                                                          1:05.5s exec time total
Command: test.     Finished 7 local, 1 remote, 115 cache (93% hit)                                                                                                              37.0s exec time cached (56%)
Time elapsed: 1:11.5s
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Rollback Plan:

Differential Revision: D80188995
mjkatmeta added a commit to mjkatmeta/pytorch that referenced this pull request Sep 12, 2025
…MSG` (pytorch#160129)

Summary:

This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows.

We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message.

I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side).

# Alternatives
* We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity.
* If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to pytorch#157996. But the main downside here is the performance hit, so let's have an organized way of doing it first.

# Risks/Problems
* We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU.
* Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "benchmarks" below). However, there are changes to the generated PTX that could result in performance problems later (see "changes in generated PTX" below).
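
The first risk is the classic format-string hazard. As a loose analogy only (Python's `%` operator raises an exception where C `printf` would read a bogus argument), the sketch below shows what happens when a file name containing `%s` is spliced into the format string instead of being passed as an argument. The file name is made up.

```python
# Made-up file name containing a format specifier.
filename = "weird%sname.cu"

# Unsafe: the file name becomes part of the format string, so its "%s"
# consumes the argument that was meant for "%d".
unsafe_format = filename + ": value %d"
try:
    unsafe_result = unsafe_format % (42,)
except TypeError as exc:
    unsafe_result = f"TypeError: {exc}"

# Safe: the file name is passed as data and is never interpreted.
safe_result = "%s: value %d" % (filename, 42)

print(unsafe_result)
print(safe_result)
```

Passing the file name through its own `%s` placeholder is the usual way to avoid this class of problem.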

# Benchmarks

* I ran the following benchmarks several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f
* Results are here -- I couldn't find a significant difference before or after  https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431

# Change in generated PTX

This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu):
```
buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda
# then use the printed .so file like this:
~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so
```

## with printf
This is the version of the code that appears in this diff:

https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a

## without printf
I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with:
```
CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]);
```
https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d

(Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.)

Test Plan:
Running this minimal test case:

```
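import torch
def main():
    x = torch.ones(10, dtype=torch.int64, device="cuda:0")
    torch.repeat_interleave(x, x, output_size=0)
    torch.cuda.synchronize()  # device-side asserts surface at a sync point
if __name__ == "__main__":
    main()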
```

Now we see the new message (from printf) alongside the assert failure:

```
$ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors
[...]
[CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10).

fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed.
[...]
```
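
Because the message carries the `CUDA_KERNEL_ASSERT` keyword, the mismatched values can be pulled out of a log with a short script. The regular expression below is an assumption based only on the sample line above and may need adjusting for other messages.

```python
import re

# Sample line copied from the output above.
log_line = (
    "[CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: "
    "compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: "
    "`result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, "
    "the `output_size` argument (0) must be the same as the sum of the elements "
    "in the `repeats` tensor (10)."
)

# Assumed pattern: the two parenthesised integers after the known phrases.
pattern = re.compile(
    r"\[CUDA_KERNEL_ASSERT\].*`output_size` argument \((-?\d+)\)"
    r".*`repeats` tensor \((-?\d+)\)"
)

match = pattern.search(log_line)
if match:
    output_size, repeats_sum = (int(g) for g in match.groups())
    print(output_size, repeats_sum, repeats_sum - output_size)
```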

Rollback Plan:

Reviewed By: cnphil, mradmila

Differential Revision: D79310684
pytorchmergebot pushed a commit that referenced this pull request Sep 16, 2025
…MSG` (#160129)

Pull Request resolved: #160129
Approved by: https://github.com/ngimel
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
…MSG` (pytorch#160129)

mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
…MSG` (pytorch#160129)

cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
…MSG` (pytorch#160129)

This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows.

We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message.

I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side).

# Alternatives
* We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity.
* If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to pytorch#157996. But the main downside here is the performance hit, so let's have an organized way of doing it first.

# Risks/Problems
* We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU.
* Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "benchmarks" below). However, there are changes to the generated PTX that could result in performance problems later (see "changes in generated PTX" below).

# Benchmarks

* I ran the following benchmarks a several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f
* Results are here -- I couldn't find a significant difference before or after  https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431

# Change in generated PTX

This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu):
```
buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda
# then use the printed .so file like this:
~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so
```

## with printf
This is the version of the code that appears in this diff:

https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a

## without printf
I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with:
```
CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]);
```
https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d

(Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.)

Test Plan:
Running this minimal test case:

```
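import torch
def main():
    x = torch.ones(10, dtype=torch.int64, device="cuda:0")
    # sum(repeats) is 10 but output_size is 0, so the device-side assert fires
    torch.repeat_interleave(x, x, output_size=0)
if __name__ == "__main__":
    main()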
```
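
For reference, the invariant this test violates can be stated in a few lines of plain Python. This is only a sketch of the semantics, not the PyTorch implementation; the function name `repeat_interleave_ref` is hypothetical, and it raises a host-side `ValueError` where the CUDA kernel would trip the device-side assert.

```python
def repeat_interleave_ref(values, repeats, output_size=None):
    """Pure-Python reference: repeat values[i] exactly repeats[i] times."""
    total = sum(repeats)
    # The three-argument form requires output_size to equal sum(repeats).
    if output_size is not None and output_size != total:
        raise ValueError(
            f"output_size ({output_size}) must equal sum(repeats) ({total})"
        )
    out = []
    for value, count in zip(values, repeats):
        out.extend([value] * count)
    return out


print(repeat_interleave_ref([1, 2], [2, 1]))  # prints [1, 1, 2]
```

With ten ones as both `values` and `repeats`, `sum(repeats)` is 10, so passing `output_size=0` breaks the invariant, which is exactly what the test case above does on the GPU.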

Now we see the new message (from printf) alongside the assert failure:

```
$ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors
[...]
[CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10).

fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed.
[...]
```
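
Because the message carries the `CUDA_KERNEL_ASSERT` keyword and the actual values, it can be picked out of a log mechanically. A minimal sketch, assuming the message format shown above stays as is; the helper name `parse_repeat_interleave_assert` is hypothetical and not part of PyTorch:

```python
import re

# Matches the parametrized message shown above (format assumed, not guaranteed).
_PATTERN = re.compile(
    r"\[CUDA_KERNEL_ASSERT\].*`output_size` argument \((\d+)\)"
    r".*`repeats` tensor \((\d+)\)"
)


def parse_repeat_interleave_assert(line):
    """Return (output_size, repeats_sum, difference), or None if no match."""
    match = _PATTERN.search(line)
    if match is None:
        return None
    output_size = int(match.group(1))
    repeats_sum = int(match.group(2))
    return output_size, repeats_sum, repeats_sum - output_size


sample = (
    "[CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: "
    "compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: "
    "`result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, "
    "the `output_size` argument (0) must be the same as the sum of the elements "
    "in the `repeats` tensor (10)."
)
print(parse_repeat_interleave_assert(sample))  # prints (0, 10, 10)
```

The third element is the gap between the two values, which is often the useful clue when tracking down which feature's lengths are off.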

Rollback Plan:

Reviewed By: mradmila

Differential Revision: D79310684

Pull Request resolved: pytorch#160129
Approved by: https://github.com/ngimel
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
…MSG` (pytorch#160129)

cnphil added a commit to cnphil/pytorch that referenced this pull request Oct 21, 2025
…filter in torchrun (pytorch#160712)

Summary:

Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that were `tee`'ed to `stdout` and `stderr`.

The general idea is to:

- Duplicate the `tee`s on `stdout` and `stderr` to separate files, `filtered_stdout.log` and `filtered_stderr.log`, respectively.
- As their names suggest, these files contain only the log lines that match a customizable filter.
- Later on in another PR, append the contents of these files to the reply file.

Outline of changes in this PR:

- Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter.
- Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them.
- In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created.
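
The filtering idea can be sketched in a few lines. This is an illustration only, not the `TailLog` implementation: the function name `duplicate_filtered` is hypothetical, and it assumes a filter is a plain substring (the real filters may be matched differently).

```python
import io
from typing import Iterable, TextIO


def duplicate_filtered(lines: Iterable[str], dst: TextIO, filters: Iterable[str]) -> int:
    """Copy to dst only the lines that contain at least one filter substring.

    Returns the number of lines written. With no filters, nothing is
    written, mirroring "no filters, no duplicated stream".
    """
    patterns = list(filters)
    if not patterns:
        return 0
    written = 0
    for line in lines:
        if any(pattern in line for pattern in patterns):
            # Keep one log line per output line.
            dst.write(line if line.endswith("\n") else line + "\n")
            written += 1
    return written


log = [
    "step 1 ok\n",
    "[CUDA_KERNEL_ASSERT] Repeat.cu:25: Assertion failed\n",
    "step 2 ok\n",
]
filtered = io.StringIO()
count = duplicate_filtered(log, filtered, ["CUDA_KERNEL_ASSERT"])
print(count, filtered.getvalue(), end="")
```

Here `filtered` stands in for a file such as `filtered_stdout.log`; only the assert line ends up in it.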

Test Plan:
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test
```
```
Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688
Network: Up: 398B  Down: 44MiB  (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a)
Analyzing targets. Remaining     0/200
Executing actions. Remaining     0/12856     0.1s exec time total
Command: test.     Finished 1 local
Time elapsed: 17:37.9s
Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test
```
```
Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262
Network: Up: 94KiB  Down: 417MiB  (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922)
Analyzing targets. Remaining     0/3     536 actions, 555 artifacts declared
Executing actions. Remaining     0/186     1:05.5s exec time total
Command: test.     Finished 7 local, 1 remote, 115 cache (93% hit)     37.0s exec time cached (56%)
Time elapsed: 1:11.5s
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test
```
```
Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e
Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118
Network: Up: 1.0MiB  Down: 2.9GiB  (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72)
Analyzing targets. Remaining     0/5     533 actions, 552 artifacts declared
Executing actions. Remaining     0/176     1:22.7s exec time total
Command: test.     Finished 51 local, 13 remote, 50 cache (44% hit)     19.8s exec time cached (23%)
Time elapsed: 1:45.2s
Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test
[DISABLED]
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test
[DISABLED]
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test
```

Reviewed By: mradmila

Differential Revision: D80188995
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test
[DISABLED]
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test
```

Reviewed By: mradmila

Differential Revision: D80188995
cnphil added a commit to cnphil/pytorch that referenced this pull request Oct 21, 2025
…filter in torchrun (pytorch#160712)

Summary:

Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that were `tee`'ed to `stdout` and `stderr`.

The general idea is to:

- Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively.
- As the names suggest, write to these files only the log lines that match a customizable filter.
- Later on in another PR, append the contents of these files to the reply file.

Outline of changes in this PR:

- Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter.
- Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them.
- In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created.
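
The filtering idea in the outline above can be sketched as follows. This is a minimal illustration and not the `TailLog` implementation from this PR: the function name `write_filtered_copy` is hypothetical, and plain substring matching is an assumption, since the summary does not say how filters are matched.

```python
from typing import Iterable, Sequence


def write_filtered_copy(
    lines: Iterable[str], dst_path: str, filters: Sequence[str]
) -> int:
    """Append lines that match any filter to dst_path; return how many were written.

    Hypothetical helper: the name, signature, and substring matching are
    assumptions made for illustration only.
    """
    # Mirror the described behavior: with no filters, no duplicated
    # stream (and therefore no file) is created.
    if not filters:
        return 0

    written = 0
    with open(dst_path, "a", encoding="utf-8") as dst:
        for line in lines:
            # Keep a line only if at least one filter string occurs in it.
            if any(f in line for f in filters):
                dst.write(line if line.endswith("\n") else line + "\n")
                written += 1
    return written
```

With a filter such as `"Invalid input!"`, only lines like the `repeat_interleave` message from pytorch#157996 would land in `filtered_stdout.log`, while the original stream is left untouched.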

Test Plan:
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test
```
```
Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688
Network: Up: 398B  Down: 44MiB  (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a)
Analyzing targets. Remaining     0/200
Executing actions. Remaining     0/12856                                                                                                                                        0.1s exec time total
Command: test.     Finished 1 local
Time elapsed: 17:37.9s
Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test
```
```
Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262
Network: Up: 94KiB  Down: 417MiB  (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922)
Analyzing targets. Remaining     0/3                                                                                                                                            536 actions, 555 artifacts declared
Executing actions. Remaining     0/186                                                                                                                                          1:05.5s exec time total
Command: test.     Finished 7 local, 1 remote, 115 cache (93% hit)                                                                                                              37.0s exec time cached (56%)
Time elapsed: 1:11.5s
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test
```
```
Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e
Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118
Network: Up: 1.0MiB  Down: 2.9GiB  (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72)
Analyzing targets. Remaining     0/5                                                                                                                                          533 actions, 552 artifacts declared
Executing actions. Remaining     0/176                                                                                                                                        1:22.7s exec time total
Command: test.     Finished 51 local, 13 remote, 50 cache (44% hit)                                                                                                           19.8s exec time cached (23%)
Time elapsed: 1:45.2s
Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test
[DISABLED]
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test
[DISABLED]
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test
```

Reviewed By: mradmila

Differential Revision: D80188995
cnphil added a commit to cnphil/pytorch that referenced this pull request Oct 21, 2025
…filter in torchrun (pytorch#160712)

cnphil added a commit to cnphil/pytorch that referenced this pull request Oct 22, 2025
…filter in torchrun (pytorch#160712)

Summary:

Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that were `tee`'d to `stdout` and `stderr`.

The general idea is to:

- Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively.
- As the names suggest, write to these files only the log lines that match a customizable filter.
- Later on in another PR, append the contents of these files to the reply file.

Outline of changes in this PR:

- Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter.
- Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them.
- In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created.

Test Plan:
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test
```
```
Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688
Network: Up: 398B  Down: 44MiB  (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a)
Analyzing targets. Remaining     0/200
Executing actions. Remaining     0/12856                                                                                                                                        0.1s exec time total
Command: test.     Finished 1 local
Time elapsed: 17:37.9s
Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test
```
```
Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262
Network: Up: 94KiB  Down: 417MiB  (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922)
Analyzing targets. Remaining     0/3                                                                                                                                            536 actions, 555 artifacts declared
Executing actions. Remaining     0/186                                                                                                                                          1:05.5s exec time total
Command: test.     Finished 7 local, 1 remote, 115 cache (93% hit)                                                                                                              37.0s exec time cached (56%)
Time elapsed: 1:11.5s
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test
```
```
Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e
Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118
Network: Up: 1.0MiB  Down: 2.9GiB  (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72)
Analyzing targets. Remaining     0/5                                                                                                                                          533 actions, 552 artifacts declared
Executing actions. Remaining     0/176                                                                                                                                        1:22.7s exec time total
Command: test.     Finished 51 local, 13 remote, 50 cache (44% hit)                                                                                                           19.8s exec time cached (23%)
Time elapsed: 1:45.2s
Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test
[DISABLED]
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test
[DISABLED]
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test
```

Reviewed By: fduwjj, mradmila

Differential Revision: D80188995
cnphil added a commit to cnphil/pytorch that referenced this pull request Oct 22, 2025
…filter in torchrun (pytorch#160712)

cnphil added a commit to cnphil/pytorch that referenced this pull request Oct 22, 2025
…filter in torchrun (pytorch#160712)

pytorchmergebot pushed a commit that referenced this pull request Oct 23, 2025
…filter in torchrun (#160712)

Summary:
Part of an effort to extract some important error logs (e.g. [#157996](#157996)) that were `tee`'d to `stdout` and `stderr`.

The general idea is to:

- Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively.
- As the names suggest, write to these files only the log lines that match a customizable filter.
- Later on in another PR, append the contents of these files to the reply file.

Outline of changes in this PR:

- Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter.
- Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them.
- In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created.

Test Plan:
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test
```
```
Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688
Network: Up: 398B  Down: 44MiB  (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a)
Analyzing targets. Remaining     0/200
Executing actions. Remaining     0/12856                                                                                                                                        0.1s exec time total
Command: test.     Finished 1 local
Time elapsed: 17:37.9s
Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test
```
```
Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262
Network: Up: 94KiB  Down: 417MiB  (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922)
Analyzing targets. Remaining     0/3                                                                                                                                            536 actions, 555 artifacts declared
Executing actions. Remaining     0/186                                                                                                                                          1:05.5s exec time total
Command: test.     Finished 7 local, 1 remote, 115 cache (93% hit)                                                                                                              37.0s exec time cached (56%)
Time elapsed: 1:11.5s
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Rollback Plan:

Differential Revision: D80188995

Pull Request resolved: #160712
Approved by: https://github.com/fduwjj

Labels: ciflow/trunk, fb-exported, Merged, release notes: cuda
