-
Notifications
You must be signed in to change notification settings - Fork 25.7k
Slightly improve error message from repeat_interleave kernel #157996
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Summary: In many investigations relating to invalid feature values, the three-argument form of `repeat_interleave` currently prints the following message if there is an inconsistency between `sum(repeats)` and `output_size`: ``` Assertion `result_size == cumsum_ptr[size - 1]` failed. ``` This is a bit hard for model authors to understand so I made the error slightly more comprehensible. After the fix the stdout contains the actual values of these parameters: https://fburl.com/mlhub/cfyyhh3q ``` Invalid input! In `repeat_interleave`, the `output_size` argument (949487) must be the same as the sum of the elements in the `repeats` tensor (949687). ``` In many cases, this is potentially useful information since we know for example that the difference between the two values above (949687-949487=200) happens to be the lengths of one of the features. ## What are my concerns with this change? 1. Outputs from `__assert_fail` go to `stderr` whereas `printf` writes to `stdout`. This is not the usual debugging flow where all logs can be found in `stderr`. I could not find a way to redirect `printf` to stderr or `__assert_fail` to stdout 2. Two checks happen instead of one in the error path. I wanted to preserve the semantics of what happens inside `__assert_fail`. 3. I have not seen this pattern in other PyTorch kernels but `repeat_interleave` with three arguments seems special in other ways too. Test Plan: * Built an ephemeral package with my changes: https://www.internalfb.com/intern/servicelab/build/736441058/ * Verified that a job with these changes indeed prints out the expected message to stdout: https://fburl.com/mlhub/jgbqk8eg * I will export to GH and run CI/CD tests. Rollback Plan: steps: - manual.note: content: >- Just reverting this diff should be sufficient. Since this change is in CUDA kernels, I do not believe there is a way to change the error message via a JK. Reviewed By: mradmila Differential Revision: D77904753
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157996
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 815e7de with merge base 3404c1f ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
This pull request was exported from Phabricator. Differential Revision: D77904753 |
printf("%s:%d:%s: block: [%d,%d,%d], thread: [%d,%d,%d] " | ||
"Invalid input! In `repeat_interleave`, the `output_size` argument (%ld) must be the same as the sum of the elements in the `repeats` tensor (%ld).\n", | ||
__FILE__, __LINE__, __func__,blockIdx.x, blockIdx.y, blockIdx.z, threadIdx.x, threadIdx.y, threadIdx.z, result_size, cumsum_ptr[size - 1 ]); | ||
CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
printf("%s:%d:%s: block: [%d,%d,%d], thread: [%d,%d,%d] " | |
"Invalid input! In `repeat_interleave`, the `output_size` argument (%ld) must be the same as the sum of the elements in the `repeats` tensor (%ld).\n", | |
__FILE__, __LINE__, __func__,blockIdx.x, blockIdx.y, blockIdx.z, threadIdx.x, threadIdx.y, threadIdx.z, result_size, cumsum_ptr[size - 1 ]); | |
CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]) | |
CUDA_KERNEL_ASSERT_MSG(result_size == cumsum_ptr[size - 1]), fmt::printf("%s:%d:%s: block: [%d,%d,%d], thread: [%d,%d,%d] " | |
"Invalid input! In `repeat_interleave`, the `output_size` argument (%ld) must be the same as the sum of the elements in the `repeats` tensor (%ld).\n", | |
__FILE__, __LINE__, __func__,blockIdx.x, blockIdx.y, blockIdx.z, threadIdx.x, threadIdx.y, threadIdx.z, result_size, cumsum_ptr[size - 1 ]); |
A bit cleaner using fmtlib here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much nicer, let me try this out, thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Skylion007 - I just tried this and looks like fmt
cannot be used inside device code..
error: calling a __host__ function("int fmt::v9::printf<char [148], long, long , (int)0> (const T1 &, const T2 &...)") from a __global__ function("compute_cuda_kernel<int> ") is not allowed
Would you mind providing some examples of how messages are formatted inside kernels in PyTorch? I looked for some and it seems we generally only print static strings. Thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah... :(
Line 372 in 7caf6c8
// TODO: This doesn't assert the message because I (chilli) couldn't figure out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that CUDA_KERNEL_ASSERT_MSG
does work on some platforms but needs to have the message pre-allocated. To do this required me to reimplement something like sprintf
in device code which seemed a bit more effort.. The vanilla printf does seem to be acceptable for some use-cases.. Thank you.
int64_t result_size) { | ||
CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]); | ||
if (C10_UNLIKELY((result_size != cumsum_ptr[size - 1]))) { | ||
printf("%s:%d:%s: block: [%d,%d,%d], thread: [%d,%d,%d] " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this kernel is launched with 1d grid and block, so no point in printing .y
and .z
fields
@eqy , @syed-ahmed - would you have any feedback on this PR? And if not, would you mind approving it? Looks like it needs your approval to ship. Thank you! |
FYI, approval from ngimel should be enough to merge |
@pytorchmergebot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
…filter in torchrun Summary: Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that was `tee`'ed to `stdout` and `stderr`. The general idea is to: - Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively. - In these files, as its name suggests, only log lines matching a customizable filter. - Later on in another PR, append the contents of these files to the reply file. Outline of changes in this PR: - Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter. - Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them. - In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created. Test Plan: ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a) Analyzing targets. Remaining 0/200 Executing actions. Remaining 0/12856 0.1s exec time total Command: test. Finished 1 local Time elapsed: 17:37.9s Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test ``` ``` Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922) Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared Executing actions. Remaining 0/186 1:05.5s exec time total Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%) Time elapsed: 1:11.5s Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` Rollback Plan: Differential Revision: D80188995
…MSG` (pytorch#160129) Summary: This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows. We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message. I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to pytorch#157996. But the main downside here is the performance hit, so let's have an organized way of doing it first. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "benchmarks" below). However, there are changes to the generated PTX that could result in performance problems later (see "changes in generated PTX" below). # Benchmarks * I ran the following benchmarks a several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f * Results are here -- I couldn't find a significant difference before or after https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431 # Change in generated PTX This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu): ``` buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda # then use the printed .so file like this: ~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so ``` ## with printf This is the version of the code that appears in this diff: https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a ## without printf I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with: ``` CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]); ``` https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d (Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.) Test Plan: Running this minimal test case: ``` import torch def main(): x = torch.ones(10, dtype=torch.int64, device="cuda:0") torch.repeat_interleave(x, x, output_size=0) ``` Now we see the new message (from printf) alongside the assert failure: ``` $ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors [...] [CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10). fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed. [...[ ``` Rollback Plan: Reviewed By: cnphil, mradmila Differential Revision: D79310684
…MSG` (#160129) This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows. We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message. I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to #157996. But the main downside here is the performance hit, so let's have an organized way of doing it first. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "benchmarks" below). However, there are changes to the generated PTX that could result in performance problems later (see "changes in generated PTX" below). # Benchmarks * I ran the following benchmarks a several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f * Results are here -- I couldn't find a significant difference before or after https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431 # Change in generated PTX This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu): ``` buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda # then use the printed .so file like this: ~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so ``` ## with printf This is the version of the code that appears in this diff: https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a ## without printf I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with: ``` CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]); ``` https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d (Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.) Test Plan: Running this minimal test case: ``` import torch def main(): x = torch.ones(10, dtype=torch.int64, device="cuda:0") torch.repeat_interleave(x, x, output_size=0) ``` Now we see the new message (from printf) alongside the assert failure: ``` $ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors [...] [CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10). fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed. [...[ ``` Rollback Plan: Reviewed By: mradmila Differential Revision: D79310684 Pull Request resolved: #160129 Approved by: https://github.com/ngimel
…MSG` (pytorch#160129) This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows. We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message. I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to pytorch#157996. But the main downside here is the performance hit, so let's have an organized way of doing it first. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "benchmarks" below). However, there are changes to the generated PTX that could result in performance problems later (see "changes in generated PTX" below). # Benchmarks * I ran the following benchmarks a several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f * Results are here -- I couldn't find a significant difference before or after https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431 # Change in generated PTX This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu): ``` buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda # then use the printed .so file like this: ~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so ``` ## with printf This is the version of the code that appears in this diff: https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a ## without printf I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with: ``` CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]); ``` https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d (Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.) Test Plan: Running this minimal test case: ``` import torch def main(): x = torch.ones(10, dtype=torch.int64, device="cuda:0") torch.repeat_interleave(x, x, output_size=0) ``` Now we see the new message (from printf) alongside the assert failure: ``` $ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors [...] [CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10). fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed. [...[ ``` Rollback Plan: Reviewed By: mradmila Differential Revision: D79310684 Pull Request resolved: pytorch#160129 Approved by: https://github.com/ngimel
…MSG` (pytorch#160129) This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows. We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message. I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to pytorch#157996. But the main downside here is the performance hit, so let's have an organized way of doing it first. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "benchmarks" below). However, there are changes to the generated PTX that could result in performance problems later (see "changes in generated PTX" below). # Benchmarks * I ran the following benchmarks a several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f * Results are here -- I couldn't find a significant difference before or after https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431 # Change in generated PTX This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu): ``` buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda # then use the printed .so file like this: ~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so ``` ## with printf This is the version of the code that appears in this diff: https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a ## without printf I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with: ``` CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]); ``` https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d (Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.) Test Plan: Running this minimal test case: ``` import torch def main(): x = torch.ones(10, dtype=torch.int64, device="cuda:0") torch.repeat_interleave(x, x, output_size=0) ``` Now we see the new message (from printf) alongside the assert failure: ``` $ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors [...] [CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10). fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed. [...[ ``` Rollback Plan: Reviewed By: mradmila Differential Revision: D79310684 Pull Request resolved: pytorch#160129 Approved by: https://github.com/ngimel
…MSG` (pytorch#160129) This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows. We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message. I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to pytorch#157996. But the main downside here is the performance hit, so let's have an organized way of doing it first. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "benchmarks" below). However, there are changes to the generated PTX that could result in performance problems later (see "changes in generated PTX" below). # Benchmarks * I ran the following benchmarks a several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f * Results are here -- I couldn't find a significant difference before or after https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431 # Change in generated PTX This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu): ``` buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda # then use the printed .so file like this: ~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so ``` ## with printf This is the version of the code that appears in this diff: https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a ## without printf I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with: ``` CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]); ``` https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d (Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.) Test Plan: Running this minimal test case: ``` import torch def main(): x = torch.ones(10, dtype=torch.int64, device="cuda:0") torch.repeat_interleave(x, x, output_size=0) ``` Now we see the new message (from printf) alongside the assert failure: ``` $ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors [...] [CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10). fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed. [...[ ``` Rollback Plan: Reviewed By: mradmila Differential Revision: D79310684 Pull Request resolved: pytorch#160129 Approved by: https://github.com/ngimel
…MSG` (pytorch#160129) This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows. We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message. I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). # Alternatives * We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to pytorch#157996. But the main downside here is the performance hit, so let's have an organized way of doing it first. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "benchmarks" below). However, there are changes to the generated PTX that could result in performance problems later (see "changes in generated PTX" below). # Benchmarks * I ran the following benchmarks a several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f * Results are here -- I couldn't find a significant difference before or after https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431 # Change in generated PTX This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu): ``` buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda # then use the printed .so file like this: ~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so ``` ## with printf This is the version of the code that appears in this diff: https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a ## without printf I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with: ``` CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]); ``` https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d (Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.) Test Plan: Running this minimal test case: ``` import torch def main(): x = torch.ones(10, dtype=torch.int64, device="cuda:0") torch.repeat_interleave(x, x, output_size=0) ``` Now we see the new message (from printf) alongside the assert failure: ``` $ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors [...] [CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10). fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed. [...[ ``` Rollback Plan: Reviewed By: mradmila Differential Revision: D79310684 Pull Request resolved: pytorch#160129 Approved by: https://github.com/ngimel
…filter in torchrun (pytorch#160712) Summary: Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that was `tee`'ed to `stdout` and `stderr`. The general idea is to: - Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively. - In these files, as its name suggests, only log lines matching a customizable filter. - Later on in another PR, append the contents of these files to the reply file. Outline of changes in this PR: - Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter. - Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them. - In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created. Test Plan: ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a) Analyzing targets. Remaining 0/200 Executing actions. Remaining 0/12856 0.1s exec time total Command: test. Finished 1 local Time elapsed: 17:37.9s Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test ``` ``` Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922) Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared Executing actions. Remaining 0/186 1:05.5s exec time total Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%) Time elapsed: 1:11.5s Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118 Network: Up: 1.0MiB Down: 2.9GiB (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72) Analyzing targets. Remaining 0/5 533 actions, 552 artifacts declared Executing actions. Remaining 0/176 1:22.7s exec time total Command: test. Finished 51 local, 13 remote, 50 cache (44% hit) 19.8s exec time cached (23%) Time elapsed: 1:45.2s Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test ``` Reviewed By: mradmila Differential Revision: D80188995
…filter in torchrun (pytorch#160712) Summary: Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that was `tee`'ed to `stdout` and `stderr`. The general idea is to: - Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively. - In these files, as its name suggests, only log lines matching a customizable filter. - Later on in another PR, append the contents of these files to the reply file. Outline of changes in this PR: - Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter. - Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them. - In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created. Test Plan: ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a) Analyzing targets. Remaining 0/200 Executing actions. Remaining 0/12856 0.1s exec time total Command: test. Finished 1 local Time elapsed: 17:37.9s Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test ``` ``` Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922) Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared Executing actions. Remaining 0/186 1:05.5s exec time total Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%) Time elapsed: 1:11.5s Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118 Network: Up: 1.0MiB Down: 2.9GiB (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72) Analyzing targets. Remaining 0/5 533 actions, 552 artifacts declared Executing actions. Remaining 0/176 1:22.7s exec time total Command: test. Finished 51 local, 13 remote, 50 cache (44% hit) 19.8s exec time cached (23%) Time elapsed: 1:45.2s Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test ``` Reviewed By: mradmila Differential Revision: D80188995
…filter in torchrun (pytorch#160712) Summary: Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that was `tee`'ed to `stdout` and `stderr`. The general idea is to: - Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively. - In these files, as its name suggests, only log lines matching a customizable filter. - Later on in another PR, append the contents of these files to the reply file. Outline of changes in this PR: - Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter. - Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them. - In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created. Test Plan: ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a) Analyzing targets. Remaining 0/200 Executing actions. Remaining 0/12856 0.1s exec time total Command: test. Finished 1 local Time elapsed: 17:37.9s Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test ``` ``` Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922) Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared Executing actions. Remaining 0/186 1:05.5s exec time total Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%) Time elapsed: 1:11.5s Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118 Network: Up: 1.0MiB Down: 2.9GiB (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72) Analyzing targets. Remaining 0/5 533 actions, 552 artifacts declared Executing actions. Remaining 0/176 1:22.7s exec time total Command: test. Finished 51 local, 13 remote, 50 cache (44% hit) 19.8s exec time cached (23%) Time elapsed: 1:45.2s Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test ``` Reviewed By: mradmila Differential Revision: D80188995
…filter in torchrun (pytorch#160712) Summary: Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that was `tee`'ed to `stdout` and `stderr`. The general idea is to: - Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively. - In these files, as its name suggests, only log lines matching a customizable filter. - Later on in another PR, append the contents of these files to the reply file. Outline of changes in this PR: - Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter. - Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them. - In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created. Test Plan: ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a) Analyzing targets. Remaining 0/200 Executing actions. Remaining 0/12856 0.1s exec time total Command: test. Finished 1 local Time elapsed: 17:37.9s Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test ``` ``` Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922) Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared Executing actions. Remaining 0/186 1:05.5s exec time total Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%) Time elapsed: 1:11.5s Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118 Network: Up: 1.0MiB Down: 2.9GiB (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72) Analyzing targets. Remaining 0/5 533 actions, 552 artifacts declared Executing actions. Remaining 0/176 1:22.7s exec time total Command: test. Finished 51 local, 13 remote, 50 cache (44% hit) 19.8s exec time cached (23%) Time elapsed: 1:45.2s Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test ``` Reviewed By: mradmila Differential Revision: D80188995
…filter in torchrun (pytorch#160712) Summary: Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that was `tee`'ed to `stdout` and `stderr`. The general idea is to: - Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively. - In these files, as its name suggests, only log lines matching a customizable filter. - Later on in another PR, append the contents of these files to the reply file. Outline of changes in this PR: - Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter. - Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them. - In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created. Test Plan: ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a) Analyzing targets. Remaining 0/200 Executing actions. Remaining 0/12856 0.1s exec time total Command: test. Finished 1 local Time elapsed: 17:37.9s Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test ``` ``` Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922) Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared Executing actions. Remaining 0/186 1:05.5s exec time total Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%) Time elapsed: 1:11.5s Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118 Network: Up: 1.0MiB Down: 2.9GiB (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72) Analyzing targets. Remaining 0/5 533 actions, 552 artifacts declared Executing actions. Remaining 0/176 1:22.7s exec time total Command: test. Finished 51 local, 13 remote, 50 cache (44% hit) 19.8s exec time cached (23%) Time elapsed: 1:45.2s Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test ``` Reviewed By: mradmila Differential Revision: D80188995
…filter in torchrun (pytorch#160712) Summary: Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that was `tee`'ed to `stdout` and `stderr`. The general idea is to: - Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively. - In these files, as its name suggests, only log lines matching a customizable filter. - Later on in another PR, append the contents of these files to the reply file. Outline of changes in this PR: - Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter. - Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them. - In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created. Test Plan: ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a) Analyzing targets. Remaining 0/200 Executing actions. Remaining 0/12856 0.1s exec time total Command: test. Finished 1 local Time elapsed: 17:37.9s Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test ``` ``` Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922) Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared Executing actions. Remaining 0/186 1:05.5s exec time total Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%) Time elapsed: 1:11.5s Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118 Network: Up: 1.0MiB Down: 2.9GiB (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72) Analyzing targets. Remaining 0/5 533 actions, 552 artifacts declared Executing actions. Remaining 0/176 1:22.7s exec time total Command: test. Finished 51 local, 13 remote, 50 cache (44% hit) 19.8s exec time cached (23%) Time elapsed: 1:45.2s Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test ``` Reviewed By: mradmila Differential Revision: D80188995
…filter in torchrun (pytorch#160712) Summary: Pull Request resolved: pytorch#160712 Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that was `tee`'ed to `stdout` and `stderr`. The general idea is to: - Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively. - In these files, as its name suggests, only log lines matching a customizable filter. - Later on in another PR, append the contents of these files to the reply file. Outline of changes in this PR: - Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter. - Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them. - In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created. Test Plan: ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a) Analyzing targets. Remaining 0/200 Executing actions. Remaining 0/12856 0.1s exec time total Command: test. Finished 1 local Time elapsed: 17:37.9s Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test ``` ``` Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922) Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared Executing actions. Remaining 0/186 1:05.5s exec time total Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%) Time elapsed: 1:11.5s Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118 Network: Up: 1.0MiB Down: 2.9GiB (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72) Analyzing targets. Remaining 0/5 533 actions, 552 artifacts declared Executing actions. Remaining 0/176 1:22.7s exec time total Command: test. Finished 51 local, 13 remote, 50 cache (44% hit) 19.8s exec time cached (23%) Time elapsed: 1:45.2s Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test ``` Reviewed By: mradmila Differential Revision: D80188995
…filter in torchrun (pytorch#160712) Summary: Pull Request resolved: pytorch#160712 Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that was `tee`'ed to `stdout` and `stderr`. The general idea is to: - Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively. - In these files, as its name suggests, only log lines matching a customizable filter. - Later on in another PR, append the contents of these files to the reply file. Outline of changes in this PR: - Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter. - Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them. - In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created. Test Plan: ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a) Analyzing targets. Remaining 0/200 Executing actions. Remaining 0/12856 0.1s exec time total Command: test. Finished 1 local Time elapsed: 17:37.9s Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test ``` ``` Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922) Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared Executing actions. Remaining 0/186 1:05.5s exec time total Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%) Time elapsed: 1:11.5s Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118 Network: Up: 1.0MiB Down: 2.9GiB (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72) Analyzing targets. Remaining 0/5 533 actions, 552 artifacts declared Executing actions. Remaining 0/176 1:22.7s exec time total Command: test. Finished 51 local, 13 remote, 50 cache (44% hit) 19.8s exec time cached (23%) Time elapsed: 1:45.2s Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test ``` Reviewed By: mradmila Differential Revision: D80188995
…filter in torchrun (pytorch#160712) Summary: Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that was `tee`'ed to `stdout` and `stderr`. The general idea is to: - Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively. - In these files, as its name suggests, only log lines matching a customizable filter. - Later on in another PR, append the contents of these files to the reply file. Outline of changes in this PR: - Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter. - Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them. - In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created. Test Plan: ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a) Analyzing targets. Remaining 0/200 Executing actions. Remaining 0/12856 0.1s exec time total Command: test. Finished 1 local Time elapsed: 17:37.9s Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test ``` ``` Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922) Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared Executing actions. Remaining 0/186 1:05.5s exec time total Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%) Time elapsed: 1:11.5s Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118 Network: Up: 1.0MiB Down: 2.9GiB (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72) Analyzing targets. Remaining 0/5 533 actions, 552 artifacts declared Executing actions. Remaining 0/176 1:22.7s exec time total Command: test. Finished 51 local, 13 remote, 50 cache (44% hit) 19.8s exec time cached (23%) Time elapsed: 1:45.2s Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test ``` Reviewed By: mradmila Differential Revision: D80188995
…filter in torchrun (pytorch#160712) Summary: Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that was `tee`'ed to `stdout` and `stderr`. The general idea is to: - Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively. - In these files, as its name suggests, only log lines matching a customizable filter. - Later on in another PR, append the contents of these files to the reply file. Outline of changes in this PR: - Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter. - Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them. - In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created. Test Plan: ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a) Analyzing targets. Remaining 0/200 Executing actions. Remaining 0/12856 0.1s exec time total Command: test. Finished 1 local Time elapsed: 17:37.9s Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test ``` ``` Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922) Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared Executing actions. Remaining 0/186 1:05.5s exec time total Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%) Time elapsed: 1:11.5s Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118 Network: Up: 1.0MiB Down: 2.9GiB (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72) Analyzing targets. Remaining 0/5 533 actions, 552 artifacts declared Executing actions. Remaining 0/176 1:22.7s exec time total Command: test. Finished 51 local, 13 remote, 50 cache (44% hit) 19.8s exec time cached (23%) Time elapsed: 1:45.2s Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test ``` Reviewed By: mradmila Differential Revision: D80188995
…filter in torchrun (pytorch#160712) Summary: Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that was `tee`'ed to `stdout` and `stderr`. The general idea is to: - Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively. - In these files, as its name suggests, only log lines matching a customizable filter. - Later on in another PR, append the contents of these files to the reply file. Outline of changes in this PR: - Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter. - Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them. - In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created. Test Plan: ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a) Analyzing targets. Remaining 0/200 Executing actions. Remaining 0/12856 0.1s exec time total Command: test. Finished 1 local Time elapsed: 17:37.9s Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test ``` ``` Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922) Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared Executing actions. Remaining 0/186 1:05.5s exec time total Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%) Time elapsed: 1:11.5s Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118 Network: Up: 1.0MiB Down: 2.9GiB (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72) Analyzing targets. Remaining 0/5 533 actions, 552 artifacts declared Executing actions. Remaining 0/176 1:22.7s exec time total Command: test. Finished 51 local, 13 remote, 50 cache (44% hit) 19.8s exec time cached (23%) Time elapsed: 1:45.2s Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test ``` Reviewed By: mradmila Differential Revision: D80188995
…filter in torchrun (pytorch#160712) Summary: Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that was `tee`'ed to `stdout` and `stderr`. The general idea is to: - Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively. - In these files, as its name suggests, only log lines matching a customizable filter. - Later on in another PR, append the contents of these files to the reply file. Outline of changes in this PR: - Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter. - Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them. - In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created. Test Plan: ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a) Analyzing targets. Remaining 0/200 Executing actions. Remaining 0/12856 0.1s exec time total Command: test. Finished 1 local Time elapsed: 17:37.9s Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test ``` ``` Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922) Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared Executing actions. Remaining 0/186 1:05.5s exec time total Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%) Time elapsed: 1:11.5s Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118 Network: Up: 1.0MiB Down: 2.9GiB (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72) Analyzing targets. Remaining 0/5 533 actions, 552 artifacts declared Executing actions. Remaining 0/176 1:22.7s exec time total Command: test. Finished 51 local, 13 remote, 50 cache (44% hit) 19.8s exec time cached (23%) Time elapsed: 1:45.2s Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test ``` Reviewed By: mradmila Differential Revision: D80188995
…filter in torchrun (pytorch#160712) Summary: Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that was `tee`'ed to `stdout` and `stderr`. The general idea is to: - Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively. - In these files, as its name suggests, only log lines matching a customizable filter. - Later on in another PR, append the contents of these files to the reply file. Outline of changes in this PR: - Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter. - Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them. - In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created. Test Plan: ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a) Analyzing targets. Remaining 0/200 Executing actions. Remaining 0/12856 0.1s exec time total Command: test. Finished 1 local Time elapsed: 17:37.9s Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test ``` ``` Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922) Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared Executing actions. Remaining 0/186 1:05.5s exec time total Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%) Time elapsed: 1:11.5s Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118 Network: Up: 1.0MiB Down: 2.9GiB (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72) Analyzing targets. Remaining 0/5 533 actions, 552 artifacts declared Executing actions. Remaining 0/176 1:22.7s exec time total Command: test. Finished 51 local, 13 remote, 50 cache (44% hit) 19.8s exec time cached (23%) Time elapsed: 1:45.2s Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test ``` Reviewed By: mradmila Differential Revision: D80188995
…filter in torchrun (pytorch#160712) Summary: Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that was `tee`'ed to `stdout` and `stderr`. The general idea is to: - Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively. - In these files, as its name suggests, only log lines matching a customizable filter. - Later on in another PR, append the contents of these files to the reply file. Outline of changes in this PR: - Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter. - Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them. - In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created. Test Plan: ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a) Analyzing targets. Remaining 0/200 Executing actions. Remaining 0/12856 0.1s exec time total Command: test. Finished 1 local Time elapsed: 17:37.9s Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test ``` ``` Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922) Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared Executing actions. Remaining 0/186 1:05.5s exec time total Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%) Time elapsed: 1:11.5s Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118 Network: Up: 1.0MiB Down: 2.9GiB (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72) Analyzing targets. Remaining 0/5 533 actions, 552 artifacts declared Executing actions. Remaining 0/176 1:22.7s exec time total Command: test. Finished 51 local, 13 remote, 50 cache (44% hit) 19.8s exec time cached (23%) Time elapsed: 1:45.2s Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test ``` Reviewed By: mradmila Differential Revision: D80188995
…filter in torchrun (pytorch#160712) Summary: Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that was `tee`'ed to `stdout` and `stderr`. The general idea is to: - Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively. - In these files, as its name suggests, only log lines matching a customizable filter. - Later on in another PR, append the contents of these files to the reply file. Outline of changes in this PR: - Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter. - Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them. - In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created. Test Plan: ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a) Analyzing targets. Remaining 0/200 Executing actions. Remaining 0/12856 0.1s exec time total Command: test. Finished 1 local Time elapsed: 17:37.9s Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test ``` ``` Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922) Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared Executing actions. Remaining 0/186 1:05.5s exec time total Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%) Time elapsed: 1:11.5s Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118 Network: Up: 1.0MiB Down: 2.9GiB (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72) Analyzing targets. Remaining 0/5 533 actions, 552 artifacts declared Executing actions. Remaining 0/176 1:22.7s exec time total Command: test. Finished 51 local, 13 remote, 50 cache (44% hit) 19.8s exec time cached (23%) Time elapsed: 1:45.2s Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test ``` Reviewed By: fduwjj, mradmila Differential Revision: D80188995
…filter in torchrun (pytorch#160712) Summary: Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that was `tee`'ed to `stdout` and `stderr`. The general idea is to: - Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively. - In these files, as its name suggests, only log lines matching a customizable filter. - Later on in another PR, append the contents of these files to the reply file. Outline of changes in this PR: - Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter. - Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them. - In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created. Test Plan: ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a) Analyzing targets. Remaining 0/200 Executing actions. Remaining 0/12856 0.1s exec time total Command: test. Finished 1 local Time elapsed: 17:37.9s Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test ``` ``` Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922) Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared Executing actions. Remaining 0/186 1:05.5s exec time total Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%) Time elapsed: 1:11.5s Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118 Network: Up: 1.0MiB Down: 2.9GiB (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72) Analyzing targets. Remaining 0/5 533 actions, 552 artifacts declared Executing actions. Remaining 0/176 1:22.7s exec time total Command: test. Finished 51 local, 13 remote, 50 cache (44% hit) 19.8s exec time cached (23%) Time elapsed: 1:45.2s Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test ``` Reviewed By: fduwjj, mradmila Differential Revision: D80188995
…filter in torchrun (pytorch#160712) Summary: Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that was `tee`'ed to `stdout` and `stderr`. The general idea is to: - Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively. - In these files, as its name suggests, only log lines matching a customizable filter. - Later on in another PR, append the contents of these files to the reply file. Outline of changes in this PR: - Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter. - Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them. - In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created. Test Plan: ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a) Analyzing targets. Remaining 0/200 Executing actions. Remaining 0/12856 0.1s exec time total Command: test. Finished 1 local Time elapsed: 17:37.9s Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test ``` ``` Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922) Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared Executing actions. Remaining 0/186 1:05.5s exec time total Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%) Time elapsed: 1:11.5s Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118 Network: Up: 1.0MiB Down: 2.9GiB (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72) Analyzing targets. Remaining 0/5 533 actions, 552 artifacts declared Executing actions. Remaining 0/176 1:22.7s exec time total Command: test. Finished 51 local, 13 remote, 50 cache (44% hit) 19.8s exec time cached (23%) Time elapsed: 1:45.2s Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test [DISABLED] ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test ``` Reviewed By: fduwjj, mradmila Differential Revision: D80188995
…filter in torchrun (#160712) Summary: Part of an effort to extract some important error logs (e.g. [#157996](#157996)) that was `tee`'ed to `stdout` and `stderr`. The general idea is to: - Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively. - In these files, as its name suggests, only log lines matching a customizable filter. - Later on in another PR, append the contents of these files to the reply file. Outline of changes in this PR: - Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter. - Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them. - In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created. Test Plan: ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test ``` ``` Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a) Analyzing targets. Remaining 0/200 Executing actions. Remaining 0/12856 0.1s exec time total Command: test. Finished 1 local Time elapsed: 17:37.9s Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` $ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test ``` ``` Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922) Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared Executing actions. Remaining 0/186 1:05.5s exec time total Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%) Time elapsed: 1:11.5s Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` Rollback Plan: Differential Revision: D80188995 Pull Request resolved: #160712 Approved by: https://github.com/fduwjj
Summary:
In many investigations relating to invalid feature values, the three-argument form of
repeat_interleave
currently prints the following message if there is an inconsistency betweensum(repeats)
andoutput_size
:This is a bit hard for model authors to understand so I made the error slightly more comprehensible. After the fix the stdout contains the actual values of these parameters: https://fburl.com/mlhub/cfyyhh3q
In many cases, this is potentially useful information since we know for example that the difference between the two values above (949687-949487=200) happens to be the lengths of one of the features.
What are my concerns with this change?
__assert_fail
go tostderr
whereasprintf
writes tostdout
. This is not the usual debugging flow where all logs can be found instderr
. I could not find a way to redirectprintf
to stderr or__assert_fail
to stdout__assert_fail
.repeat_interleave
with three arguments seems special in other ways too.Test Plan:
Built an ephemeral package with my changes:
https://www.internalfb.com/intern/servicelab/build/736441058/
Verified that a job with these changes indeed prints out the expected message to stdout: https://fburl.com/mlhub/jgbqk8eg
I will export to GH and run CI/CD tests.
Rollback Plan:
steps:
content: >-
Just reverting this diff should be sufficient. Since this change is in
CUDA kernels, I do not believe there is a way to change the error
message via a JK.
Reviewed By: mradmila
Differential Revision: D77904753