
"test_events_wait" is flaky on ROCm #62602

@NivekT

🐛 Bug

test_events_wait under pytorch-linux-bionic-rocm4.2-py3.6 seems to be a flaky test, based on my discussion with @mruberry.

To Reproduce

Steps to reproduce the behavior:

  1. This issue is occasionally reproduced when test_events_wait is run on Jenkins.

Expected behavior

The test should consistently pass or fail.

Environment

The test is run with Jenkins in the automated test environment.

Additional context

Here is an excerpt of the console log. An example of the full log is here.

15:45:13 FAIL [0.093s]: test_events_wait (__main__.TestCuda)
15:45:13 ----------------------------------------------------------------------
15:45:13 Traceback (most recent call last):
15:45:13   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1071, in wrapper
15:45:13     method(*args, **kwargs)
15:45:13   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1071, in wrapper
15:45:13     method(*args, **kwargs)
15:45:13   File "test_cuda.py", line 1168, in test_events_wait
15:45:13     self.assertTrue(s0.query())
15:45:13 AssertionError: False is not true
15:45:13 
15:45:14 ----------------------------------------------------------------------
15:45:14 Ran 164 tests in 43.224s
15:45:14 
15:45:14 FAILED (failures=1, skipped=22)
15:45:14 
15:45:14 Generating XML reports...
15:45:14 Generated XML report: test-reports/python-unittest/test_cuda/TEST-TestCuda-20210802154430.xml
15:45:14 Generated XML report: test-reports/python-unittest/test_cuda/TEST-TestCudaComm-20210802154430.xml
15:45:16 Traceback (most recent call last):
15:45:16   File "test/run_test.py", line 1092, in <module>
15:45:16     main()
15:45:16   File "test/run_test.py", line 1071, in main
15:45:16     raise RuntimeError(err_message)
15:45:16 RuntimeError: test_cuda failed!
15:45:17 
15:45:17 real	10m30.614s
15:45:17 user	12m59.565s
15:45:17 sys	6m10.787s
15:45:17 + cleanup
15:45:17 + retcode=1
15:45:17 + set +x
15:45:17 =================== sccache compilation log ===================
15:45:17 =========== If your build fails, please take a look at the log above for possible reasons ===========
15:45:17 Compile requests                 0
15:45:17 Compile requests executed        0
15:45:17 Cache hits                       0
15:45:17 Cache misses                     0
15:45:17 Cache timeouts                   0
15:45:17 Cache read errors                0
15:45:17 Forced recaches                  0
15:45:17 Cache write errors               0
15:45:17 Compilation failures             0
15:45:17 Cache errors                     0
15:45:17 Non-cacheable compilations       0
15:45:17 Non-cacheable calls              0
15:45:17 Non-compilation calls            0
15:45:17 Unsupported compiler calls       0
15:45:17 Average cache write          0.000 s
15:45:17 Average cache read miss      0.000 s
15:45:17 Average cache read hit       0.000 s
15:45:17 Cache location             Local disk: "/var/lib/jenkins/.cache/sccache"
15:45:17 Cache size                       0 bytes
15:45:17 Max cache size                  10 GiB
15:45:17 Stopping sccache server...
15:45:17 Compile requests                 0
15:45:17 Compile requests executed        0
15:45:17 Cache hits                       0
15:45:17 Cache misses                     0
15:45:17 Cache timeouts                   0
15:45:17 Cache read errors                0
15:45:17 Forced recaches                  0
15:45:17 Cache write errors               0
15:45:17 Compilation failures             0
15:45:17 Cache errors                     0
15:45:17 Non-cacheable compilations       0
15:45:17 Non-cacheable calls              0
15:45:17 Non-compilation calls            0
15:45:17 Unsupported compiler calls       0
15:45:17 Average cache write          0.000 s
15:45:17 Average cache read miss      0.000 s
15:45:17 Average cache read hit       0.000 s
15:45:17 Cache location             Local disk: "/var/lib/jenkins/.cache/sccache"
15:45:17 Cache size                       0 bytes
15:45:17 Max cache size                  10 GiB
15:45:17 + echo 'Stopping container...'
15:45:17 Stopping container...
15:45:17 + '[' -n '' ']'
15:45:17 + docker rm -f 0f5b4d4e01941a4f7246cef9efdc5352fc45a104cac4a94b64b7a0e998da4c0e
15:45:18 Build step 'Execute shell' marked build as failure
15:45:18 [xUnit] [INFO] - Starting to record.
15:45:18 [xUnit] [INFO] - Processing JUnit
15:45:18 [xUnit] [INFO] - [JUnit] - No test report file(s) were found with the pattern 'test-*.xml' relative to '/var/lib/jenkins/workspace/pytorch-builds/pytorch-linux-bionic-rocm4.2-py3.6-test1' for the testing framework 'JUnit'.  Did you enter a pattern relative to the correct directory?  Did you generate the result report(s) for 'JUnit'?
15:45:18 [xUnit] [WARNING] - No test reports found for the metric 'JUnit' with the resolved pattern 'test-*.xml'.
15:45:18 [xUnit] [INFO] - Skipping the metric tool processing.
15:45:18 [xUnit] [INFO] - There are errors when processing test results.
15:45:18 [xUnit] [INFO] - Skipping tests recording.
15:45:18 [BFA] Scanning build for known causes...
15:45:18 [BFA] No failure causes found
15:45:18 [BFA] Done. 0s
15:45:18 Finished: FAILURE
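In a log of this length the only line that matters for triage is the `FAIL [...]` summary near the top. As a rough sketch, assuming the log keeps the timestamped `unittest` format shown above, the failing tests can be extracted with a regular expression. The function name `failed_tests` is hypothetical.

```python
import re

# Matches unittest summary lines such as
# "15:45:13 FAIL [0.093s]: test_events_wait (__main__.TestCuda)".
# The leading timestamp and the bracketed duration are both optional.
_FAIL_LINE = re.compile(
    r"^(?:\d{2}:\d{2}:\d{2}\s+)?(FAIL|ERROR)(?: \[[^\]]*\])?: (\w+) \(([\w.]+)\)"
)


def failed_tests(log_text):
    """Return (kind, test_name, test_class) tuples found in a console log."""
    found = []
    for line in log_text.splitlines():
        match = _FAIL_LINE.match(line)
        if match:
            found.append((match.group(1), match.group(2), match.group(3)))
    return found
```

The final `FAILED (failures=1, skipped=22)` line is deliberately not matched, because it lacks the `: test_name (class)` part that identifies an individual test.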

cc @jeffdaily @sunway513 @jithunnair-amd @ROCmSupport

Labels

  - module: flaky-tests (Problem is a flaky test in CI)
  - module: rocm (AMD GPU support for Pytorch)
  - triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
