ps sparse rpc by gcramer23 · Pull Request #58003 · pytorch/pytorch · GitHub

gcramer23 (Contributor) commented May 10, 2021

Stack from ghstack:

adds trainer class DdpTrainer
adds trainer class DdpSparseRpcTrainer
adds server class ParameterServerBase
adds server class AverageParameterServer
adds experiment ddp_cpu_sparse_rpc_nccl_allreduce
adds experiment ddp_cuda_sparse_rpc_nccl_allreduce

quip document https://fb.quip.com/iQUtAeKIxWpF

Differential Revision: D29379696

[ghstack-poisoned]
facebook-github-bot added the oncall: distributed and cla signed labels May 10, 2021

facebook-github-bot (Contributor) commented May 10, 2021

💊 CI failures summary and remediations

As of commit 0b479e7 (more details on the Dr. CI page and at hud.pytorch.org/pr/58003):


  • 3/3 failures possibly* introduced in this PR
    • 1/3 non-scanned failure(s)

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (1/2)

Step: "(Optional) Merge target branch"

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .azure_pipelines/pytorch-tests-pipeline.yml
Auto-merging .azure_pipelines/pytorch-tests-pipeline.yml
CONFLICT (add/add): Merge conflict in .azure_pipelines/job_templates/wheel-wait-template.yml
Auto-merging .azure_pipelines/job_templates/wheel-wait-template.yml
CONFLICT (add/add): Merge conflict in .azure_pipelines/job_templates/wheel-wait-job-template.yml
Auto-merging .azure_pipelines/job_templates/wheel-wait-job-template.yml
CONFLICT (add/add): Merge conflict in .azure_pipelines/job_templates/pytorch-template-win.yml
Auto-merging .azure_pipelines/job_templates/pytorch-template-win.yml
CONFLICT (add/add): Merge conflict in .azure_pipelines/job_templates/pytorch-template-unix.yml
Auto-merging .azure_pipelines/job_templates/pytorch-template-unix.yml
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (2/2)

Step: "(Optional) Merge target branch"

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .azure_pipelines/pytorch-tests-pipeline.yml
Auto-merging .azure_pipelines/pytorch-tests-pipeline.yml
CONFLICT (add/add): Merge conflict in .azure_pipelines/job_templates/wheel-wait-template.yml
Auto-merging .azure_pipelines/job_templates/wheel-wait-template.yml
CONFLICT (add/add): Merge conflict in .azure_pipelines/job_templates/wheel-wait-job-template.yml
Auto-merging .azure_pipelines/job_templates/wheel-wait-job-template.yml
CONFLICT (add/add): Merge conflict in .azure_pipelines/job_templates/pytorch-template-win.yml
Auto-merging .azure_pipelines/job_templates/pytorch-template-win.yml
CONFLICT (add/add): Merge conflict in .azure_pipelines/job_templates/pytorch-template-unix.yml
Auto-merging .azure_pipelines/job_templates/pytorch-template-unix.yml
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1


ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

gcramer23 added a commit that referenced this pull request May 10, 2021
ghstack-source-id: 68d91b0
Pull Request resolved: #58003
[ghstack-poisoned]
gcramer23 added a commit that referenced this pull request May 11, 2021
ghstack-source-id: 5569bbe
Pull Request resolved: #58003
@gcramer23 gcramer23 mentioned this pull request May 11, 2021
[ghstack-poisoned]
dgl-intel pushed a commit to dgl-intel/pytorch that referenced this pull request May 12, 2021
ghstack-source-id: b4f2a86
Pull Request resolved: pytorch#58003
gcramer23 added 2 commits May 13, 2021 03:20
[ghstack-poisoned]
[ghstack-poisoned]
@gcramer23 gcramer23 marked this pull request as ready for review May 14, 2021 01:03
gcramer23 (Contributor Author) left a comment

I tested earlier with workarounds in place, and everything passed. I can't test again until the pull request for the lock is merged and the issue for the RuntimeError is closed.

    ):
        super().__init__(rank)

        self.lock = threading.Lock()

gcramer23 (Contributor Author) commented

Pull request to fix the issue with the instance lock: #57943

A contributor replied:

No need to change anything, just FYI. As a temporary solution, IIUC, we can add a __getstate__ to avoid pickling RRefs.

def __getstate__(self):
    return {}
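
As a rough illustration of the workaround discussed here: Python's pickle protocol calls `__getstate__` (singular, taking `self`) to decide what gets serialized, so an object that owns a `threading.Lock` can leave the lock out of its pickled state. The sketch below is not code from this PR. `LockHolder` is a hypothetical stand-in for a server class, and unlike the empty-dict version above it keeps the remaining attributes and recreates the lock on unpickling, which is an assumption about what a caller would want.

```python
import pickle
import threading


class LockHolder:
    """Hypothetical stand-in for a server object that owns a lock."""

    def __init__(self, rank):
        self.rank = rank
        self.lock = threading.Lock()

    def __getstate__(self):
        # Copy the instance dict and drop the lock, which cannot be pickled.
        state = self.__dict__.copy()
        del state["lock"]
        return state

    def __setstate__(self, state):
        # Restore the picklable attributes and create a fresh lock.
        self.__dict__.update(state)
        self.lock = threading.Lock()


# Round-trip through pickle; the restored object gets its own new lock.
restored = pickle.loads(pickle.dumps(LockHolder(rank=0)))
```

Returning `{}` as in the comment above is the bluntest form of the same idea: nothing is serialized, so nothing unpicklable can be reached.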

@gcramer23 gcramer23 requested review from H-Huang and mrshenli May 14, 2021 01:08
adds base trainer class RpcTrainerBase
adds trainer class DdpSparseRpcTrainer
adds base server classes ParameterServerBase and AverageParameterServerBase
adds server class AverageParameterServer
adds experiment ddp_cpu_sparse_rpc_nccl_allreduce
adds experiment ddp_cuda_sparse_rpc_nccl_allreduce

quip document https://fb.quip.com/iQUtAeKIxWpF

[ghstack-poisoned]
gcramer23 (Contributor Author) commented May 23, 2021

Issue for sparse tensor storage access when using CUDA RPC: #58755

EDIT: pull request for the bug: #59609

gcramer23 added 3 commits May 23, 2021 00:23
adds base trainer class RpcTrainerBase
adds trainer class DdpSparseRpcTrainer
adds base server classes ParameterServerBase and AverageParameterServerBase
adds server class AverageParameterServer
adds experiment ddp_cpu_sparse_rpc_nccl_allreduce
adds experiment ddp_cuda_sparse_rpc_nccl_allreduce

quip document https://fb.quip.com/iQUtAeKIxWpF

[ghstack-poisoned]
adds trainer class DdpTrainer
adds trainer class DdpRpcTrainer
adds trainer class DdpSparseRpcTrainer
adds base server classes ParameterServerBase
adds server class AverageParameterServer
adds experiment ddp_cpu_sparse_rpc_nccl_allreduce
adds experiment ddp_cuda_sparse_rpc_nccl_allreduce

quip document https://fb.quip.com/iQUtAeKIxWpF

[ghstack-poisoned]
gcramer23 added 2 commits May 29, 2021 05:17
adds trainer class DdpTrainer
adds trainer class DdpRpcTrainer
adds trainer class DdpSparseRpcTrainer
adds base server classes ParameterServerBase
adds server class AverageParameterServer
adds experiment ddp_cpu_sparse_rpc_nccl_allreduce
adds experiment ddp_cuda_sparse_rpc_nccl_allreduce

quip document https://fb.quip.com/iQUtAeKIxWpF

[ghstack-poisoned]
@gcramer23 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot (Contributor) commented

@gcramer23 merged this pull request in 4ed2d5d.

@facebook-github-bot facebook-github-bot deleted the gh/gcramer23/9/head branch June 28, 2021 14:17
asuhan pushed a commit to asuhan/pytorch that referenced this pull request Jun 28, 2021
Summary:
Pull Request resolved: pytorch#58003

adds trainer class DdpTrainer
adds trainer class DdpSparseRpcTrainer
adds server class ParameterServerBase
adds server class AverageParameterServer
adds experiment ddp_cpu_sparse_rpc_nccl_allreduce
adds experiment ddp_cuda_sparse_rpc_nccl_allreduce

quip document https://fb.quip.com/iQUtAeKIxWpF

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29379696

Pulled By: gcramer23

fbshipit-source-id: 9cf5fb7398ba2fa3eb694afbddc4ed00d97f205f
asuhan pushed a commit that referenced this pull request Jun 30, 2021

Labels

cla signed, Merged, oncall: distributed

4 participants