added zbv_algorithm by haocizhang · Pull Request #138444 · pytorch/pytorch · GitHub

Conversation

@haocizhang
Contributor

@haocizhang haocizhang commented Oct 21, 2024

Added the ZBV algorithm to the pipeline-parallel (pp) schedules. See https://arxiv.org/pdf/2401.10241 for details.
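For readers skimming the diff, the core placement idea behind ZB-V is that each pipeline rank holds two model chunks arranged in a "V" shape: chunk-0 stages run down the ranks and chunk-1 stages run back up. A minimal sketch of that mapping (the helper name is illustrative, not part of this PR):

```python
def zbv_stage_ids(rank: int, pp_size: int) -> list:
    """Return the two stage ids held by `rank` under ZB-V's V-shaped
    placement: chunk 0's stages are 0..pp_size-1 going down the ranks,
    and chunk 1's stages fold back up, so rank r holds stages
    r and 2*pp_size - 1 - r."""
    return [rank, 2 * pp_size - 1 - rank]
```

For example, with 4 ranks, rank 0 holds stages 0 and 7 (the first and last stages of the pipeline), which is what lets the last backward finish on the same rank that ran the first forward.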

Tested the schedule using `python test_schedules.py`.

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot

pytorch-bot bot commented Oct 21, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138444

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit ecbcbb4 with merge base e7ec294:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the `oncall: distributed` label Oct 21, 2024
@haocizhang
Contributor Author

@pytorchbot label "release notes: distributed (pipeline)"

@pytorch-bot pytorch-bot bot added the `release notes: distributed (pipeline)` label Oct 22, 2024
@haocizhang
Contributor Author

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the `topic: not user facing` label Oct 22, 2024
@haocizhang haocizhang force-pushed the zbv_algo branch 2 times, most recently from c24df0c to 4f576b9 on October 22, 2024 04:00
count = []
for i in range(pipeline_parallel_size):
count.append([0] * 6)
fbw_mem = [39, -7, -32]
Contributor

Could you add a comment? What are these numbers?

def get_compute_schedule(pipeline_parallel_size, num_microbatches):
n_node = 6 * pipeline_parallel_size * num_microbatches

def get_id(cat, chunk, rank, micro):
Contributor

Could you add a comment explaining what 'id' is for?
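For context, the id here appears to be a flat index over (category, chunk, rank, microbatch), which is also where the 6 in `n_node` comes from (3 op categories F/B/W times 2 chunks). A hedged sketch of that mixed-radix computation (names are illustrative, not the PR's exact code):

```python
def flat_node_id(cat, chunk, rank, micro, pp_size, num_microbatches):
    # Flatten (cat, chunk, rank, micro) into a unique index in
    # [0, 6 * pp_size * num_microbatches): 3 categories (F/B/W)
    # times 2 chunks times pp_size ranks times num_microbatches.
    return ((cat * 2 + chunk) * pp_size + rank) * num_microbatches + micro
```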

compute_schedules = {}

def get_compute_schedule(pipeline_parallel_size, num_microbatches):
n_node = 6 * pipeline_parallel_size * num_microbatches
Contributor

Could you add a comment explaining what the 6 represents?

schedule[i] = []
stage_str = [" " * i for i in range(pipeline_parallel_size)]
approved_bubble = [-1] * pipeline_parallel_size
max_approved_bubble = max(approved_bubble)
Contributor

is max_approved_bubble always -1?

approved_bubble = [-1] * pipeline_parallel_size
max_approved_bubble = max(approved_bubble)

def get_max_rank_bubble(rank=-1):
Contributor

I'm pretty confused by this helper. It looks like it should do something complex, but all of its inputs seem static, so it looks like it should return 0 (or some other constant) all the time.

_tmp = _no_bubble = cur_time[rank] + 1
_cnt = count[rank][cat * 2 + chunk]
stage_str[rank] += (
"FfBbWw"[cat * 2 + chunk]
Contributor

what is this?

end_time[_id] = _tmp
cur_time[rank] = _tmp
mem[rank] += fbw_mem[cat]
# noinspection PyTypeChecker
Contributor

We actually should not disable the type checker; we should add the mypy hints everywhere and make sure they're correct. That helps catch some bugs.

_, chunk_, _ = pending_w[rank].popleft()
put(2, chunk_, rank)

def put(cat, chunk, rank, assert_cnt=True):
Contributor

It would help to have a description of what 'cat' means, especially. Maybe 'chunk' is intuitive, but I don't know yet.

@haocizhang haocizhang force-pushed the zbv_algo branch 2 times, most recently from d43b4aa to f00c129 on November 11, 2024 19:54
Member

@H-Huang H-Huang left a comment

_get_zbv_schedule() is confusing to me; I need some time to read over the paper and digest this implementation more.

Member

Can you include a short description of the differences between ZBV and the existing InterleavedZeroBubble schedule?

Member

What are the differences between this fn and

def _add_bubbles_to_actions(self, num_stages_global):
, could you comment it?

Contributor Author

Removed the duplicated function.

Contributor

@H-Huang IIUC this function is only needed so that the zbv schedule can be compatible with the preexisting PP runtime in PipelineScheduleMulti. If we consolidate on the newer PipelineScheduleRuntime class, we don't have to add bubbles anymore and we can simplify our schedule IR generation.

Contributor

@haocizhang did you rebase? A couple of weeks ago I landed some PRs to support the dW/dI runner, and I also renamed the IR; I think it's BACKWARD_WEIGHT, BACKWARD_INPUT, FULL_BACKWARD now.

Contributor Author

Yeah, I realized I didn't rebase :) Rebased and updated the PR.

I: 1,
W: 2,
}
chunk_0 = 0
Contributor

Does ZBV hardcode that there are exactly 2 model chunks per PP rank? If so, should we make that an explicit assertion?

@H-Huang
Member

H-Huang commented Nov 12, 2024

cc @ufotalent who is one of the zero bubble paper authors.

This PR implements the ZBV variant of zero bubble:
[figure: ZB-V schedule diagram from the paper]

Is there a simpler heuristic we can use to guide the ordering of F-B-W for each device, regardless of the number of ranks and stages?

category = category_map[op]
# Number of ops (F/B/W) with the same (action, chunk) on current rank
_op_count = ops_count[rank][op][chunk]
if chunk == chunk_1 and op in (F, I):
Contributor

Maybe these two ifs would make for a nice helper function whose name would further clarify their purpose. IIUC, the idea here is to figure out the earliest time this op/chunk can run on the current rank, given the time its dependency is scheduled on another rank?

"get_earliest_time_based_on_dependency(op, chunk, rank)"
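A sketch of what that suggested helper might look like (the one-tick `comm_delay` and all names are assumptions, not this PR's code):

```python
def get_earliest_time_based_on_dependency(cur_time, end_time, rank, dep_id, comm_delay=1):
    # An op can start once its rank is free; if it also has a cross-rank
    # dependency (dep_id >= 0), it must additionally wait for that node
    # to finish plus one communication hop.
    earliest = cur_time[rank] + 1
    if dep_id >= 0:
        earliest = max(earliest, end_time[dep_id] + comm_delay + 1)
    return earliest
```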

num_chunks = 2
n_node = len(category_map) * num_chunks * pipeline_parallel_size * num_microbatches

def get_id(op, chunk, rank, microbatch_id):
Contributor

IIUC, the purpose of this helper is to hash the op uniquely? I think you could just key a dict directly off the op itself and achieve the same thing.
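For comparison, the dict-keying alternative the comment suggests could look like this (a sketch, not the PR's actual representation):

```python
# Key the timing/bookkeeping dicts directly off the op tuple instead of
# a hand-rolled flat id; tuples of small values hash cheaply and the
# keys stay self-describing.
end_time = {}
key = ("F", 0, 1, 3)  # (op, chunk, rank, microbatch)
end_time[key] = 7
```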

# For BACKWARD and WEIGHT operation, we will schedule chunk 1 before 0 so
# inversing the order before adding to the schedule
temp_chunk = chunk if op == F else 1 - chunk
schedule[rank].append((_op_count, op, temp_chunk))
Contributor

nit: could you directly construct an _Action() object here as you append, instead of creating a similar-but-not-identical representation? On the other hand, I understand that for the logic here it is convenient to refer to the local chunk ID (0, 1) rather than the global stage_id, which combines the local chunk_id with the rank. IIUC, _op_count is the same as microbatch_id?


fbw_mem = [3, -1, -2]
max_mem = 3 * (pipeline_parallel_size * 2)
end_time = [-1] * n_node
Contributor

As I read through the code below, I realize I'm confused between cur_time and end_time. Maybe it will become more clear.
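One hedged reading of the fbw_mem excerpt above (units are arbitrary activation slots): F allocates 3, B frees 1, and W frees the remaining 2, so one full F+B+W round per microbatch is memory-neutral and peak memory is driven by how many forwards are in flight:

```python
fbw_mem = [3, -1, -2]  # assumed memory deltas for F, B, W respectively

# A full F + B + W round nets out to zero: the schedule's memory
# accounting relies on B and W together releasing what F allocated.
assert sum(fbw_mem) == 0

# Peak usage is set by how many forwards are outstanding before their
# matching B/W run; four outstanding forwards occupy 12 slots here.
mem = 0
peak = 0
for delta in [3, 3, 3, 3, -1, -2, -1, -2, -1, -2, -1, -2]:
    mem += delta
    peak = max(peak, mem)
```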

put(F, chunk_1, cur_rank)

iter_chunk_ = 0
# Ensure forward operation synchronization across pipeline stages
Contributor

@wconstab wconstab Nov 12, 2024

Define 'synchronization'? At first I thought this was aiming to ensure the same number of microbatches per chunk, but that's not quite what the logic below does; it seems more like ensuring the same number of actions per rank, but not necessarily the same number of chunk0 vs chunk1.

Edit: OK, I was confused because I only looked at the 'while' logic, but the if condition in the for loop does look like it ensures all F's are scheduled for both chunks.

+ ops_count[current_rank][F][chunk_1]
< ops_count[previous_rank][F][chunk_0]
+ ops_count[previous_rank][F][chunk_1]
or ops_count[current_rank][F][chunk_1]
Contributor

The logic after the `or` is confusing to me. Why would a 'previous rank' ever have more chunk1's scheduled than the current rank?

for rank in range(pipeline_parallel_size):
chunk_0_ops = ops_count[rank][I][chunk_0]
chunk_1_ops = ops_count[rank][I][chunk_1]
if chunk_1_ops >= chunk_0_ops:
Contributor

chunk_1_ops == chunk_0_ops doesn't make sense to me here. Only on the last rank would a chunk1 op directly unblock a chunk0 op. For other ranks, shouldn't the chunk0-ready logic depend on the 'dependency_id' condition below?

# Schedule backward operations for each rank
for rank, chunk in scheduled_ranks:
dependency_id = -1
if chunk == chunk_1 and rank < pipeline_parallel_size - 1:
Contributor

I must be confusing myself, but chunk1 on rank0 would be the first chunk1 B to run, wouldn't it? So then chunk1 on rank1 would have a dependency on chunk1 on rank0, meaning dependency_id should use rank - 1 instead of rank + 1? And vice versa for the chunk0 logic below?

@QPHutu

QPHutu commented Nov 13, 2024

cc @ufotalent who is one of the zero bubble paper authors.

This PR implements the ZBV variant of zero bubble: [figure: ZB-V schedule diagram from the paper]

Is there a simpler heuristic we can use to guide the ordering of F-B-W for each device, regardless of the number of ranks and stages?

@H-Huang IIUC, you only want a handcrafted ZB-V schedule here. If so, you don't need any heuristic/greedy methods; a deterministic rule/pattern can be used to directly generate the ZB-V schedule.

In the zero-bubble paper, we implemented a complicated greedy method based on profiled $$T_F, T_B, T_W$$ to minimize the bubble caused by the inequality of these running times. However, in this code, $$T_F, T_B, T_W$$ are hardcoded as 1, which means you don't need to implement our greedy method; a specific pattern should work for you. Please refer to this handcrafted ZB-V implementation.

Additionally, we have another implementation of ZB-V in another paper (NeurIPS 2024) and its code, which is conceptually simpler than our previous greedy method. If you also want an adaptive version that takes running times/memories as inputs, maybe we can help simplify the implementation (the current implementation also supports other schedules like V-Half).
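To make "a deterministic rule/pattern" concrete, a handcrafted schedule can simply be materialized as a fixed per-rank action list rather than the output of a search. The sketch below is illustrative only: the `Action` type and the warmup rule are assumptions for exposition, not the paper's rule or PyTorch's schedule IR.

```python
from collections import namedtuple

# Illustrative IR: one action per (op, chunk, microbatch) on a rank.
Action = namedtuple("Action", ["op", "chunk", "microbatch"])

def chunk0_warmup(rank, pp_size):
    # Plausible warmup under T_F = T_B = T_W = 1: earlier ranks must
    # queue more chunk-0 forwards up front, since their first chunk-1
    # work can only arrive after activations travel down the pipeline
    # and back.  This is a sketch, not the exact ZB-V warmup count.
    return [Action("F", 0, mb) for mb in range(pp_size - rank)]
```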

QPHutu added a commit to sail-sg/zero-bubble-pipeline-parallelism that referenced this pull request Nov 13, 2024
To support ZB-V in native pytorch

pytorch/pytorch#138444
QPHutu added a commit to sail-sg/zero-bubble-pipeline-parallelism that referenced this pull request Nov 13, 2024
QPHutu added a commit to sail-sg/zero-bubble-pipeline-parallelism that referenced this pull request Nov 14, 2024
H-Huang added a commit that referenced this pull request Dec 10, 2024
Adds ZBV schedule which is explained in https://arxiv.org/pdf/2401.10241, Section 6. Tested it works under the new PipelineScheduleRuntime by fixing a small bug in handling V-shaped schedules. This PR is a replacement for #138444

cc the original authors: QPHutu ufotalent #138444 (comment)

cc awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Dec 11, 2024
Adds ZBV schedule which is explained in https://arxiv.org/pdf/2401.10241, Section 6. Tested it works under the new PipelineScheduleRuntime by fixing a small bug in handling V-shaped schedules. This PR is a replacement for #138444

cc the original authors: @QPHutu @ufotalent #138444 (comment)

Pull Request resolved: #142084
Approved by: https://github.com/kwen2501
@github-actions
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the `Stale` label Jan 12, 2025
@github-actions github-actions bot closed this Feb 11, 2025
@github-actions github-actions bot deleted the zbv_algo branch March 14, 2025 02:07