Add option in data loader for out of order data #141833
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141833
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit f8a09cc with merge base 30d907c.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Thanks for adding this feature! After a first pass it looks fine to me, just a couple of small changes/suggestions
Thanks for the quick review @andrewkho! I've made the changes you've suggested.
torch/utils/data/dataloader.py
Outdated
One potential issue here that occurred to me: since work is continuously being distributed to workers in round-robin fashion, we can hit a scenario where one slow or blocked worker can end up holding all the tasks, blocking other tasks from being scheduled.
As a follow-up, when `in_order` is False, we should also distribute work to workers to keep them balanced, since reproducibility is already given up in these scenarios.
Yep! I think that issue currently occurs with `in_order=True` as well (out-of-order tasks are buffered instead of being handed out)?
Would you like this added as a warning and fixed in a later PR, or fixed as part of this PR?
My guess would be to track the number of tasks given to each worker and check here whether it's below the prefetch factor.
I think it'd be fine to do in a separate PR, given this is default-off, although I think we'd like them to land together in the next release. Are you up for doing a separate PR for that? We can brainstorm the best way to get it working reliably.
Yep, I'm very happy to do the follow-up PR!
Please wait for CI to pass and then land
@pytorchbot merge -r

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
Successfully rebased 8370dae to f8a09cc
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 3 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
Ah sorry @andrewkho, I completely forgot that running merge with

Hi @andrewkho - small bump on this. Would you be able to approve the workflows/CI?

@michael-diggin sorry for the delay, just kicked off the workflows, thanks for your patience!
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes pytorch#105203

Facing a similar problem to the linked issue, where variable-sized input data can mean that a handful of slow-to-process samples holds up smaller, faster-to-process samples from being used. This also leads to lower GPU utilization. In certain cases, e.g. evaluation epochs, inference pipelines, or other cases where reproducibility isn't important, this can bring significant speed-ups.

This PR adds an `allow_out_of_order` bool input to the `DataLoader` class, defaulting to `False`. When set to `True`, data is returned from workers in whatever order it is ready/processed in, rather than in strict index order. Instead of storing data that was returned out of order, it is passed directly to the main thread and the entry in `_task_info` is deleted. The main changes are to check that an entry in `_task_info` does exist, and to only increase `self._rcvd_idx` when the lowest remaining index gets returned.

Two tests are added to cover this for iterable-type datasets and index-type datasets.

Pull Request resolved: pytorch#141833
Approved by: https://github.com/andrewkho
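The index-tracking idea in the description above can be sketched in plain Python. This is a simulation for illustration only, not the actual `_MultiProcessingDataLoaderIter` internals; the names `task_info` and `rcvd_idx` mirror the description, and `collect` is a hypothetical helper:

```python
# Simulation of the receive-side logic described above (NOT real
# DataLoader code). With in-order delivery, out-of-order results are
# buffered until the lowest outstanding index arrives; with out-of-order
# delivery, results go straight to the main loop, their task_info entry
# is deleted, and rcvd_idx only advances once the lowest index returns.

def collect(results, in_order=True):
    """results: list of (idx, data) pairs in worker completion order.
    Returns data in the order it would be handed to the main loop."""
    task_info = {idx: None for idx, _ in results}  # outstanding tasks
    buffer = {}   # out-of-order results held back when in_order=True
    rcvd_idx = 0  # lowest index not yet accounted for
    out = []
    for idx, data in results:
        del task_info[idx]
        if in_order:
            buffer[idx] = data
            # drain the buffer in strict index order
            while rcvd_idx in buffer:
                out.append(buffer.pop(rcvd_idx))
                rcvd_idx += 1
        else:
            # pass the result straight through; advance rcvd_idx past
            # every index that has already been returned out of order
            out.append(data)
            while rcvd_idx < len(results) and rcvd_idx not in task_info:
                rcvd_idx += 1
    return out
```

For completion order `[(2, "c"), (0, "a"), (1, "b")]`, in-order collection yields `["a", "b", "c"]` while out-of-order collection yields `["c", "a", "b"]`, which is the trade-off the PR describes: lower latency for fast samples at the cost of strict ordering.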
Fixes #105203 and is a follow-up PR to #141833.

When `in_order` is True (the default), tasks are given out to workers in a round-robin fashion. When `in_order` is False this is no longer needed, as we give up guarantees of reproducibility, and tasks should instead be given to workers that are able to perform work.

In this PR I've added tracking of the number of outstanding tasks for each worker (updated when tasks are added to their queue, and when data is returned to the main thread). When finding the next queue to add a task to, if `in_order` is False the task is only added to a worker's queue if that worker has fewer than `_prefetch_factor` tasks outstanding. The current default behaviour is left as is.

Tests are also updated to assert on the worker IDs for each sample of data returned. I've run the following to confirm they aren't flaky:

```bash
for i in {1..20}; do python test/test_dataloader.py TestOutOfOrderDataLoader; done
```

Pull Request resolved: #142324
Approved by: https://github.com/andrewkho
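The worker-selection rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual DataLoader implementation; `pick_worker` and `tasks_outstanding` are hypothetical names standing in for the per-worker bookkeeping the PR describes:

```python
# Sketch of choosing the next worker when in_order=False (NOT real
# DataLoader code). tasks_outstanding maps worker_id -> number of tasks
# queued but not yet returned; a worker is only eligible for new work
# while it has fewer than prefetch_factor tasks outstanding.

def pick_worker(tasks_outstanding, last_worker, prefetch_factor, num_workers):
    """Return the next eligible worker id, or None if all are saturated.
    Scans starting from the worker after last_worker to keep the
    assignment fair among non-saturated workers."""
    for offset in range(1, num_workers + 1):
        worker = (last_worker + offset) % num_workers
        if tasks_outstanding[worker] < prefetch_factor:
            return worker
    return None  # every worker is at capacity; try again after a result returns
```

With `prefetch_factor=2`, a worker that already holds 2 outstanding tasks (say, because it is stuck on a slow sample) is skipped, so slow workers no longer accumulate tasks that an idle worker could have taken; this is the imbalance the round-robin scheme allowed.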
cc @andrewkho @divyanshk @ssnl @VitalyFedyunin @dzhulgakov