Add option in data loader for out of order data by michael-diggin · Pull Request #141833 · pytorch/pytorch · GitHub

Conversation

@michael-diggin
Contributor

@michael-diggin commented Dec 1, 2024

Fixes #105203

I'm facing a problem similar to the linked issue: with variable-sized input data, a handful of slow-to-process samples can hold up smaller, faster-to-process samples from being used, which also lowers GPU utilization. In certain cases, e.g. evaluation epochs, inference pipelines, or other situations where reproducibility isn't important, this can bring significant speed-ups.

This PR adds an allow_out_of_order bool argument to the DataLoader class, defaulting to False. When set to True, data is returned from workers in whatever order it is ready/processed, rather than in strict index order.
Instead of buffering data that was returned out of order, it is passed directly to the main thread and the entry in _task_info is deleted. The main changes are to check that an entry in _task_info actually exists, and to only increase self._rcvd_idx when the lowest remaining index is returned.

Two tests are added covering this for iterable-style and map-style (indexed) datasets.
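
For context, a minimal usage sketch is shown below. It assumes the option is exposed as a DataLoader keyword argument named in_order (the name used later in this thread and in the follow-up PR); the exact keyword and the toy dataset are illustrative assumptions, not taken from this diff.

```python
# Sketch only: assumes the out-of-order option is exposed as the DataLoader
# keyword `in_order` (name taken from later comments in this thread).
import time

import torch
from torch.utils.data import DataLoader, Dataset


class VariableCostDataset(Dataset):
    """Toy dataset where some samples are much slower to 'process'."""

    def __len__(self):
        return 64

    def __getitem__(self, idx):
        # Every 8th sample simulates a large/slow input.
        time.sleep(0.5 if idx % 8 == 0 else 0.01)
        return torch.tensor(idx)


if __name__ == "__main__":
    loader = DataLoader(
        VariableCostDataset(),
        batch_size=4,
        num_workers=4,
        in_order=False,  # yield batches as soon as any worker has them ready
    )
    for batch in loader:
        # Batches may arrive out of index order; acceptable for evaluation or
        # inference pipelines where ordering/reproducibility is not required.
        print(batch)
```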

cc @andrewkho @divyanshk @ssnl @VitalyFedyunin @dzhulgakov

@pytorch-bot

pytorch-bot bot commented Dec 1, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141833

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f8a09cc with merge base 30d907c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: dataloader release notes category label Dec 1, 2024
Contributor

@andrewkho left a comment


Thanks for adding this feature! After a first pass it looks fine to me, just a couple of small changes/suggestions

@michael-diggin
Contributor Author

> Thanks for adding this feature! After a first pass it looks fine to me, just a couple of small changes/suggestions

Thanks for the quick review @andrewkho! I've made the changes you've suggested.

@andrewkho andrewkho added the module: dataloader Related to torch.utils.data.DataLoader and Sampler label Dec 3, 2024
Contributor


One potential issue here that occurred to me: since work is continuously being distributed to workers in round-robin fashion, we can hit a scenario where one slow or blocked worker can end up holding all the tasks, blocking other tasks from being scheduled.

As a follow up, when in_order is False, we should also distribute work to workers to keep them balanced since reproducibility is already given up in these scenarios.

Contributor Author


Yep! I think that issue already occurs with in_order=True (out-of-order tasks are buffered up instead of being given out)?
Would you like this added as a warning and fixed in a later PR, or fixed as part of this PR?
My guess would be to track the number of tasks given to each worker and check here if it's < the prefetch factor.

Contributor

@andrewkho Dec 3, 2024


I think it'd be fine to do in a separate PR, given this is default-off, although I think we'd like it to land together in the next release. Are you up to do a separate PR for that? We can brainstorm the best way to get it working reliably

Contributor Author


Yep I'm very happy to do the follow up PR!

@andrewkho
Contributor

Please wait for CI to pass and then land

@michael-diggin
Contributor Author

@pytorchbot merge -r

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 4, 2024
@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased out-of-order-dataloader onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout out-of-order-dataloader && git pull --rebase)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 3 mandatory check(s) failed.

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job

Failing merge rule: Core Maintainers

@michael-diggin
Contributor Author

michael-diggin commented Dec 4, 2024

Ah sorry @andrewkho, I completely forgot that running merge with -r would trigger workflows that need your approval. Would you be able to approve them once you get a chance?

@michael-diggin
Contributor Author

Hi @andrewkho - small bump on this. Would you be able to approve the workflows/CI?
There haven't been any changes since you approved the PR, just an accidental rebase that requires rerunning the workflows. Thanks!

@andrewkho
Contributor

@michael-diggin sorry for delay, just kicked off the workflows, thanks for your patience!

@michael-diggin
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

AmdSampsa pushed a commit to AmdSampsa/pytorch that referenced this pull request Dec 9, 2024
Fixes pytorch#105203

I'm facing a problem similar to the linked issue: with variable-sized input data, a handful of slow-to-process samples can hold up smaller, faster-to-process samples from being used, which also lowers GPU utilization. In certain cases, e.g. evaluation epochs, inference pipelines, or other situations where reproducibility isn't important, this can bring significant speed-ups.

This PR adds an `allow_out_of_order` bool argument to the `DataLoader` class, defaulting to `False`. When set to `True`, data is returned from workers in whatever order it is ready/processed, rather than in strict index order.
Instead of buffering data that was returned out of order, it is passed directly to the main thread and the entry in `_task_info` is deleted. The main changes are to check that an entry in `_task_info` actually exists, and to only increase `self._rcvd_idx` when the lowest remaining index is returned.

Two tests are added covering this for iterable-style and map-style (indexed) datasets.

Pull Request resolved: pytorch#141833
Approved by: https://github.com/andrewkho
pytorchmergebot pushed a commit that referenced this pull request Jan 3, 2025
Fixes #105203 and is a follow up PR to #141833

When `in_order` is True (the default), tasks are given out to workers in a round-robin fashion. When `in_order` is False this is no longer needed: we give up guarantees of reproducibility, so tasks should instead be given to workers that are able to perform work.
In this PR I've added tracking of the number of outstanding tasks for each worker (updated when tasks are added to their queue, and when data is returned to the main thread). When finding the next queue to add a task to, if `in_order` is False it will only add the task to a worker's queue if that worker has fewer than `_prefetch_factor` tasks outstanding.
The current default behaviour is left as is.

Tests are also updated to assert on the worker IDs for each sample of data returned.
I've run the following to confirm they aren't flaky:
```bash
for i in {1..20}; do python test/test_dataloader.py TestOutOfOrderDataLoader; done
```

Pull Request resolved: #142324
Approved by: https://github.com/andrewkho
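
To illustrate the balancing behaviour described in that follow-up, here is a rough, self-contained sketch of the idea: track outstanding tasks per worker and, when `in_order` is False, hand the next task to any worker that still has capacity instead of blindly cycling round-robin. The class and attribute names below are simplified stand-ins, not the actual DataLoader internals.

```python
# Simplified sketch of the per-worker task accounting from pytorch#142324.
# Names (Dispatcher, outstanding, pick_worker) are illustrative, not the
# real torch.utils.data internals.


class Dispatcher:
    def __init__(self, num_workers: int, prefetch_factor: int, in_order: bool = True):
        self.num_workers = num_workers
        self.prefetch_factor = prefetch_factor
        self.in_order = in_order
        self.outstanding = [0] * num_workers  # tasks currently queued per worker
        self.next_worker = 0                  # round-robin cursor

    def pick_worker(self):
        """Return the worker id to receive the next task, or None if none has capacity."""
        for offset in range(self.num_workers):
            wid = (self.next_worker + offset) % self.num_workers
            if self.in_order and offset > 0:
                # Strict round-robin: only the next worker in the cycle may be used.
                break
            if self.outstanding[wid] < self.prefetch_factor:
                self.outstanding[wid] += 1
                self.next_worker = (wid + 1) % self.num_workers
                return wid
        return None  # every eligible worker already has prefetch_factor tasks queued

    def task_done(self, wid: int):
        """Called when a worker's result has been returned to the main thread."""
        self.outstanding[wid] -= 1
```

With `in_order=False`, a slow or blocked worker simply stops being offered new tasks once it hits the prefetch limit, while idle workers keep receiving work.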
@michael-diggin deleted the out-of-order-dataloader branch January 22, 2025 20:15