-
Notifications
You must be signed in to change notification settings - Fork 25.7k
Fix the use of fsspec transactions #135541
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135541
Note: Links to docs will display an error until the docs builds have been completed. ❗ 1 Active SEVsThere are 1 currently active SEVs. If your PR is affected, please view them below: ✅ No FailuresAs of commit 0618bb5 with merge base 34e4205 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
/easycla |
|
Do we have an issue filed with fsspec, they would probably look into a fix for this (but this is a great temp workaround). If so, please link the issue in the comment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically should use 'os.fspath' instead of str but should only matter for more exotic PathLike objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could also merge both these with statements into one if you wanted to reduce indentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this actually solve the problem or does it need to be a deep copy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure a shallow copy would be safe depending on how transactions are implemented on the backend for the specific fsspec endpoint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second thought, I think this is likely safe only if the file system wasn't already in transaction from a 3rd party lib etc. We could add an assert on that boolean, but that wouldn't fully protect from the race condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, yeah reading the above comment this copy seems fine as it's only a race on the same stream.
|
Looks like fixing the typing in this file also allowed flake8 and ruff to properly parse the types and find a few import issues. |
|
@pradeepfn Please take a look at this PR. |
01956ae to
ec7498c
Compare
|
@Skylion007 : re the copy, I've updated the implementation to be (hopefully) a little less dependent on the exact implementation -- here, we directly use Let me know what you think -- I'm happy to remove the last commit if you think it'd be cleaner without it. |
|
@martindurant Any thoughts on which approach is best? |
|
It is true that fsspec transactions were not designed with parallelism in mind, nor are reentrant. It would take some effort to correct this. Before getting into details, can someone comment on the purpose of transactions in general? Any file can be opened with autocommit=False, and then be cancelled or committed, so you can manage a container of these however you like. |
I admit I don't fully understand the code, but my read from the initial PR (#112191) is just to ensure there aren't half-written files on some kind of failure. So we actually need any concurrency support. The code I pushed does what you suggest, manually using |
Perfect :). Then you shouldn't need to copy any filesystem instances. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you help me what is happening here?
It seems like we don't have to handle the same discard/commit semantics in binary mode (?). Can you help me understand why? ( or a link to appropriate reading). thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like we don't have to handle the same discard/commit semantics in binary mode (?).
The basic issue is that fs._open is only guaranteed to be implemented for binary mode -- so for text mode we just (recursively) get the binary stream (which has the discard + commit semantics) and then wraps it in an io.TextIOWrapper. The recursion ensures that the binary stream also has the same commit + discard semantics.
Does that help?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is right. It would have made sense for fsspec to wrap (heh) TextIOWrapper for "t" modes, to provide the same commit/discard methods.
|
During CP save, the file-system writer makes an atomic operation ( rename) on the checkpoint's .metadata file to mark the checkpoint as a valid checkpoint. This happens in coordinator rank and is not a parallel operation. Given FSSpecWriter uses the same logic, the atomicity/crash-consistency of the whole checkpoint is guaranteed by the .metadata atomic write. Therefore, I don't understand why we need atomic writes for individual write-items/tensors ( which happens parallely and breaks the fs-spec TX as per the PR description). |
Actually, rename is not guaranteed to be an atomic operation on all fsspec backends (it's a copy + delete on S3FS for example). It is atomic on some backends like ADLFS, but not on others. Fsspec.transaction is also not implemented for all backends either, btw: so wrapping it in a transaction block will not guarantee it's atomic either. If you don't care about the deletion, I guess you could say it is atomic though. |
|
@Skylion007 @pradeepfn Do you have any suggestions on how to proceed here? Without this PR, our code is currently breaking with more recent fsspec, so I'm eager to try to get this merged. If we think that the per-file crash-consistency is unnecessary, I'd be happy to just remove it entirely as well. |
|
Ping @pradeepfn @fegin @LucasLLC |
|
Sorry to bug you again, but @pradeepfn @fegin @LucasLLC, but I'm interested in figuring out what the right path forward here is. I would like to be able to use newer versions of fsspec, which we currently cannot do with PyTorch right now. Is there something else that needs to happen to get this fix merged? |
fb5d66a to
307ccf0
Compare
|
@alanhdu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
Sorry to bug you again, but @Skylion007 @pradeepfn @wz337 could I get another review? I have changed the implementation strategy (to be much simpler) so that things will work with internal fsspec implementations (like Manifold). I am still open to just removing the rollback behavior entirely. |
6124fb8 to
5e50485
Compare
fsspec does not currently support concurrent transactions, and more recent versions make this an error. Since we are only doing single-file transactions, for now we can just manually try to delete the file (if necessary) after errors.
5e50485 to
920608a
Compare
|
@alanhdu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
Another ping @Skylion007 @pradeepfn @wz337 about the review. I am particularly interested about whether (at this point) we should just not have this per-file rollback attempt at all (since it wasn't even working before for filesystems that don't implement transactions, like Manifold). |
|
@alanhdu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
fsspec transactions do not support concurrency and assumes that there is at most 1 running transaction per filesystem. This is *not* true in our usage, where because of multi-threading we usually have multiple concurrent transactions running at once. Previously, this would just (unsafely) pass but lead to hard-to-debug race conditions (since the commit of one transaction will blow away the state of the other transaction). In fsspec 2024.3.0, trying to commit concurrent transactions will actually crash (see the code at https://github.com/fsspec/filesystem_spec/blob/76ca4a68885d572880ac6800f079738df562f02c/fsspec/transaction.py#L39 -- because each filesystem can have a single transaction, this tear-down logic will error). Instead, let's manually handle committing / discarding changes to the file. I don't have a minimal test-case, but in Meta this solves a broken test on `fsspec >= 2024.3.0`: Before: https://www.internalfb.com/intern/testinfra/testrun/7318349626774607 After: https://www.internalfb.com/intern/testinfra/testrun/2251800062722633 Pull Request resolved: pytorch#135541 Approved by: https://github.com/Skylion007
This reverts commit 59cf4bc. Reverted pytorch#135541 on behalf of https://github.com/ZainRizvi due to Breaking internally. See D65551490 ([comment](pytorch#135541 (comment)))
fsspec transactions do not support concurrency and assumes that there is at most 1 running transaction per filesystem. This is *not* true in our usage, where because of multi-threading we usually have multiple concurrent transactions running at once. Previously, this would just (unsafely) pass but lead to hard-to-debug race conditions (since the commit of one transaction will blow away the state of the other transaction). In fsspec 2024.3.0, trying to commit concurrent transactions will actually crash (see the code at https://github.com/fsspec/filesystem_spec/blob/76ca4a68885d572880ac6800f079738df562f02c/fsspec/transaction.py#L39 -- because each filesystem can have a single transaction, this tear-down logic will error). Instead, let's manually handle committing / discarding changes to the file. This does this "the old-fashioned way" instead of using `fsspec`'s commit/rollback behavior because the internal PathManagerFileSystem used for `iopath` does not properly support that behavior. I don't have a minimal test-case, but in Meta this solves a broken test on `fsspec >= 2024.3.0`: Before: https://www.internalfb.com/intern/testinfra/testrun/7318349626774607 After: https://www.internalfb.com/intern/testinfra/testrun/2251800062722633 Pull Request resolved: pytorch#135541 Approved by: https://github.com/Skylion007
fsspec transactions do not support concurrency and assumes that there is at most 1 running transaction per filesystem. This is not true in our usage, where because of multi-threading we usually have multiple concurrent transactions running at once.
Previously, this would just (unsafely) pass but lead to hard-to-debug race conditions (since the commit of one transaction will blow away the state of the other transaction). In fsspec 2024.3.0, trying to commit concurrent transactions will actually crash (see the code at https://github.com/fsspec/filesystem_spec/blob/76ca4a68885d572880ac6800f079738df562f02c/fsspec/transaction.py#L39 -- because each filesystem can have a single transaction, this tear-down logic will error).
Instead, let's manually handle committing / discarding changes to the file. This does this "the old-fashioned way" instead of using
fsspec's commit/rollback behavior because the internal PathManagerFileSystem used foriopathdoes not properly support that behavior.I don't have a minimal test-case, but in Meta this solves a broken test on
fsspec >= 2024.3.0:Before: https://www.internalfb.com/intern/testinfra/testrun/7318349626774607
After: https://www.internalfb.com/intern/testinfra/testrun/2251800062722633
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @soulitzer @LucasLLC @MeetVadakkanchery @mhorowitz @pradeepfn