KEMBAR78
Fix the use of fsspec transactions by alanhdu · Pull Request #135541 · pytorch/pytorch · GitHub
Skip to content

Conversation

@alanhdu
Copy link
Contributor

@alanhdu alanhdu commented Sep 9, 2024

fsspec transactions do not support concurrency and assumes that there is at most 1 running transaction per filesystem. This is not true in our usage, where because of multi-threading we usually have multiple concurrent transactions running at once.

Previously, this would just (unsafely) pass but lead to hard-to-debug race conditions (since the commit of one transaction will blow away the state of the other transaction). In fsspec 2024.3.0, trying to commit concurrent transactions will actually crash (see the code at https://github.com/fsspec/filesystem_spec/blob/76ca4a68885d572880ac6800f079738df562f02c/fsspec/transaction.py#L39 -- because each filesystem can have a single transaction, this tear-down logic will error).

Instead, let's manually handle committing / discarding changes to the file. This does this "the old-fashioned way" instead of using fsspec's commit/rollback behavior because the internal PathManagerFileSystem used for iopath does not properly support that behavior.

I don't have a minimal test-case, but in Meta this solves a broken test on fsspec >= 2024.3.0:

Before: https://www.internalfb.com/intern/testinfra/testrun/7318349626774607
After: https://www.internalfb.com/intern/testinfra/testrun/2251800062722633

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @soulitzer @LucasLLC @MeetVadakkanchery @mhorowitz @pradeepfn

@pytorch-bot pytorch-bot bot added module: distributed_checkpoint oncall: distributed Add this issue/PR to distributed oncall triage queue labels Sep 9, 2024
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Sep 9, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@pytorch-bot
Copy link

pytorch-bot bot commented Sep 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135541

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

✅ No Failures

As of commit 0618bb5 with merge base 34e4205 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@alanhdu
Copy link
Contributor Author

alanhdu commented Sep 9, 2024

/easycla

@alanhdu alanhdu requested a review from wz337 September 9, 2024 21:58
@Skylion007
Copy link
Collaborator

Skylion007 commented Sep 10, 2024

Do we have an issue filed with fsspec, they would probably look into a fix for this (but this is a great temp workaround). If so, please link the issue in the comment.

Copy link
Collaborator

@Skylion007 Skylion007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically should use 'os.fspath' instead of str but should only matter for more exotic PathLike objects.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also merge both these with statements into one if you wanted to reduce indentation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this actually solve the problem or does it need to be a deep copy?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure a shallow copy would be safe depending on how transactions are implemented on the backend for the specific fsspec endpoint.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, I think this is likely safe only if the file system wasn't already in transaction from a 3rd party lib etc. We could add an assert on that boolean, but that wouldn't fully protect from the race condition.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, yeah reading the above comment this copy seems fine as it's only a race on the same stream.

@Skylion007
Copy link
Collaborator

Looks like fixing the typing in this file also allowed flake8 and ruff to properly parse the types and find a few import issues.

@wz337 wz337 requested a review from pradeepfn September 10, 2024 14:53
@fegin
Copy link
Contributor

fegin commented Sep 10, 2024

@pradeepfn Please take a look at this PR.

@alanhdu
Copy link
Contributor Author

alanhdu commented Sep 10, 2024

@Skylion007 : re the copy, I've updated the implementation to be (hopefully) a little less dependent on the exact implementation -- here, we directly use fs._open and handle the commit + rollback manually. This relies on a private part of the interface (AbstractFileSystem._open), but that might be better than relying on the internal details of how transactions are implemented wrt copy?

Let me know what you think -- I'm happy to remove the last commit if you think it'd be cleaner without it.

@Skylion007
Copy link
Collaborator

@martindurant Any thoughts on which approach is best?

@martindurant
Copy link

It is true that fsspec transactions were not designed with parallelism in mind, nor are reentrant. It would take some effort to correct this.

Before getting into details, can someone comment on the purpose of transactions in general? Any file can be opened with autocommit=False, and then be cancelled or committed, so you can manage a container of these however you like.

@alanhdu
Copy link
Contributor Author

alanhdu commented Sep 10, 2024

Before getting into details, can someone comment on the purpose of transactions in general? Any file can be opened with autocommit=False, and then be cancelled or committed, so you can manage a container of these however you like.

I admit I don't fully understand the code, but my read from the initial PR (#112191) is just to ensure there aren't half-written files on some kind of failure. So we actually need any concurrency support.

The code I pushed does what you suggest, manually using autocommit=False + commit/discard (although there's a bit of ceremony involved it in actually getting it to work).

@martindurant
Copy link

The code I pushed does what you suggest

Perfect :). Then you shouldn't need to copy any filesystem instances.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help me what is happening here?
It seems like we don't have to handle the same discard/commit semantics in binary mode (?). Can you help me understand why? ( or a link to appropriate reading). thanks.

Copy link
Contributor Author

@alanhdu alanhdu Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we don't have to handle the same discard/commit semantics in binary mode (?).

The basic issue is that fs._open is only guaranteed to be implemented for binary mode -- so for text mode we just (recursively) get the binary stream (which has the discard + commit semantics) and then wraps it in an io.TextIOWrapper. The recursion ensures that the binary stream also has the same commit + discard semantics.

Does that help?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is right. It would have made sense for fsspec to wrap (heh) TextIOWrapper for "t" modes, to provide the same commit/discard methods.

@pradeepfn
Copy link
Contributor

pradeepfn commented Sep 11, 2024

During CP save, the file-system writer makes an atomic operation ( rename) on the checkpoint's .metadata file to mark the checkpoint as a valid checkpoint. This happens in coordinator rank and is not a parallel operation.

Given FSSpecWriter uses the same logic, the atomicity/crash-consistency of the whole checkpoint is guaranteed by the .metadata atomic write.

Therefore, I don't understand why we need atomic writes for individual write-items/tensors ( which happens parallely and breaks the fs-spec TX as per the PR description).

@fegin @LucasLLC Can you help me understand. thanks.

@Skylion007
Copy link
Collaborator

Skylion007 commented Sep 12, 2024

During CP save, the file-system writer makes an atomic operation ( rename) on the checkpoint's .metadata file to mark the checkpoint as a valid checkpoint. This happens in coordinator rank and is not a parallel operation.

Given FSSpecWriter uses the same logic, the atomicity/crash-consistency of the whole checkpoint is guaranteed by the .metadata atomic write.

Therefore, I don't understand why we need atomic writes for individual write-items/tensors ( which happens parallely and breaks the fs-spec TX as per the PR description).

@fegin @LucasLLC Can you help me understand. thanks.

Actually, rename is not guaranteed to be an atomic operation on all fsspec backends (it's a copy + delete on S3FS for example). It is atomic on some backends like ADLFS, but not on others. Fsspec.transaction is also not implemented for all backends either, btw: so wrapping it in a transaction block will not guarantee it's atomic either. If you don't care about the deletion, I guess you could say it is atomic though.

@alanhdu
Copy link
Contributor Author

alanhdu commented Sep 24, 2024

@Skylion007 @pradeepfn Do you have any suggestions on how to proceed here? Without this PR, our code is currently breaking with more recent fsspec, so I'm eager to try to get this merged.

If we think that the per-file crash-consistency is unnecessary, I'd be happy to just remove it entirely as well.

@kwen2501 kwen2501 added the topic: bug fixes topic category label Sep 28, 2024
@kwen2501
Copy link
Contributor

Ping @pradeepfn @fegin @LucasLLC

@alanhdu
Copy link
Contributor Author

alanhdu commented Oct 18, 2024

Sorry to bug you again, but @pradeepfn @fegin @LucasLLC, but I'm interested in figuring out what the right path forward here is. I would like to be able to use newer versions of fsspec, which we currently cannot do with PyTorch right now.

Is there something else that needs to happen to get this fix merged?

@facebook-github-bot
Copy link
Contributor

@alanhdu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@alanhdu
Copy link
Contributor Author

alanhdu commented Nov 7, 2024

Sorry to bug you again, but @Skylion007 @pradeepfn @wz337 could I get another review? I have changed the implementation strategy (to be much simpler) so that things will work with internal fsspec implementations (like Manifold).

I am still open to just removing the rollback behavior entirely.

fsspec does not currently support concurrent transactions, and
more recent versions make this an error. Since we are only doing
single-file transactions, for now we can just manually try to
delete the file (if necessary) after errors.
@facebook-github-bot
Copy link
Contributor

@alanhdu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@alanhdu
Copy link
Contributor Author

alanhdu commented Nov 19, 2024

Another ping @Skylion007 @pradeepfn @wz337 about the review. I am particularly interested about whether (at this point) we should just not have this per-file rollback attempt at all (since it wasn't even working before for filesystems that don't implement transactions, like Manifold).

@facebook-github-bot
Copy link
Contributor

@alanhdu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@alanhdu
Copy link
Contributor Author

alanhdu commented Nov 21, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
fsspec transactions do not support concurrency and assumes that there is at most 1 running transaction per filesystem. This is *not* true in our usage, where because of multi-threading we usually have multiple concurrent transactions running at once.

Previously, this would just (unsafely) pass but lead to hard-to-debug race conditions (since the commit of one transaction will blow away the state of the other transaction). In fsspec 2024.3.0, trying to commit concurrent transactions will actually crash (see the code at https://github.com/fsspec/filesystem_spec/blob/76ca4a68885d572880ac6800f079738df562f02c/fsspec/transaction.py#L39 -- because each filesystem can have a single transaction, this tear-down logic will error).

Instead, let's manually handle committing / discarding changes to the file.

I don't have a minimal test-case, but in Meta this solves a broken test on `fsspec >= 2024.3.0`:

Before: https://www.internalfb.com/intern/testinfra/testrun/7318349626774607
After: https://www.internalfb.com/intern/testinfra/testrun/2251800062722633

Pull Request resolved: pytorch#135541
Approved by: https://github.com/Skylion007
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
This reverts commit 59cf4bc.

Reverted pytorch#135541 on behalf of https://github.com/ZainRizvi due to Breaking internally. See D65551490 ([comment](pytorch#135541 (comment)))
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
fsspec transactions do not support concurrency and assumes that there is at most 1 running transaction per filesystem. This is *not* true in our usage, where because of multi-threading we usually have multiple concurrent transactions running at once.

Previously, this would just (unsafely) pass but lead to hard-to-debug race conditions (since the commit of one transaction will blow away the state of the other transaction). In fsspec 2024.3.0, trying to commit concurrent transactions will actually crash (see the code at https://github.com/fsspec/filesystem_spec/blob/76ca4a68885d572880ac6800f079738df562f02c/fsspec/transaction.py#L39 -- because each filesystem can have a single transaction, this tear-down logic will error).

Instead, let's manually handle committing / discarding changes to the file. This does this "the old-fashioned way" instead of using `fsspec`'s commit/rollback behavior because the internal PathManagerFileSystem used for `iopath` does not properly support that behavior.

I don't have a minimal test-case, but in Meta this solves a broken test on `fsspec >= 2024.3.0`:

Before: https://www.internalfb.com/intern/testinfra/testrun/7318349626774607
After: https://www.internalfb.com/intern/testinfra/testrun/2251800062722633

Pull Request resolved: pytorch#135541
Approved by: https://github.com/Skylion007
@alanhdu alanhdu deleted the alandu/fsspec branch August 9, 2025 00:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-no-td Do not run TD on this PR ciflow/trunk Trigger trunk jobs on your pull request Merged module: activation checkpointing Related to activation checkpointing oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (checkpoint) Reverted topic: bug fixes topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants