Fix the use of fsspec transactions #135541

alanhdu · 2024-09-09T21:52:22Z

fsspec transactions do not support concurrency and assumes that there is at most 1 running transaction per filesystem. This is not true in our usage, where because of multi-threading we usually have multiple concurrent transactions running at once.

Previously, this would just (unsafely) pass but lead to hard-to-debug race conditions (since the commit of one transaction will blow away the state of the other transaction). In fsspec 2024.3.0, trying to commit concurrent transactions will actually crash (see the code at https://github.com/fsspec/filesystem_spec/blob/76ca4a68885d572880ac6800f079738df562f02c/fsspec/transaction.py#L39 -- because each filesystem can have a single transaction, this tear-down logic will error).

Instead, let's manually handle committing / discarding changes to the file. This does this "the old-fashioned way" instead of using fsspec's commit/rollback behavior because the internal PathManagerFileSystem used for iopath does not properly support that behavior.

I don't have a minimal test-case, but in Meta this solves a broken test on fsspec >= 2024.3.0:

Before: https://www.internalfb.com/intern/testinfra/testrun/7318349626774607
After: https://www.internalfb.com/intern/testinfra/testrun/2251800062722633

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @soulitzer @LucasLLC @MeetVadakkanchery @mhorowitz @pradeepfn

linux-foundation-easycla · 2024-09-09T21:52:26Z

The committers listed above are authorized under a signed CLA.

✅ login: alanhdu / name: Alan Du (920608a, 0618bb5)

pytorch-bot · 2024-09-09T21:52:26Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135541

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

[DomainsOnly] Jobs fail with GLIBC version not found

✅ No Failures

As of commit 0618bb5 with merge base 34e4205 ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

alanhdu · 2024-09-09T21:57:36Z

/easycla

Skylion007 · 2024-09-10T13:47:23Z

Do we have an issue filed with fsspec, they would probably look into a fix for this (but this is a great temp workaround). If so, please link the issue in the comment.

Skylion007

Left some comments.

Skylion007 · 2024-09-10T13:48:05Z

torch/distributed/checkpoint/_fsspec_filesystem.py

Technically should use 'os.fspath' instead of str but should only matter for more exotic PathLike objects.

You could also merge both these with statements into one if you wanted to reduce indentation.

Skylion007 · 2024-09-10T13:49:29Z

torch/distributed/checkpoint/_fsspec_filesystem.py

Does this actually solve the problem or does it need to be a deep copy?

Not sure a shallow copy would be safe depending on how transactions are implemented on the backend for the specific fsspec endpoint.

On second thought, I think this is likely safe only if the file system wasn't already in transaction from a 3rd party lib etc. We could add an assert on that boolean, but that wouldn't fully protect from the race condition.

Actually, yeah reading the above comment this copy seems fine as it's only a race on the same stream.

Skylion007 · 2024-09-10T13:53:13Z

Looks like fixing the typing in this file also allowed flake8 and ruff to properly parse the types and find a few import issues.

fegin · 2024-09-10T15:34:31Z

@pradeepfn Please take a look at this PR.

alanhdu · 2024-09-10T16:55:11Z

@Skylion007 : re the copy, I've updated the implementation to be (hopefully) a little less dependent on the exact implementation -- here, we directly use fs._open and handle the commit + rollback manually. This relies on a private part of the interface (AbstractFileSystem._open), but that might be better than relying on the internal details of how transactions are implemented wrt copy?

Let me know what you think -- I'm happy to remove the last commit if you think it'd be cleaner without it.

Skylion007 · 2024-09-10T18:43:57Z

@martindurant Any thoughts on which approach is best?

martindurant · 2024-09-10T18:50:16Z

It is true that fsspec transactions were not designed with parallelism in mind, nor are reentrant. It would take some effort to correct this.

Before getting into details, can someone comment on the purpose of transactions in general? Any file can be opened with autocommit=False, and then be cancelled or committed, so you can manage a container of these however you like.

alanhdu · 2024-09-10T19:51:55Z

Before getting into details, can someone comment on the purpose of transactions in general? Any file can be opened with autocommit=False, and then be cancelled or committed, so you can manage a container of these however you like.

I admit I don't fully understand the code, but my read from the initial PR (#112191) is just to ensure there aren't half-written files on some kind of failure. So we actually need any concurrency support.

The code I pushed does what you suggest, manually using autocommit=False + commit/discard (although there's a bit of ceremony involved it in actually getting it to work).

martindurant · 2024-09-10T19:54:56Z

The code I pushed does what you suggest

Perfect :). Then you shouldn't need to copy any filesystem instances.

torch/distributed/checkpoint/_fsspec_filesystem.py

pradeepfn · 2024-09-11T18:25:41Z

torch/distributed/checkpoint/_fsspec_filesystem.py

Can you help me what is happening here?
It seems like we don't have to handle the same discard/commit semantics in binary mode (?). Can you help me understand why? ( or a link to appropriate reading). thanks.

It seems like we don't have to handle the same discard/commit semantics in binary mode (?).

The basic issue is that fs._open is only guaranteed to be implemented for binary mode -- so for text mode we just (recursively) get the binary stream (which has the discard + commit semantics) and then wraps it in an io.TextIOWrapper. The recursion ensures that the binary stream also has the same commit + discard semantics.

Does that help?

This is right. It would have made sense for fsspec to wrap (heh) TextIOWrapper for "t" modes, to provide the same commit/discard methods.

pradeepfn · 2024-09-11T18:42:30Z

During CP save, the file-system writer makes an atomic operation ( rename) on the checkpoint's .metadata file to mark the checkpoint as a valid checkpoint. This happens in coordinator rank and is not a parallel operation.

Given FSSpecWriter uses the same logic, the atomicity/crash-consistency of the whole checkpoint is guaranteed by the .metadata atomic write.

Therefore, I don't understand why we need atomic writes for individual write-items/tensors ( which happens parallely and breaks the fs-spec TX as per the PR description).

@fegin @LucasLLC Can you help me understand. thanks.

Skylion007 · 2024-09-12T14:23:56Z

During CP save, the file-system writer makes an atomic operation ( rename) on the checkpoint's .metadata file to mark the checkpoint as a valid checkpoint. This happens in coordinator rank and is not a parallel operation.

Given FSSpecWriter uses the same logic, the atomicity/crash-consistency of the whole checkpoint is guaranteed by the .metadata atomic write.

Therefore, I don't understand why we need atomic writes for individual write-items/tensors ( which happens parallely and breaks the fs-spec TX as per the PR description).

@fegin @LucasLLC Can you help me understand. thanks.

Actually, rename is not guaranteed to be an atomic operation on all fsspec backends (it's a copy + delete on S3FS for example). It is atomic on some backends like ADLFS, but not on others. Fsspec.transaction is also not implemented for all backends either, btw: so wrapping it in a transaction block will not guarantee it's atomic either. If you don't care about the deletion, I guess you could say it is atomic though.

alanhdu · 2024-09-24T18:00:17Z

@Skylion007 @pradeepfn Do you have any suggestions on how to proceed here? Without this PR, our code is currently breaking with more recent fsspec, so I'm eager to try to get this merged.

If we think that the per-file crash-consistency is unnecessary, I'd be happy to just remove it entirely as well.

kwen2501 · 2024-09-28T18:48:07Z

Ping @pradeepfn @fegin @LucasLLC

alanhdu · 2024-10-18T16:41:26Z

Sorry to bug you again, but @pradeepfn @fegin @LucasLLC, but I'm interested in figuring out what the right path forward here is. I would like to be able to use newer versions of fsspec, which we currently cannot do with PyTorch right now.

Is there something else that needs to happen to get this fix merged?

facebook-github-bot · 2024-11-07T18:01:19Z

@alanhdu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

alanhdu · 2024-11-07T18:01:50Z

Sorry to bug you again, but @Skylion007 @pradeepfn @wz337 could I get another review? I have changed the implementation strategy (to be much simpler) so that things will work with internal fsspec implementations (like Manifold).

I am still open to just removing the rollback behavior entirely.

fsspec does not currently support concurrent transactions, and more recent versions make this an error. Since we are only doing single-file transactions, for now we can just manually try to delete the file (if necessary) after errors.

facebook-github-bot · 2024-11-19T21:21:08Z

@alanhdu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

alanhdu · 2024-11-19T21:22:47Z

Another ping @Skylion007 @pradeepfn @wz337 about the review. I am particularly interested about whether (at this point) we should just not have this per-file rollback attempt at all (since it wasn't even working before for filesystems that don't implement transactions, like Manifold).

facebook-github-bot · 2024-11-21T21:24:24Z

@alanhdu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

alanhdu · 2024-11-21T23:55:19Z

@pytorchbot merge

pytorchmergebot · 2024-11-21T23:57:23Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

fsspec transactions do not support concurrency and assumes that there is at most 1 running transaction per filesystem. This is *not* true in our usage, where because of multi-threading we usually have multiple concurrent transactions running at once. Previously, this would just (unsafely) pass but lead to hard-to-debug race conditions (since the commit of one transaction will blow away the state of the other transaction). In fsspec 2024.3.0, trying to commit concurrent transactions will actually crash (see the code at https://github.com/fsspec/filesystem_spec/blob/76ca4a68885d572880ac6800f079738df562f02c/fsspec/transaction.py#L39 -- because each filesystem can have a single transaction, this tear-down logic will error). Instead, let's manually handle committing / discarding changes to the file. I don't have a minimal test-case, but in Meta this solves a broken test on `fsspec >= 2024.3.0`: Before: https://www.internalfb.com/intern/testinfra/testrun/7318349626774607 After: https://www.internalfb.com/intern/testinfra/testrun/2251800062722633 Pull Request resolved: pytorch#135541 Approved by: https://github.com/Skylion007

This reverts commit 59cf4bc. Reverted pytorch#135541 on behalf of https://github.com/ZainRizvi due to Breaking internally. See D65551490 ([comment](pytorch#135541 (comment)))

fsspec transactions do not support concurrency and assumes that there is at most 1 running transaction per filesystem. This is *not* true in our usage, where because of multi-threading we usually have multiple concurrent transactions running at once. Previously, this would just (unsafely) pass but lead to hard-to-debug race conditions (since the commit of one transaction will blow away the state of the other transaction). In fsspec 2024.3.0, trying to commit concurrent transactions will actually crash (see the code at https://github.com/fsspec/filesystem_spec/blob/76ca4a68885d572880ac6800f079738df562f02c/fsspec/transaction.py#L39 -- because each filesystem can have a single transaction, this tear-down logic will error). Instead, let's manually handle committing / discarding changes to the file. This does this "the old-fashioned way" instead of using `fsspec`'s commit/rollback behavior because the internal PathManagerFileSystem used for `iopath` does not properly support that behavior. I don't have a minimal test-case, but in Meta this solves a broken test on `fsspec >= 2024.3.0`: Before: https://www.internalfb.com/intern/testinfra/testrun/7318349626774607 After: https://www.internalfb.com/intern/testinfra/testrun/2251800062722633 Pull Request resolved: pytorch#135541 Approved by: https://github.com/Skylion007

pytorch-bot bot added module: distributed_checkpoint oncall: distributed Add this issue/PR to distributed oncall triage queue labels Sep 9, 2024

alanhdu requested a review from wz337 September 9, 2024 21:58

Skylion007 reviewed Sep 10, 2024

View reviewed changes

wz337 requested a review from pradeepfn September 10, 2024 14:53

alanhdu force-pushed the alandu/fsspec branch from 01956ae to ec7498c Compare September 10, 2024 16:45

Skylion007 reviewed Sep 11, 2024

View reviewed changes

torch/distributed/checkpoint/_fsspec_filesystem.py Outdated Show resolved Hide resolved

Skylion007 reviewed Sep 11, 2024

View reviewed changes

torch/distributed/checkpoint/_fsspec_filesystem.py Outdated Show resolved Hide resolved

Skylion007 approved these changes Sep 11, 2024

View reviewed changes

pradeepfn reviewed Sep 11, 2024

View reviewed changes

kwen2501 added the topic: bug fixes topic category label Sep 28, 2024

kwen2501 added module: activation checkpointing Related to activation checkpointing release notes: distributed (checkpoint) labels Sep 28, 2024

alanhdu force-pushed the alandu/fsspec branch from fb5d66a to 307ccf0 Compare October 25, 2024 08:53

alanhdu requested review from Skylion007 and pradeepfn November 7, 2024 18:00

alanhdu force-pushed the alandu/fsspec branch from 6124fb8 to 5e50485 Compare November 19, 2024 21:19

Do not use a concurrent fsspec transaction

920608a

fsspec does not currently support concurrent transactions, and more recent versions make this an error. Since we are only doing single-file transactions, for now we can just manually try to delete the file (if necessary) after errors.

alanhdu force-pushed the alandu/fsspec branch from 5e50485 to 920608a Compare November 19, 2024 21:20

Fix lintrunner errors

0618bb5

pytorchmergebot added the merging label Nov 21, 2024

pytorchmergebot closed this in f869a0f Nov 22, 2024

pytorchmergebot removed the merging label Nov 22, 2024

alanhdu deleted the alandu/fsspec branch August 9, 2025 00:34

Fix the use of fsspec transactions #135541

Fix the use of fsspec transactions #135541

Conversation

alanhdu commented Sep 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

linux-foundation-easycla bot commented Sep 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot bot commented Sep 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135541

❗ 1 Active SEVs

✅ No Failures

Uh oh!

alanhdu commented Sep 9, 2024

Uh oh!

Skylion007 commented Sep 10, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Skylion007 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Skylion007 commented Sep 10, 2024

Uh oh!

fegin commented Sep 10, 2024

Uh oh!

alanhdu commented Sep 10, 2024

Uh oh!

Skylion007 commented Sep 10, 2024

Uh oh!

martindurant commented Sep 10, 2024

Uh oh!

alanhdu commented Sep 10, 2024

Uh oh!

martindurant commented Sep 10, 2024

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alanhdu Sep 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pradeepfn commented Sep 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Skylion007 commented Sep 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alanhdu commented Sep 24, 2024

Uh oh!

kwen2501 commented Sep 28, 2024

Uh oh!

alanhdu commented Oct 18, 2024

Uh oh!

facebook-github-bot commented Nov 7, 2024

Uh oh!

alanhdu commented Nov 7, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

facebook-github-bot commented Nov 19, 2024

Uh oh!

alanhdu commented Nov 19, 2024

alanhdu commented Sep 9, 2024 •

edited

Loading

linux-foundation-easycla bot commented Sep 9, 2024 •

edited

Loading

pytorch-bot bot commented Sep 9, 2024 •

edited

Loading

Skylion007 commented Sep 10, 2024 •

edited

Loading

alanhdu Sep 11, 2024 •

edited

Loading

pradeepfn commented Sep 11, 2024 •

edited

Loading

Skylion007 commented Sep 12, 2024 •

edited

Loading

alanhdu commented Nov 7, 2024 •

edited

Loading