Remove dtype check on meta device by ge0405 · Pull Request #136774 · pytorch/pytorch · GitHub

Conversation

@ge0405
Contributor

@ge0405 ge0405 commented Sep 26, 2024

Summary:

Latest Update

This diff is no longer needed: it turns out the check does need to exist so that meta behaves the same as other devices; see D54526190.


Background

T176105639

| case | embedding bag weight | per_sample_weight | fbgemm lookup | forward in meta |
| --- | --- | --- | --- | --- |
| A | fp32 | fp32 | good | good |
| B | fp16 | fp32 | good | failed [check](https://fburl.com/code/k3n3h031) that forces weight dtype == per_sample_weights dtype |
| C | fp16 | fp16 | P1046999270, RuntimeError: "expected scalar type Float but found Half from fbgemm call" | good |
| D | fp32 | fp16 | N/A | N/A |

Currently we are in case A. Users need to add `use_fp32_embedding` in training to force the embedding bag dtype to be fp32. However, users actually want case B, with fp16 as the embedding bag weight. When they delete `use_fp32_embedding`, they fail the [check](https://fburl.com/code/k3n3h031) in meta_registration that forces `weight dtype == per_sample_weights dtype`.

The check is actually not necessary, because the fbgemm backend does support case B. Additionally, later on in `meta_embedding_bag`, `weight` and `per_sample_weights` don't need to have the same dtype for `is_fast_path_index_select` (https://fburl.com/code/q0tho05h; weight is src, per_sample_weights is scale).
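For concreteness, here is a minimal repro sketch of case B on the meta device. The shapes, sizes, and values are made up for illustration; only the dtype combination matters:

```python
import torch
import torch.nn.functional as F

# Case B: fp16 embedding bag weight, fp32 per_sample_weights, on the meta device.
weight = torch.empty(10, 4, dtype=torch.float16, device="meta")
indices = torch.zeros(6, dtype=torch.int64, device="meta")
offsets = torch.tensor([0, 3], dtype=torch.int64, device="meta")
per_sample_weights = torch.empty(6, dtype=torch.float32, device="meta")

# Before this change, the meta registration raised here because
# weight.dtype != per_sample_weights.dtype, even though fbgemm handles this
# combination on real devices. With the check removed, this returns a meta
# tensor with one row per bag (shape (2, 4) in this sketch).
out = F.embedding_bag(
    indices, weight, offsets, mode="sum", per_sample_weights=per_sample_weights
)
print(out.shape, out.dtype)
```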

This diff

Therefore, this diff removes the unnecessary [check](https://fburl.com/code/k3n3h031) to support case B in the meta forward. With this change, users can use fp16 as the embedding bag dtype without forcing per_sample_weights to the same dtype in the meta forward (see Test Plan).
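For reference, the removed assertion is roughly of the following shape. This is an illustrative paraphrase only, not the verbatim PyTorch source:

```python
import torch
from typing import Optional

def _per_sample_weights_dtype_check_sketch(
    weight: torch.Tensor, per_sample_weights: Optional[torch.Tensor]
) -> None:
    # Paraphrase of the check this diff removes: the meta registration insisted
    # that per_sample_weights share the embedding bag weight's dtype, which
    # rejects case B (fp16 weight, fp32 per_sample_weights) on the meta device.
    if per_sample_weights is not None:
        torch._check(
            weight.dtype == per_sample_weights.dtype,
            lambda: f"expected weight ({weight.dtype}) and per_sample_weights "
            f"({per_sample_weights.dtype}) to have the same dtype",
        )
```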

Reference diffs to resolve this issue

Diff 1: D52591217
This passes the embedding bag dtype to feature_processor so that per_sample_weights gets the same dtype as the embedding bag weight. However, `is_meta` also needs to be passed because of case C: fbgemm still does not support per_sample_weights = fp16 (see the table above), so users are forced to make per_sample_weights fp16 only when it is on meta. The solution requires too many hacks.

Diff 2: D53232739
This does basically the same thing as diff 1 (D52591217), except that the hack is added in the TorchRec library: an `if` in EBC and PEA that forces per_sample_weights to fp16 whenever the embedding bag weight is fp16. However, this runs into the fbgemm issue as well and has broken a bunch of prod models.
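For illustration, the Diff 2 approach amounts to something like the following helper. This is a hypothetical sketch of the workaround, not the actual TorchRec code:

```python
import torch
from typing import Optional

def _maybe_cast_per_sample_weights(
    weight: torch.Tensor, per_sample_weights: Optional[torch.Tensor]
) -> Optional[torch.Tensor]:
    # Hypothetical sketch of the Diff 2 workaround: when the embedding bag
    # weight is fp16, cast per_sample_weights to fp16 as well. As noted above,
    # this lands in case C, where fbgemm rejects fp16 per_sample_weights,
    # which is why the workaround broke prod models.
    if per_sample_weights is not None and weight.dtype == torch.float16:
        return per_sample_weights.to(weight.dtype)
    return per_sample_weights
```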

Test Plan:

APS

The following command runs icvr_launcher, which triggers ads_launcher and runs forward on the meta device:

buck2 run mode/opt -c python.package_style=inplace //aps_models/ads/icvr:icvr_launcher_publish -- mode=mast_ig_fm_when_combo0_uhm_publish launcher.fbl_entitlement=ads_global_tc_ads_score launcher.data_project=oncall_ads_model_platform launcher.tags=[ads_ranking_taxonomy_exlarge_fm_prod] stages.train=false

Result:
{F1461463993}

Reviewed By: ezyang

Differential Revision: D54175438

@pytorch-bot

pytorch-bot bot commented Sep 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136774

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit c2a37a2 with merge base 69bcf10:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D54175438

@github-actions
Contributor

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D54175438

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 12, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D54175438

@facebook-github-bot
Contributor

@pytorchbot merge -f 'Landed internally'

(Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here


Labels

ciflow/trunk (Trigger trunk jobs on your pull request), fb-exported, Merged
