Add variable batch size support to TBE training #1752
Conversation
This pull request was exported from Phabricator. Differential Revision: D42663369
Summary:
Pull Request resolved: pytorch#1752

This diff adds support for variable batch size (or variable length) in split TBE training on GPU (the extension is called "VBE").

VBE is enabled for the following use case:
- split (`SplitTableBatchedEmbeddingBagsCodegen`), and
- pooled (`pooling_mode != PoolingMode.NONE`), and
- weighted/unweighted, and
- rowwise Adagrad optimizer (`optimizer == OptimType.EXACT_ROWWISE_ADAGRAD`)

Important note: this feature is enabled for a specific use case in order to keep the binary size of the FBGEMM library within limits.

**Usage:**

```
# Initialize TBE the same way as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ...  # other params
)

# Batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1, 2, 8, 3],   # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg,
# because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

**Output format**

{F982891369}

**Limitation:** `T` and `max_B` have to fit in 32 bits.
- We use the lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`). Supported `max_B` = `2^info_B_num_bits`.
- We use the upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`). Supported `T` = `2^(32 - info_B_num_bits)`.

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`. If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D42663369

fbshipit-source-id: d613b0a9ced838e3ae8b421a1e5a30de8b158e69
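To make the usage above concrete, here is a minimal, hypothetical sketch of how `indices` and `offsets` could be shaped for the example `batch_size_per_feature_per_rank`. The fixed pooling factor `L`, the table size, and the random indices are illustrative assumptions, not taken from this PR; `emb_op` is the module constructed in the snippet above.

```
import torch

# Per-feature, per-rank batch sizes from the example above.
batch_size_per_feature_per_rank = [
    [1, 2, 8, 3],   # Feature 0
    [6, 10, 3, 5],  # Feature 1
]

# Total number of bags across all features and ranks.
total_B = sum(sum(bs) for bs in batch_size_per_feature_per_rank)  # 14 + 24 = 38

# Assume a fixed pooling factor L for illustration; real inputs can be ragged.
L = 2
num_embeddings = 1000  # hypothetical table size

# offsets has one entry per bag plus a trailing terminator; indices holds the
# embedding rows looked up by each bag, laid out back to back.
offsets = torch.arange(0, total_B * L + 1, L, dtype=torch.int64)
indices = torch.randint(0, num_embeddings, (total_B * L,), dtype=torch.int64)

# Forward with the variable-batch keyword argument.
output = emb_op(
    indices,
    offsets,
    batch_size_per_feature_per_rank=batch_size_per_feature_per_rank,
)
```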
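The 32-bit limitation can also be illustrated with a small, self-contained sketch of how a table ID `t` (upper bits) and a bag ID `b` (lower bits) might share one 32-bit word. The helper names below are hypothetical and the kernel's actual encoding may differ in detail; the point is only that `b < 2^info_B_num_bits` and `t < 2^(32 - info_B_num_bits)` must both hold.

```
def pack_info(t: int, b: int, info_B_num_bits: int) -> int:
    """Pack table ID t into the upper bits and bag ID b into the lower bits."""
    assert 0 <= b < (1 << info_B_num_bits), "bag ID does not fit in info_B_num_bits"
    assert 0 <= t < (1 << (32 - info_B_num_bits)), "table ID does not fit in the upper bits"
    return (t << info_B_num_bits) | b


def unpack_info(info: int, info_B_num_bits: int) -> tuple:
    """Recover (t, b) from a packed 32-bit value."""
    b_mask = (1 << info_B_num_bits) - 1
    return info >> info_B_num_bits, info & b_mask


# Example: with info_B_num_bits = 26, up to 2^26 bags and 2^6 tables are representable.
info_B_num_bits = 26
info = pack_info(t=3, b=1234, info_B_num_bits=info_B_num_bits)
assert unpack_info(info, info_B_num_bits) == (3, 1234)
```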
This pull request has been merged in 05bf018.