Add optimized TBE training forward #1641
This pull request was exported from Phabricator. Differential Revision: D43634651
Summary: Pull Request resolved: pytorch#1641

This diff adds an optimized implementation of the TBE training forward pass, namely `split_embedding_codegen_forward_[weighted|unweighted]_v2_kernel`. The implementation currently supports only a subset of TBE use cases, including:

- Split TBE (`SplitTableBatchedEmbeddingBagsCodegen`)
- Pooled TBE (`pooling_mode`: `PoolingMode.SUM`, `PoolingMode.MEAN`)
- Weighted and unweighted TBE (`per_sample_weights`: `Tensor`, `None`)
- FP32 and FP16 weight types (`weights_precision`: `SparseType.FP32`, `SparseType.FP16`)
- FP32 and FP16 output types (`output_dtype`: `SparseType.FP32`, `SparseType.FP16`)
- Device, managed, and managed-caching embedding locations (`EmbeddingLocation`: `EmbeddingLocation.DEVICE`, `EmbeddingLocation.MANAGED`, `EmbeddingLocation.MANAGED_CACHING`)

Cases that the new implementation does **NOT** support:

- Dense TBE (`DenseTableBatchedEmbeddingBagsCodegen`)
- Sequence TBE (`pooling_mode`: `PoolingMode.NONE`)
- FP8, INT8, INT4, INT2, and BF16 weight types (`weights_precision`: `SparseType.FP8`, `SparseType.INT8`, `SparseType.INT4`, `SparseType.INT2`, `SparseType.BF16`)
- FP8, INT8, INT4, INT2, and BF16 output types (`output_dtype`: `SparseType.FP8`, `SparseType.INT8`, `SparseType.INT4`, `SparseType.INT2`, `SparseType.BF16`)
- Host embedding locations (`EmbeddingLocation`: `EmbeddingLocation.HOST`)

Note that this optimization is enabled for NVIDIA GPUs but **not** for AMD GPUs.

**Usage**

The frontend changes are in D44479772.

The `FBGEMM_EXPERIMENTAL_TBE` environment variable flag is added for enabling/disabling the new implementation at runtime. If `FBGEMM_EXPERIMENTAL_TBE` is not set, TBE uses the original implementation; if `FBGEMM_EXPERIMENTAL_TBE=1`, TBE uses the new implementation. If a TBE use case is not supported by the new implementation, TBE falls back to the original implementation. By default, `FBGEMM_EXPERIMENTAL_TBE` is not set.

The new implementation can also be enabled by passing `use_experimental_tbe=True` when instantiating the TBE operator:

```
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=...,
    ...,
    use_experimental_tbe=True,
)
```
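As a rough illustration of the dispatch behavior described above, the runtime switch amounts to an opt-in check plus a capability check. The sketch below is a hypothetical Python rendering of that logic, not the actual FBGEMM code; `choose_forward_kernel` and its string-valued arguments are illustrative names only.

```python
import os

def choose_forward_kernel(pooling_mode: str,
                          weights_precision: str,
                          output_dtype: str,
                          use_experimental_tbe: bool = False) -> str:
    """Hypothetical sketch of the v1/v2 dispatch described in this summary."""
    # The new kernel is opt-in: either via the env var or the constructor flag.
    experimental = use_experimental_tbe or os.getenv("FBGEMM_EXPERIMENTAL_TBE") == "1"
    # Fall back to the original kernel for any unsupported use case.
    supported = (
        pooling_mode in ("sum", "mean")            # no PoolingMode.NONE (sequence TBE)
        and weights_precision in ("fp32", "fp16")  # no FP8/INT8/INT4/INT2/BF16 weights
        and output_dtype in ("fp32", "fp16")       # no FP8/INT8/INT4/INT2/BF16 outputs
    )
    return "v2_kernel" if (experimental and supported) else "v1_kernel"
```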
**Optimization**

The new implementation contains the following optimizations:

- Use multiple warps per bag for D > 128 to maintain a constant number of registers per thread
- Use subwarps to process subsets of input rows in a bag if D < 128
- Cooperatively compute weight pointers and store them in shared memory
- Save state variables in shared memory instead of registers to free registers for compiler optimizations
- Use the upper-bound number of warps for all tables to avoid complex warp-offset computation
- Process multiple samples (up to kWarpSize samples) in a warp for small Ls

Note: D = embedding dimension, L = pooling factor

Reviewed By: jianyuh

Differential Revision: D43634651
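To make the first two bullets concrete: with a 32-thread warp in which each thread accumulates four elements, one warp covers D = 128, so larger D requires multiple warps per bag and smaller D leaves room for subwarps. The arithmetic below is a minimal sketch under that assumption; `kMaxVecsPerThread`, `warps_per_bag`, and `subwarps_per_warp` are illustrative names, not FBGEMM identifiers.

```python
kWarpSize = 32          # threads per warp on NVIDIA GPUs
kMaxVecsPerThread = 4   # assumption: each thread accumulates 4 elements

def warps_per_bag(D: int) -> int:
    # D > 128: several warps cooperate on one bag so each thread keeps a
    # constant number of accumulator registers regardless of D.
    d_per_warp = kWarpSize * kMaxVecsPerThread  # 128 elements per warp
    return -(-D // d_per_warp)                  # ceiling division

def subwarps_per_warp(D: int) -> int:
    # D < 128: split the warp into subwarps so several input rows of the
    # same bag are processed in parallel.
    d_per_warp = kWarpSize * kMaxVecsPerThread
    return max(1, d_per_warp // max(D, 1))
```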
This pull request has been merged in d1c4a6f.
@liligwu FYI, we currently disable this functionality on ROCm due to various compilation errors. This is the optimized table-batched embedding implementation. It is currently not used by default, but this might change in the future: we are considering replacing the old implementation with the new one.
Hi @sryap, thank you for letting us know about the changes.