Add variable batch size support to TBE training by sryap · Pull Request #1752 · pytorch/FBGEMM · GitHub

Conversation

@sryap
Contributor

@sryap sryap commented May 5, 2023

Summary:
This diff adds variable batch size (or variable-length) support to split TBE training on GPU.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# Batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch sizes per feature per rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg,
# because forward takes other keyword args as well. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```
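
The `indices`/`offsets` inputs are not constructed above; here is a minimal sketch of building them for the example batch sizes, assuming `offsets` keeps the usual TBE layout of one entry per bag plus a trailing end offset. The pooling factor `L` and the index values are made up for illustration:

```
import torch

batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # Feature 0
    [6, 10, 3, 5],  # Feature 1
]

# Total number of bags across all features and ranks: 14 + 24 = 38
total_bags = sum(sum(b) for b in batch_size_per_feature_per_rank)

L = 2  # hypothetical fixed pooling factor, for illustration only
indices = torch.randint(0, 100, (total_bags * L,), dtype=torch.long)
offsets = torch.arange(0, total_bags * L + 1, L, dtype=torch.long)  # 39 entries
# (move both to GPU, e.g. with .cuda(), before calling emb_op)
```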

**Output format**

{F967393126}
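
The referenced figure is an internal attachment and is not reproduced here. As a rough illustration of the sizes involved, assuming the VBE output comes back as a single flattened tensor holding one pooled row per bag (the embedding dims below are hypothetical), the total element count for the example above works out as:

```
dims = [4, 8]  # hypothetical embedding dims for Feature 0 and Feature 1
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],
    [6, 10, 3, 5],
]

total_elems = sum(
    B * D
    for D, per_rank in zip(dims, batch_size_per_feature_per_rank)
    for B in per_rank
)
print(total_elems)  # 14 * 4 + 24 * 8 = 248
```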

**Limitation:**

`T` and `max_B` have to fit in 32 bits:

- We use the lower `info_B_num_bits` bits to store `b` (bag ID; `b < max_B`). Supported `max_B` = `2^info_B_num_bits`.
- We use the upper `32 - info_B_num_bits` bits to store `t` (table ID; `t < T`). Supported `T` = `2^(32 - info_B_num_bits)`.

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`. If they cannot fit in 32 bits, the operator aborts.
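
For concreteness, a minimal Python sketch of the packing scheme described above (the real kernels do this in C++/CUDA, and `info_B_num_bits` is chosen at runtime; the value below is just an example):

```
info_B_num_bits = 26  # example value; FBGEMM adjusts this at runtime

def pack_info(t: int, b: int) -> int:
    # Upper (32 - info_B_num_bits) bits hold the table ID t,
    # lower info_B_num_bits bits hold the bag ID b.
    assert b < (1 << info_B_num_bits)
    assert t < (1 << (32 - info_B_num_bits))
    return (t << info_B_num_bits) | b

def unpack_info(info: int):
    return info >> info_B_num_bits, info & ((1 << info_B_num_bits) - 1)

assert unpack_info(pack_info(3, 12345)) == (3, 12345)
```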

Differential Revision: D42663369

@netlify

netlify bot commented May 5, 2023

Deploy Preview for pytorch-fbgemm-docs canceled.

🔨 Latest commit: dbff94e
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/6476e964b5e0c200080d7e50

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D42663369

sryap added a commit to sryap/FBGEMM that referenced this pull request May 5, 2023
@sryap sryap force-pushed the export-D42663369 branch from 01a243a to ba2dcb6 on May 5, 2023 at 18:49

sryap added a commit to sryap/FBGEMM that referenced this pull request May 5, 2023
@sryap sryap force-pushed the export-D42663369 branch from ba2dcb6 to dcd47df on May 5, 2023 at 18:55
sryap added a commit to sryap/FBGEMM that referenced this pull request May 5, 2023
@sryap sryap force-pushed the export-D42663369 branch from dcd47df to 55f3989 on May 5, 2023 at 18:59

sryap added a commit to sryap/FBGEMM that referenced this pull request May 10, 2023
@sryap sryap force-pushed the export-D42663369 branch from 55f3989 to 37b16db on May 10, 2023 at 05:33

sryap added a commit to sryap/FBGEMM that referenced this pull request May 10, 2023
@sryap sryap force-pushed the export-D42663369 branch from 37b16db to 9ddcf46 on May 10, 2023 at 05:40

sryap added a commit to sryap/FBGEMM that referenced this pull request May 10, 2023
@sryap sryap force-pushed the export-D42663369 branch from 9ddcf46 to 78e42e4 on May 10, 2023 at 05:45

sryap added a commit to sryap/FBGEMM that referenced this pull request May 10, 2023
Summary:
Pull Request resolved: pytorch#1752

This diff adds support for variable batch size (or variable length) in
split TBE training on GPU (the extension is called "VBE").

VBE is enabled for the following use case:
- split (`SplitTableBatchedEmbeddingBagsCodegen`), and
- pooled (`pooling_mode != PoolingMode.NONE`), and
- weighted/unweighted, and
- rowwise Adagrad optimizer (`optimizer == OptimType.EXACT_ROWWISE_ADAGRAD`)

Important note: This feature is enabled for a specific use case in
order to keep the binary size of the FBGEMM library within limits.
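
For example, a hedged configuration sketch matching the supported use case; the table sizes, dims, and placement below are placeholders, not part of this diff:

```
# Module path as used around the time of this diff; later releases moved these.
from fbgemm_gpu.split_table_batched_embeddings_ops import (
    ComputeDevice,
    EmbeddingLocation,
    OptimType,
    PoolingMode,
    SplitTableBatchedEmbeddingBagsCodegen,
)

emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[
        # (num_embeddings, embedding_dim, location, compute_device)
        (1000, 4, EmbeddingLocation.DEVICE, ComputeDevice.CUDA),
        (2000, 8, EmbeddingLocation.DEVICE, ComputeDevice.CUDA),
    ],
    pooling_mode=PoolingMode.SUM,               # pooled, i.e., not PoolingMode.NONE
    optimizer=OptimType.EXACT_ROWWISE_ADAGRAD,  # the optimizer VBE supports
)
```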


Differential Revision: D42663369

fbshipit-source-id: 7e9acf65e33f57b2ec876a0565ab80cd9e0fd3f8
@sryap sryap force-pushed the export-D42663369 branch from 78e42e4 to 7784112 on May 10, 2023 at 06:28

sryap added a commit to sryap/FBGEMM that referenced this pull request May 16, 2023
@sryap sryap force-pushed the export-D42663369 branch from 7784112 to e51cdf8 on May 16, 2023 at 17:24

sryap added a commit to sryap/FBGEMM that referenced this pull request May 16, 2023
@sryap sryap force-pushed the export-D42663369 branch from e51cdf8 to 837c35f on May 16, 2023 at 17:29

sryap added a commit to sryap/FBGEMM that referenced this pull request May 17, 2023
@sryap sryap force-pushed the export-D42663369 branch from 837c35f to 5afde6b on May 17, 2023 at 18:50

sryap added a commit to sryap/FBGEMM that referenced this pull request May 17, 2023
@sryap sryap force-pushed the export-D42663369 branch from 5afde6b to dd47202 on May 17, 2023 at 18:55

sryap added a commit to sryap/FBGEMM that referenced this pull request May 19, 2023
@sryap sryap force-pushed the export-D42663369 branch from dd47202 to 4479de2 on May 19, 2023 at 17:28

sryap added a commit to sryap/FBGEMM that referenced this pull request May 31, 2023
@sryap sryap force-pushed the export-D42663369 branch from 4479de2 to ebadaf1 on May 31, 2023 at 06:22

Summary:
Pull Request resolved: pytorch#1752

This diff adds support for variable batch size (or variable length) in
split TBE training on GPU (the extension is called "VBE").


Reviewed By: jianyuh

Differential Revision: D42663369

fbshipit-source-id: d613b0a9ced838e3ae8b421a1e5a30de8b158e69
@sryap sryap force-pushed the export-D42663369 branch from ebadaf1 to dbff94e on May 31, 2023 at 06:29

@facebook-github-bot
Contributor

This pull request has been merged in 05bf018.
