Add variable batch size support to TBE training by sryap · Pull Request #1752 · pytorch/FBGEMM · GitHub

Conversation

@sryap
Contributor

@sryap sryap commented May 5, 2023

Summary:
This diff adds variable batch size (or variable-length) support to split TBE training on GPU.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# Batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch sizes per feature per rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg,
# because forward takes other keyword args as well. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```
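
The `indices`/`offsets` inputs are not constructed above; here is a minimal sketch of building them for the example batch sizes, assuming `offsets` keeps the usual TBE layout of one entry per bag plus a trailing end offset. The pooling factor `L` and the index values are made up for illustration:

```
import torch

batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # Feature 0
    [6, 10, 3, 5],  # Feature 1
]

# Total number of bags across all features and ranks: 14 + 24 = 38
total_bags = sum(sum(b) for b in batch_size_per_feature_per_rank)

L = 2  # hypothetical fixed pooling factor, for illustration only
indices = torch.randint(0, 100, (total_bags * L,), dtype=torch.long)
offsets = torch.arange(0, total_bags * L + 1, L, dtype=torch.long)  # 39 entries
# (move both to GPU, e.g. with .cuda(), before calling emb_op)
```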

**Output format**

{F967393126}
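
The referenced figure is an internal attachment and is not reproduced here. As a rough illustration of the sizes involved, assuming the VBE output comes back as a single flattened tensor holding one pooled row per bag (the embedding dims below are hypothetical), the total element count for the example above works out as:

```
dims = [4, 8]  # hypothetical embedding dims for Feature 0 and Feature 1
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],
    [6, 10, 3, 5],
]

total_elems = sum(
    B * D
    for D, per_rank in zip(dims, batch_size_per_feature_per_rank)
    for B in per_rank
)
print(total_elems)  # 14 * 4 + 24 * 8 = 248
```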

**Limitation:**

`T` and `max_B` have to fit in 32 bits:

- We use the lower `info_B_num_bits` bits to store `b` (bag ID; `b < max_B`). Supported `max_B` = `2^info_B_num_bits`.
- We use the upper `32 - info_B_num_bits` bits to store `t` (table ID; `t < T`). Supported `T` = `2^(32 - info_B_num_bits)`.

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`. If they cannot fit in 32 bits, the operator aborts.
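
For concreteness, a minimal Python sketch of the packing scheme described above (the real kernels do this in C++/CUDA, and `info_B_num_bits` is chosen at runtime; the value below is just an example):

```
info_B_num_bits = 26  # example value; FBGEMM adjusts this at runtime

def pack_info(t: int, b: int) -> int:
    # Upper (32 - info_B_num_bits) bits hold the table ID t,
    # lower info_B_num_bits bits hold the bag ID b.
    assert b < (1 << info_B_num_bits)
    assert t < (1 << (32 - info_B_num_bits))
    return (t << info_B_num_bits) | b

def unpack_info(info: int):
    return info >> info_B_num_bits, info & ((1 << info_B_num_bits) - 1)

assert unpack_info(pack_info(3, 12345)) == (3, 12345)
```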

Differential Revision: D42663369

@netlify

netlify bot commented May 5, 2023

Deploy Preview for pytorch-fbgemm-docs canceled.

🔨 Latest commit: dbff94e
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/6476e964b5e0c200080d7e50

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D42663369

sryap added a commit to sryap/FBGEMM that referenced this pull request May 5, 2023
@sryap sryap force-pushed the export-D42663369 branch from 01a243a to ba2dcb6 on May 5, 2023 at 18:49

sryap added a commit to sryap/FBGEMM that referenced this pull request May 5, 2023
@sryap sryap force-pushed the export-D42663369 branch from ba2dcb6 to dcd47df on May 5, 2023 at 18:55
sryap added a commit to sryap/FBGEMM that referenced this pull request May 5, 2023
@sryap sryap force-pushed the export-D42663369 branch from dcd47df to 55f3989 on May 5, 2023 at 18:59

sryap added a commit to sryap/FBGEMM that referenced this pull request May 10, 2023
@sryap sryap force-pushed the export-D42663369 branch from 55f3989 to 37b16db on May 10, 2023 at 05:33

sryap added a commit to sryap/FBGEMM that referenced this pull request May 10, 2023
@sryap sryap force-pushed the export-D42663369 branch from 37b16db to 9ddcf46 on May 10, 2023 at 05:40

sryap added a commit to sryap/FBGEMM that referenced this pull request May 10, 2023
@sryap sryap force-pushed the export-D42663369 branch from 9ddcf46 to 78e42e4 on May 10, 2023 at 05:45

sryap added a commit to sryap/FBGEMM that referenced this pull request May 10, 2023
Summary:
Pull Request resolved: pytorch#1752

This diff adds support for variable batch size (or variable length) in
split TBE training on GPU (the extension is called "VBE").

VBE is enabled for the following use case:
- split (`SplitTableBatchedEmbeddingBagsCodegen`), and
- pooled (`pooling_mode != PoolingMode.NONE`), and
- weighted/unweighted, and
- rowwise Adagrad optimizer (`optimizer == OptimType.EXACT_ROWWISE_ADAGRAD`)

Important note: This feature is enabled for a specific use case in
order to keep the binary size of the FBGEMM library within limits.
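
For example, a hedged configuration sketch matching the supported use case; the table sizes, dims, and placement below are placeholders, not part of this diff:

```
# Module path as used around the time of this diff; later releases moved these.
from fbgemm_gpu.split_table_batched_embeddings_ops import (
    ComputeDevice,
    EmbeddingLocation,
    OptimType,
    PoolingMode,
    SplitTableBatchedEmbeddingBagsCodegen,
)

emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[
        # (num_embeddings, embedding_dim, location, compute_device)
        (1000, 4, EmbeddingLocation.DEVICE, ComputeDevice.CUDA),
        (2000, 8, EmbeddingLocation.DEVICE, ComputeDevice.CUDA),
    ],
    pooling_mode=PoolingMode.SUM,               # pooled, i.e., not PoolingMode.NONE
    optimizer=OptimType.EXACT_ROWWISE_ADAGRAD,  # the optimizer VBE supports
)
```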


Differential Revision: D42663369

fbshipit-source-id: 7e9acf65e33f57b2ec876a0565ab80cd9e0fd3f8
@sryap sryap force-pushed the export-D42663369 branch from 78e42e4 to 7784112 on May 10, 2023 at 06:28

sryap added a commit to sryap/FBGEMM that referenced this pull request May 16, 2023
@sryap sryap force-pushed the export-D42663369 branch from 7784112 to e51cdf8 on May 16, 2023 at 17:24

sryap added a commit to sryap/FBGEMM that referenced this pull request May 16, 2023
@sryap sryap force-pushed the export-D42663369 branch from e51cdf8 to 837c35f on May 16, 2023 at 17:29

sryap added a commit to sryap/FBGEMM that referenced this pull request May 17, 2023
@sryap sryap force-pushed the export-D42663369 branch from 837c35f to 5afde6b on May 17, 2023 at 18:50

sryap added a commit to sryap/FBGEMM that referenced this pull request May 17, 2023
@sryap sryap force-pushed the export-D42663369 branch from 5afde6b to dd47202 on May 17, 2023 at 18:55

sryap added a commit to sryap/FBGEMM that referenced this pull request May 19, 2023
@sryap sryap force-pushed the export-D42663369 branch from dd47202 to 4479de2 on May 19, 2023 at 17:28

sryap added a commit to sryap/FBGEMM that referenced this pull request May 31, 2023
@sryap sryap force-pushed the export-D42663369 branch from 4479de2 to ebadaf1 on May 31, 2023 at 06:22

Summary:
Pull Request resolved: pytorch#1752

This diff adds support for variable batch size (or variable length) in
split TBE training on GPU (the extension is called "VBE").


Reviewed By: jianyuh

Differential Revision: D42663369

fbshipit-source-id: d613b0a9ced838e3ae8b421a1e5a30de8b158e69
@sryap sryap force-pushed the export-D42663369 branch from ebadaf1 to dbff94e on May 31, 2023 at 06:29

@facebook-github-bot
Contributor

This pull request has been merged in 05bf018.
