Support Expert Parallelism by NouamaneTazi · Pull Request #72 · huggingface/nanotron

Conversation

NouamaneTazi (Member)

No description provided.

@NouamaneTazi NouamaneTazi requested a review from xrsrke February 16, 2024 03:49
dp: Number of DP replicas
pp: Number of PP stages
tp: Number of TP replicas
expert_parallel_size: Number of expert parallel replicas (used only for MoEs)
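
For illustration, the four fields above might map onto a config object along these lines; the dataclass below is a hypothetical sketch, not nanotron's actual config class:

```python
from dataclasses import dataclass

@dataclass
class ParallelismConfig:  # hypothetical name, for illustration only
    dp: int = 1                    # number of DP replicas
    pp: int = 1                    # number of PP stages
    tp: int = 1                    # number of TP replicas
    expert_parallel_size: int = 1  # number of expert parallel replicas (MoEs only)

    @property
    def world_size(self) -> int:
        # The four axes are independent, so the required GPU count is their product.
        return self.dp * self.pp * self.tp * self.expert_parallel_size
```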
Member
Shouldn't expert_parallel_size be the number of experts per TP rank?

NouamaneTazi (Member Author)

Not quite. Expert parallelism is orthogonal to TP; for example, you can have 1 expert sharded across 2 TP ranks.
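
To make this concrete, here is a small worked example (illustrative numbers, not from this PR): with tp=2 and expert_parallel_size=2, expert weights are split twice over, once across EP groups and once within each group by TP.

```python
# Illustrative only: EP and TP are separate axes of the process grid.
dp, pp, tp, ep = 1, 1, 2, 2
world_size = dp * pp * tp * ep            # 4 GPUs in total

num_experts = 4
experts_per_ep_group = num_experts // ep  # each EP group owns 2 of the 4 experts
# Within an EP group, each owned expert's weights are still sharded
# across the tp=2 ranks -- so 1 expert can be sharded along 2 TP ranks
# even while experts are also distributed across EP groups.
print(world_size, experts_per_ep_group)   # 4 2
```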

@NouamaneTazi NouamaneTazi marked this pull request as ready for review February 16, 2024 16:41
@NouamaneTazi NouamaneTazi requested a review from xrsrke February 16, 2024 16:43
@NouamaneTazi NouamaneTazi merged commit b21538c into main Feb 16, 2024