add transformers + openai_gpt_oss on modal to run #179
Conversation
Signed-off-by: weedge <weege007@gmail.com>
Summary of Changes
Hello @weedge, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a new set of functionalities to run OpenAI's GPT-OSS models on Modal using the Hugging Face Transformers library. It provides a comprehensive framework for various LLM operations, including model downloading, tokenization, text generation (both standard and streaming), and advanced features like OpenAI Harmony integration for structured conversations and tool usage, with support for multi-GPU inference.
Highlights
- Modal Integration for GPT-OSS: I've added a new Python script (`openai_gpt_oss.py`) that enables running OpenAI's GPT-OSS models on Modal, leveraging its scalable infrastructure for LLM inference (a minimal usage sketch follows this list).
- Comprehensive LLM Task Support: The new script includes functions for a wide range of LLM tasks, such as tokenizing inputs with chat templates, text generation via Hugging Face pipelines, standard generation, and streaming generation.
- OpenAI Harmony Integration: I've incorporated `openai-harmony` to demonstrate structured conversation handling, including streaming token decoding and advanced tool-calling capabilities, showcasing how to build more complex LLM applications.
- Multi-GPU Inference Capabilities: The implementation supports distributing large models across multiple GPUs for efficient generation, with a `split_model` function to manage device mapping.
- Enhanced Model Download Process: I've updated the `download_models.py` script to increase CPU resources, extend timeout durations, and boost parallel workers, ensuring more robust and faster downloads for large language models.
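For orientation, here is a minimal, hedged sketch of what running a GPT-OSS checkpoint on Modal with the Transformers pipeline can look like. It is not the PR's actual `openai_gpt_oss.py`; the GPU type, image contents, and `MODEL_ID` below are illustrative assumptions.

```python
# Minimal sketch: serve a GPT-OSS generation call from a Modal GPU function.
# Package pins and extra kernel dependencies are assumptions, not the PR's.
import modal

app = modal.App("openai-gpt-oss-demo")

image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("transformers", "accelerate", "torch")
)

MODEL_ID = "openai/gpt-oss-20b"  # assumption: the 20B checkpoint


@app.function(image=image, gpu="H100", timeout=60 * 30)
def generate(prompt: str) -> str:
    from transformers import pipeline

    # device_map="auto" lets Accelerate place the (possibly sharded) weights.
    pipe = pipeline(
        "text-generation",
        model=MODEL_ID,
        torch_dtype="auto",
        device_map="auto",
    )
    messages = [{"role": "user", "content": prompt}]
    out = pipe(messages, max_new_tokens=256)
    # Chat-style input returns the full message list; take the new assistant turn.
    return out[0]["generated_text"][-1]["content"]


@app.local_entrypoint()
def main():
    print(generate.remote("Explain MXFP4 in one paragraph."))
```

`device_map="auto"` lets Accelerate place (and, if needed, shard) the weights across whatever GPUs the Modal function requests, which is the same mechanism an explicit device map such as `split_model` spells out by hand.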
Code Review
This pull request adds support for running OpenAI's GPT-OSS models on Modal using the transformers library. It includes a new script with various tasks for tokenization, generation, streaming, and multi-GPU inference. The changes to `download_models.py` adjust resource allocation for downloading large models.
My review focuses on the new `openai_gpt_oss.py` script. I've identified a couple of correctness issues in the `dump_model` and `split_model` functions that could lead to incorrect behavior. I've also found some typos in the example commands that would cause runtime errors. I've provided suggestions to fix these issues.
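The PR's actual `split_model` is not reproduced here; as a point of reference, a generic, hedged sketch of a layer-wise device map for multi-GPU inference might look like the following (the module names follow a common decoder layout and are assumptions, not necessarily gpt-oss's):

```python
# Hedged sketch: spread decoder layers evenly across GPUs and pin the
# embedding / final-norm / lm_head modules explicitly.
import torch


def split_model(num_layers: int, num_gpus: int | None = None) -> dict[str, int]:
    """Build a {module_name: gpu_index} device map for from_pretrained(...)."""
    num_gpus = num_gpus or torch.cuda.device_count()
    per_gpu = -(-num_layers // num_gpus)  # ceiling division
    device_map = {f"model.layers.{i}": i // per_gpu for i in range(num_layers)}
    device_map["model.embed_tokens"] = 0
    device_map["model.norm"] = num_gpus - 1
    # If input/output embeddings are tied, lm_head may need to share a device
    # with embed_tokens; it is placed on the last GPU here for the untied case.
    device_map["lm_head"] = num_gpus - 1
    return device_map


# Hypothetical usage: AutoModelForCausalLM.from_pretrained(..., device_map=split_model(24, 2))
```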
| Model | Parameters |
|---|---|
| openai/gpt-oss-20b | 20914.757184 M |
| openai/gpt-oss-20b (after Mxfp4 Quant, GptOssExperts -> Mxfp4GptOssExperts) | 1804.459584 M |
| openai/gpt-oss-120b | 116829.156672 M |
| openai/gpt-oss-120b (after Mxfp4 Quant, GptOssExperts -> Mxfp4GptOssExperts) | 2167.371072 M |
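For context, counts like these can be produced by summing tensor element counts after loading the model; a minimal, hedged sketch follows. The loading arguments are assumptions, and the sharp drop after Mxfp4 quantization presumably reflects that the quantized expert weights are no longer exposed as ordinary parameters.

```python
# Hedged sketch: report a model's parameter count in millions.
from transformers import AutoModelForCausalLM

model_id = "openai/gpt-oss-20b"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # keep the checkpoint's dtype / quantization config
    device_map="cpu",    # counting only; no GPU required
)
num_params = sum(p.numel() for p in model.parameters())
print(f"{model_id} {num_params / 1e6:.6f} M parameters")
```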
text tokenizer:
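Since the tokenizer output itself is collapsed above, here is a minimal, hedged sketch of how the gpt-oss text tokenizer and its chat template can be inspected (the model ID and messages are illustrative):

```python
# Hedged sketch: render a chat conversation with the model's chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is MXFP4?"},
]
# Render the conversation into the prompt format the model expects,
# leaving the assistant turn open for generation.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
)
print(tokenizer.decode(input_ids[0]))
```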
Differences between MXFP4 and FP4

MXFP4 (Microscaling FP4) and FP4 are both low-precision floating-point formats used for AI model quantization (e.g., in training and inference). Their main goal is to reduce memory footprint and compute cost while preserving model accuracy as much as possible. Both are based on a 4-bit floating-point representation, but MXFP4 is an extension of FP4 that introduces a microscaling mechanism. The comparison below covers definition, structure, mechanism, and application.

Basic definition

Structure and bit layout

Both use the E2M1 format at their core (1 sign bit, 2 exponent bits, 1 mantissa bit), but MXFP4 extends the structure:

Value representation and scaling mechanism

Comparison table

The key differences, compared side by side in table form:

Applications and advantages

Overall, MXFP4 is an "upgraded" FP4: the microscaling mechanism addresses FP4's limited dynamic range, making it better suited to modern AI workloads. For a finer-grained comparison, see the OCP MX specification or NVIDIA's low-precision format documentation.
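To make the E2M1 value set and the shared power-of-two scale concrete, here is an illustrative, hedged sketch; the block size and rounding are simplified (real MX blocks hold 32 elements and the scale is stored as an E8M0 value following the OCP MX rules).

```python
# Illustrative sketch of FP4 E2M1 magnitudes and MX-style block scaling.
import math

# All non-negative magnitudes representable by E2M1
# (1 sign, 2 exponent, 1 mantissa bit, exponent bias 1).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block_mxfp4(block: list[float]) -> tuple[float, list[float]]:
    """Return (shared power-of-two scale, dequantized values) for one block."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Shared scale: power of two chosen so amax lands near E2M1's max (6 = 1.5 * 2**2).
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    out = []
    for x in block:
        # Round |x| / scale to the nearest representable magnitude (values
        # beyond 6.0 saturate), then restore the sign and the scale.
        mag = min(E2M1_MAGNITUDES, key=lambda m: abs(abs(x) / scale - m))
        out.append(math.copysign(mag * scale, x))
    return scale, out


block = [0.03, -0.7, 1.2, 5.5, -0.01, 2.9, 0.4, -3.3]
scale, deq = quantize_block_mxfp4(block)
print("shared scale:", scale)   # 1.0 for this block
print("dequantized :", deq)
```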
Differences between MXFP4 and NVFP4

MXFP4 (Microscaling FP4) and NVFP4 (NVIDIA FP4) are both low-precision 4-bit floating-point formats for AI model quantization (e.g., in training and inference), designed to reduce memory footprint and compute cost while maintaining model accuracy. Both rely on a microscaling mechanism and a block floating-point structure in which the elements of a block share a scale factor. MXFP4 is an open standard from the Open Compute Project (OCP), while NVFP4 is NVIDIA's proprietary format, optimized primarily for its Blackwell architecture. NVFP4 can be viewed as a variant of MXFP4 with a finer-grained design aimed at improving accuracy.

Basic structure

Both use FP4 (E2M1) as the element data type: each element occupies 4 bits, comprising 1 sign bit, 2 exponent bits, and 1 mantissa bit. The exponent bias is 1, giving a representable range of roughly ±0.5 to ±6.0 (including normal and subnormal values), with no Inf or NaN. However, NVFP4 refines the scaling mechanism by introducing two levels of scaling (per-block and per-tensor), whereas MXFP4 uses only a single level of block scaling.

Key differences

Comparison table

The following table summarizes the key differences side by side:

Applications and advantages

Overall, NVFP4 is a refinement of MXFP4 that prioritizes accuracy over minimum cost, making it well suited to high-accuracy requirements within the NVIDIA ecosystem. For deployment on specific hardware, consult the NVIDIA documentation or the OCP specification for further evaluation.
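To contrast the two schemes numerically, here is a hedged sketch of the scale computations; the block sizes, the E4M3 maximum of 448, and the per-tensor formula follow NVIDIA's published NVFP4 recipe as I understand it, and encoding/rounding details are simplified, so treat the specifics as assumptions.

```python
# Hedged sketch contrasting the scaling schemes described above.
# MXFP4: 32-element blocks, one power-of-two (E8M0-style) scale per block.
# NVFP4: 16-element blocks, one FP8 (E4M3) scale per block plus one FP32
# per-tensor scale.
import math

FP4_MAX = 6.0          # largest E2M1 magnitude
FP8_E4M3_MAX = 448.0   # largest E4M3 magnitude


def mxfp4_block_scale(block: list[float]) -> float:
    """Single-level MX scale: a power of two derived from the block amax."""
    amax = max(abs(x) for x in block)
    return 2.0 ** (math.floor(math.log2(amax)) - 2) if amax else 1.0


def nvfp4_scales(tensor: list[float], block_size: int = 16) -> tuple[float, list[float]]:
    """Two-level NVFP4 scales: FP32 per-tensor scale + per-block scales in FP8 range."""
    tensor_amax = max(abs(x) for x in tensor)
    # Per-tensor scale maps the largest block scale into E4M3's range.
    per_tensor = tensor_amax / (FP4_MAX * FP8_E4M3_MAX) if tensor_amax else 1.0
    per_block = []
    for i in range(0, len(tensor), block_size):
        block = tensor[i:i + block_size]
        block_amax = max(abs(x) for x in block)
        # Real-valued block scale; in hardware this is rounded to an E4M3 value.
        per_block.append(block_amax / FP4_MAX / per_tensor if block_amax else 1.0)
    return per_tensor, per_block


vals = [math.sin(i) * (i % 7 + 1) for i in range(32)]
pt, pb = nvfp4_scales(vals)
print("per-tensor scale:", pt)
print("per-block scales:", pb)
print("MXFP4 scale of the same 32 values:", mxfp4_block_scale(vals))
```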

colab:
AI generated contents:
reference
code