add transformers + openai_gpt_oss on modal to run by weedge · Pull Request #179 · ai-bot-pro/achatbot · GitHub

Conversation

@weedge
Collaborator

@weedge weedge commented Aug 6, 2025


colab:


AI generated contents:


# NOTE: you can use a text tokenizer library (tiktoken via openai-harmony, or the HF tokenizers library) together with the GPT-OSS LLM generator

modal run src/download_models.py --repo-ids "openai/gpt-oss-20b"
modal run src/download_models.py --repo-ids "openai/gpt-oss-120b" --ignore-patterns "*.pt|*.bin|*original*|*metal*"
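
Under the hood this presumably wraps huggingface_hub's snapshot_download; a minimal sketch, assuming the pipe-separated --ignore-patterns flag is split into an ignore_patterns list and the worker count matches the bumped download settings:

```python
# Hedged sketch of the download step (not the script's actual code): fetch a
# checkpoint while skipping the pytorch/original/metal variants of the weights.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="openai/gpt-oss-120b",
    ignore_patterns="*.pt|*.bin|*original*|*metal*".split("|"),  # maps the CLI flag (assumption)
    max_workers=8,  # more parallel workers for the large safetensors shards
)
```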


modal run src/llm/transformers/openai_gpt_oss.py --task tokenize --reasoning low 
modal run src/llm/transformers/openai_gpt_oss.py --task tokenize --reasoning medium
modal run src/llm/transformers/openai_gpt_oss.py --task tokenize --reasoning high
modal run src/llm/transformers/openai_gpt_oss.py --task tokenize --reasoning low  --model-identity "you are a helpful assistant."
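
The tokenize task boils down to rendering the chat template; a rough sketch, assuming the gpt-oss template forwards a reasoning_effort kwarg and that --model-identity ends up as the system/identity text (both are assumptions about the script):

```python
# Minimal sketch of the tokenize task: render the Harmony-style chat template and
# inspect the resulting prompt. reasoning_effort / the identity message are assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
messages = [
    {"role": "system", "content": "you are a helpful assistant."},  # --model-identity
    {"role": "user", "content": "What is the capital of France?"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    reasoning_effort="low",  # --reasoning low|medium|high (assumed template kwarg)
)
print(tokenizer.decode(input_ids[0]))
```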

IMAGE_GPU=L40s modal run src/llm/transformers/openai_gpt_oss.py --task dump_model
IMAGE_GPU=H100:4 LLM_MODEL="openai/gpt-oss-120b" modal run src/llm/transformers/openai_gpt_oss.py --task dump_model

QUANTIZATION=mxfp4 IMAGE_GPU=T4 modal run src/llm/transformers/openai_gpt_oss.py --task dump_model
QUANTIZATION=mxfp4 IMAGE_GPU=A100-80GB LLM_MODEL="openai/gpt-oss-120b" modal run src/llm/transformers/openai_gpt_oss.py --task dump_model
QUANTIZATION=mxfp4 IMAGE_GPU=H100 LLM_MODEL="openai/gpt-oss-120b" modal run src/llm/transformers/openai_gpt_oss.py --task dump_model
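
dump_model loads the checkpoint and prints the module tree plus a parameter count (see the dumps in the comments below). A hedged sketch, assuming transformers' Mxfp4Config is used when QUANTIZATION=mxfp4:

```python
# Sketch of the dump_model task: load gpt-oss, optionally keeping the MXFP4 experts,
# then report parameters and the module tree. Exact kwargs are assumptions.
from transformers import AutoModelForCausalLM, Mxfp4Config

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto",
    quantization_config=Mxfp4Config(),  # drop this line to dequantize to bf16
)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6} M parameters")
print(model)
```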

IMAGE_GPU=L40s modal run src/llm/transformers/openai_gpt_oss.py --task pipe
IMAGE_GPU=H100 modal run src/llm/transformers/openai_gpt_oss.py --task pipe
QUANTIZATION=mxfp4 IMAGE_GPU=T4 modal run src/llm/transformers/openai_gpt_oss.py --task pipe
QUANTIZATION=mxfp4 IMAGE_GPU=A100-80GB LLM_MODEL="openai/gpt-oss-120b" modal run src/llm/transformers/openai_gpt_oss.py --task pipe
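
The pipe task corresponds to the high-level text-generation pipeline; a minimal sketch (prompt and generation settings here are illustrative):

```python
# Sketch of the pipe task: chat-style generation through transformers.pipeline.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto",
)
messages = [{"role": "user", "content": "Who are you?"}]
out = pipe(messages, max_new_tokens=256)
print(out[0]["generated_text"][-1])  # last message is the assistant turn
```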

IMAGE_GPU=L40s modal run src/llm/transformers/openai_gpt_oss.py --task generate
IMAGE_GPU=H100 modal run src/llm/transformers/openai_gpt_oss.py --task generate
QUANTIZATION=mxfp4 IMAGE_GPU=T4 modal run src/llm/transformers/openai_gpt_oss.py --task generate
QUANTIZATION=mxfp4 IMAGE_GPU=A100-80GB LLM_MODEL="openai/gpt-oss-120b" modal run src/llm/transformers/openai_gpt_oss.py --task generate
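
The generate task goes through tokenizer + model.generate directly instead of the pipeline; again a sketch with illustrative settings:

```python
# Sketch of the generate task: render the chat template, call model.generate,
# and decode only the newly produced tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain what MXFP4 is."}],
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```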

IMAGE_GPU=L40s modal run src/llm/transformers/openai_gpt_oss.py --task generate_stream --reasoning low
IMAGE_GPU=L40s modal run src/llm/transformers/openai_gpt_oss.py --task generate_stream --reasoning medium
IMAGE_GPU=L40s modal run src/llm/transformers/openai_gpt_oss.py --task generate_stream --reasoning high 
QUANTIZATION=mxfp4 IMAGE_GPU=T4 modal run src/llm/transformers/openai_gpt_oss.py --task generate_stream --reasoning high
QUANTIZATION=mxfp4 IMAGE_GPU=A100-80GB LLM_MODEL="openai/gpt-oss-120b" modal run src/llm/transformers/openai_gpt_oss.py --task generate_stream --reasoning high
IMAGE_GPU=H100 modal run src/llm/transformers/openai_gpt_oss.py --task generate_stream --reasoning low
IMAGE_GPU=H100 modal run src/llm/transformers/openai_gpt_oss.py --task generate_stream --reasoning medium
IMAGE_GPU=H100 modal run src/llm/transformers/openai_gpt_oss.py --task generate_stream --reasoning high 
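
generate_stream runs generation in a background thread and yields text chunks through TextIteratorStreamer; a minimal sketch under that assumption:

```python
# Sketch of the generate_stream task: stream decoded text while the model generates.
from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about GPUs."}],
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
Thread(target=model.generate, kwargs=dict(**inputs, max_new_tokens=512, streamer=streamer)).start()
for chunk in streamer:
    print(chunk, end="", flush=True)
```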

modal run src/llm/transformers/openai_gpt_oss.py --task openai_harmony_stream_decode_unicode

IMAGE_GPU=L40s modal run src/llm/transformers/openai_gpt_oss.py --task openai_harmony_generate --reasoning low
IMAGE_GPU=L40s modal run src/llm/transformers/openai_gpt_oss.py --task openai_harmony_generate --reasoning medium
IMAGE_GPU=L40s modal run src/llm/transformers/openai_gpt_oss.py --task openai_harmony_generate --reasoning high
QUANTIZATION=mxfp4 IMAGE_GPU=T4 modal run src/llm/transformers/openai_gpt_oss.py --task openai_harmony_generate --reasoning high
QUANTIZATION=mxfp4 IMAGE_GPU=A100-80GB LLM_MODEL="openai/gpt-oss-120b" modal run src/llm/transformers/openai_gpt_oss.py --task openai_harmony_generate --reasoning high
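
The openai_harmony_generate tasks build the prompt with openai-harmony instead of the HF chat template, then parse the completion back into channel-separated messages (analysis vs. final). A sketch following the openai-harmony README; the script's exact wiring is an assumption:

```python
# Sketch: render a Harmony conversation for completion and parse the result.
from openai_harmony import (
    Conversation, HarmonyEncodingName, Message, ReasoningEffort, Role,
    SystemContent, load_harmony_encoding,
)

enc = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
convo = Conversation.from_messages([
    Message.from_role_and_content(Role.SYSTEM, SystemContent.new().with_reasoning_effort(ReasoningEffort.HIGH)),
    Message.from_role_and_content(Role.USER, "What is the capital of France?"),
])
prompt_ids = enc.render_conversation_for_completion(convo, Role.ASSISTANT)
stop_ids = enc.stop_tokens_for_assistant_actions()

# Feed prompt_ids to model.generate(..., eos_token_id=stop_ids), collect completion_ids, then:
# for m in enc.parse_messages_from_completion_tokens(completion_ids, Role.ASSISTANT):
#     print(m.channel, m.content)  # "analysis" = reasoning, "final" = the answer
```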

IMAGE_GPU=L40s modal run src/llm/transformers/openai_gpt_oss.py --task openai_harmony_generate_tool --reasoning low
IMAGE_GPU=L40s modal run src/llm/transformers/openai_gpt_oss.py --task openai_harmony_generate_tool --reasoning medium
IMAGE_GPU=L40s modal run src/llm/transformers/openai_gpt_oss.py --task openai_harmony_generate_tool --reasoning high
QUANTIZATION=mxfp4 IMAGE_GPU=T4 modal run src/llm/transformers/openai_gpt_oss.py --task openai_harmony_generate_tool --reasoning high
QUANTIZATION=mxfp4 IMAGE_GPU=A100-80GB LLM_MODEL="openai/gpt-oss-120b" modal run src/llm/transformers/openai_gpt_oss.py --task openai_harmony_generate_tool --reasoning high
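
The tool variant adds a function tool to the developer message so the model can emit a tool call on the commentary channel; the tool name and schema below are hypothetical examples, not taken from the script:

```python
# Sketch of openai_harmony_generate_tool: declare a function tool, render the
# conversation, and (after generation) read the tool call out of the parsed messages.
from openai_harmony import (
    Conversation, DeveloperContent, HarmonyEncodingName, Message, Role,
    SystemContent, ToolDescription, load_harmony_encoding,
)

enc = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
weather_tool = ToolDescription.new(
    "get_current_weather",                                   # hypothetical tool
    "Gets the current weather in the provided location.",
    parameters={
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"],
    },
)
convo = Conversation.from_messages([
    Message.from_role_and_content(Role.SYSTEM, SystemContent.new()),
    Message.from_role_and_content(Role.DEVELOPER, DeveloperContent.new().with_function_tools([weather_tool])),
    Message.from_role_and_content(Role.USER, "What's the weather in Tokyo?"),
])
prompt_ids = enc.render_conversation_for_completion(convo, Role.ASSISTANT)
# A resulting tool call shows up as a parsed message addressed to
# "functions.get_current_weather" with JSON arguments in its content.
```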

IMAGE_GPU=L4:3 modal run src/llm/transformers/openai_gpt_oss.py --task split_model

IMAGE_GPU=L4:3 modal run src/llm/transformers/openai_gpt_oss.py --task multi_gpu_generate --reasoning low 
IMAGE_GPU=L4:3 modal run src/llm/transformers/openai_gpt_oss.py --task multi_gpu_generate --reasoning medium
IMAGE_GPU=L4:3 modal run src/llm/transformers/openai_gpt_oss.py --task multi_gpu_generate --reasoning high
IMAGE_GPU=L40s:1 modal run src/llm/transformers/openai_gpt_oss.py --task multi_gpu_generate
IMAGE_GPU=L40s:2 modal run src/llm/transformers/openai_gpt_oss.py --task multi_gpu_generate --reasoning high
IMAGE_GPU=H100:1 modal run src/llm/transformers/openai_gpt_oss.py --task multi_gpu_generate
IMAGE_GPU=H100:2 modal run src/llm/transformers/openai_gpt_oss.py --task multi_gpu_generate --reasoning high
QUANTIZATION=mxfp4 IMAGE_GPU=L4:4 LLM_MODEL="openai/gpt-oss-120b" modal run src/llm/transformers/openai_gpt_oss.py --task multi_gpu_generate --reasoning high
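
split_model / multi_gpu_generate spread the decoder layers across the visible GPUs. A rough sketch of the idea with an explicit device_map (the script's actual balancing may differ; device_map="auto" is the simpler alternative):

```python
# Sketch: place embeddings on GPU 0, the norm/lm_head on the last GPU, and split
# the 24 (20b) or 36 (120b) decoder layers evenly across GPUs.
import torch
from transformers import AutoModelForCausalLM

def split_model(num_layers: int, num_gpus: int) -> dict:
    device_map = {
        "model.embed_tokens": 0,
        "model.rotary_emb": 0,
        "model.norm": num_gpus - 1,
        "lm_head": num_gpus - 1,
    }
    per_gpu = (num_layers + num_gpus - 1) // num_gpus
    for i in range(num_layers):
        device_map[f"model.layers.{i}"] = min(i // per_gpu, num_gpus - 1)
    return device_map

device_map = split_model(num_layers=24, num_gpus=torch.cuda.device_count())
model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b", torch_dtype="auto", device_map=device_map)
```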

reference

code

Signed-off-by: weedge <weege007@gmail.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @weedge, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new set of functionalities to run OpenAI's GPT-OSS models on Modal using the Hugging Face Transformers library. It provides a comprehensive framework for various LLM operations, including model downloading, tokenization, text generation (both standard and streaming), and advanced features like OpenAI Harmony integration for structured conversations and tool usage, with support for multi-GPU inference.

Highlights

  • Modal Integration for GPT-OSS: I've added a new Python script (openai_gpt_oss.py) that enables running OpenAI's GPT-OSS models on Modal, leveraging its scalable infrastructure for LLM inference.
  • Comprehensive LLM Task Support: The new script includes functions for a wide range of LLM tasks, such as tokenizing inputs with chat templates, performing text generation via Hugging Face pipelines, standard generation, and streaming generation.
  • OpenAI Harmony Integration: I've incorporated openai-harmony to demonstrate structured conversation handling, including streaming token decoding and advanced tool-calling capabilities, showcasing how to build more complex LLM applications.
  • Multi-GPU Inference Capabilities: The implementation now supports distributing large models across multiple GPUs for efficient generation, with a split_model function to manage device mapping.
  • Enhanced Model Download Process: I've updated the download_models.py script to increase CPU resources, extend timeout durations, and boost parallel workers, ensuring more robust and faster downloads for large language models.

@weedge weedge added modal transformers GPT openai GPT model reasoning reasoning model labels Aug 6, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for running OpenAI's GPT-OSS models on Modal using the transformers library. It includes a new script with various tasks for tokenization, generation, streaming, and multi-GPU inference. The changes to download_models.py adjust resource allocation for downloading large models.

My review focuses on the new openai_gpt_oss.py script. I've identified a couple of correctness issues in the dump_model and split_model functions that could lead to incorrect behavior. I've also found some typos in the example commands that would cause runtime errors. I've provided suggestions to fix these issues.

@weedge weedge added the MoE label Aug 6, 2025
@weedge
Collaborator Author

weedge commented Aug 6, 2025

openai/gpt-oss-20b 20914.757184 M parameters

GptOssForCausalLM(
  (model): GptOssModel(
    (embed_tokens): Embedding(201088, 2880, padding_idx=199999)
    (layers): ModuleList(
      (0-23): 24 x GptOssDecoderLayer(
        (self_attn): GptOssAttention(
          (q_proj): Linear(in_features=2880, out_features=4096, bias=True)
          (k_proj): Linear(in_features=2880, out_features=512, bias=True)
          (v_proj): Linear(in_features=2880, out_features=512, bias=True)
          (o_proj): Linear(in_features=4096, out_features=2880, bias=True)
        )
        (mlp): GptOssMLP(
          (router): GptOssTopKRouter()
          (experts): GptOssExperts()
        )
        (input_layernorm): GptOssRMSNorm((2880,), eps=1e-05)
        (post_attention_layernorm): GptOssRMSNorm((2880,), eps=1e-05)
      )
    )
    (norm): GptOssRMSNorm((2880,), eps=1e-05)
    (rotary_emb): GptOssRotaryEmbedding()
  )
  (lm_head): Linear(in_features=2880, out_features=201088, bias=False)
)

After MXFP4 quantization, GptOssExperts -> Mxfp4GptOssExperts:

openai/gpt-oss-20b 1804.459584 M parameters

GptOssForCausalLM(
  (model): GptOssModel(
    (embed_tokens): Embedding(201088, 2880, padding_idx=199999)
    (layers): ModuleList(
      (0-23): 24 x GptOssDecoderLayer(
        (self_attn): GptOssAttention(
          (q_proj): Linear(in_features=2880, out_features=4096, bias=True)
          (k_proj): Linear(in_features=2880, out_features=512, bias=True)
          (v_proj): Linear(in_features=2880, out_features=512, bias=True)
          (o_proj): Linear(in_features=4096, out_features=2880, bias=True)
        )
        (mlp): GptOssMLP(
          (router): GptOssTopKRouter()
          (experts): Mxfp4GptOssExperts()
        )
        (input_layernorm): GptOssRMSNorm((2880,), eps=1e-05)
        (post_attention_layernorm): GptOssRMSNorm((2880,), eps=1e-05)
      )
    )
    (norm): GptOssRMSNorm((2880,), eps=1e-05)
    (rotary_emb): GptOssRotaryEmbedding()
  )
  (lm_head): Linear(in_features=2880, out_features=201088, bias=False)
)

openai/gpt-oss-120b 116829.156672 M parameters

GptOssForCausalLM(
  (model): GptOssModel(
    (embed_tokens): Embedding(201088, 2880, padding_idx=199999)
    (layers): ModuleList(
      (0-35): 36 x GptOssDecoderLayer(
        (self_attn): GptOssAttention(
          (q_proj): Linear(in_features=2880, out_features=4096, bias=True)
          (k_proj): Linear(in_features=2880, out_features=512, bias=True)
          (v_proj): Linear(in_features=2880, out_features=512, bias=True)
          (o_proj): Linear(in_features=4096, out_features=2880, bias=True)
        )
        (mlp): GptOssMLP(
          (router): GptOssTopKRouter()
          (experts): GptOssExperts()
        )
        (input_layernorm): GptOssRMSNorm((2880,), eps=1e-05)
        (post_attention_layernorm): GptOssRMSNorm((2880,), eps=1e-05)
      )
    )
    (norm): GptOssRMSNorm((2880,), eps=1e-05)
    (rotary_emb): GptOssRotaryEmbedding()
  )
  (lm_head): Linear(in_features=2880, out_features=201088, bias=False)
)

After MXFP4 quantization, GptOssExperts -> Mxfp4GptOssExperts:

openai/gpt-oss-120b 2167.371072 M parameters

GptOssForCausalLM(
  (model): GptOssModel(
    (embed_tokens): Embedding(201088, 2880, padding_idx=199999)
    (layers): ModuleList(
      (0-35): 36 x GptOssDecoderLayer(
        (self_attn): GptOssAttention(
          (q_proj): Linear(in_features=2880, out_features=4096, bias=True)
          (k_proj): Linear(in_features=2880, out_features=512, bias=True)
          (v_proj): Linear(in_features=2880, out_features=512, bias=True)
          (o_proj): Linear(in_features=4096, out_features=2880, bias=True)
        )
        (mlp): GptOssMLP(
          (router): GptOssTopKRouter()
          (experts): Mxfp4GptOssExperts()
        )
        (input_layernorm): GptOssRMSNorm((2880,), eps=1e-05)
        (post_attention_layernorm): GptOssRMSNorm((2880,), eps=1e-05)
      )
    )
    (norm): GptOssRMSNorm((2880,), eps=1e-05)
    (rotary_emb): GptOssRotaryEmbedding()
  )
  (lm_head): Linear(in_features=2880, out_features=201088, bias=False)
)

@weedge
Collaborator Author

weedge commented Aug 6, 2025

text tokenizer:

PreTrainedTokenizerFast(name_or_path='/root/.achatbot/models/openai/gpt-oss-20b', vocab_size=199998, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|startoftext|>', 'eos_token': '<|return|>', 'pad_token': '<|endoftext|>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	199998: AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	199999: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	200000: AddedToken("<|reserved_200000|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	200001: AddedToken("<|reserved_200001|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	200002: AddedToken("<|return|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	200003: AddedToken("<|constrain|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	200004: AddedToken("<|reserved_200004|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	200005: AddedToken("<|channel|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	200006: AddedToken("<|start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	200007: AddedToken("<|end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	200008: AddedToken("<|message|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	200009: AddedToken("<|reserved_200009|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	200010: AddedToken("<|reserved_200010|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	200011: AddedToken("<|reserved_200011|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	200012: AddedToken("<|call|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	200013: AddedToken("<|reserved_200013|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	200014: AddedToken("<|reserved_200014|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	200015: AddedToken("<|reserved_200015|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	200016: AddedToken("<|reserved_200016|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	200017: AddedToken("<|reserved_200017|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	200018: AddedToken("<|endofprompt|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

Signed-off-by: weedge <weege007@gmail.com>
@weedge
Collaborator Author

weedge commented Aug 11, 2025

Differences between MXFP4 and FP4

MXFP4 (Microscaling FP4) and FP4 are both low-precision floating-point formats used for AI model quantization (training and inference). Their main goal is to cut memory footprint and compute cost while preserving model accuracy as much as possible. Both are based on a 4-bit floating-point representation, but MXFP4 is an extension of FP4 that adds a microscaling mechanism. The comparison below covers definition, structure, scaling mechanism, and applications.

Basic definitions

  • FP4: a basic 4-bit floating-point data type commonly used for low-precision compute. It is a standalone format with no built-in block-level scaling; any scaling is handled in software. FP4 is defined as an element data type in the Open Compute Project (OCP) Microscaling Formats (MX) specification.
  • MXFP4: short for Microscaling FP4, a concrete format in the OCP MX specification. It keeps FP4 as the element data type but adds a shared scale factor, forming a block floating-point structure intended for large-scale AI operations.

Structure and bit layout

Both use the E2M1 layout at their core (1 sign bit, 2 exponent bits, 1 mantissa bit), but MXFP4 extends the structure:

  • FP4: a plain 4-bit layout with no extra bits. The exponent bias is 1; it supports normal values (±1.0 to ±6.0) and subnormals (±0.5), with no Inf or NaN encodings.
  • MXFP4: each element is still a 4-bit FP4 value, but every block of 32 elements shares one 8-bit scale factor (E8M0: unsigned, exponent-only like the exponent field of Float32, bias 127, range -127 to 127, with one NaN encoding). One MXFP4 block therefore occupies 32×4 + 8 = 136 bits.

Value representation and scaling

  • FP4: the value is computed directly as $v = (-1)^S \times 2^{E-1} \times (1 + 2^{-1} \times M)$ for normals (with the analogous subnormal formula). The dynamic range is limited (about -6 to 6); out-of-range values must be clamped to the maximum or flushed to zero during conversion, so any range adjustment relies on software scaling.
  • MXFP4: the value is computed as $v_i = X \times P_i$, where $X$ is the shared scale factor (E8M0, representing $2^{E-127}$) and $P_i$ is the FP4 element value. If $X = \text{NaN}$, every element in the block is NaN. Block-level scaling extends the dynamic range (via power-of-two scales) and reduces quantization error, but since all elements in a block share one scale, unevenly distributed data can still introduce error. A small worked example follows.
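
A quick numeric check of those formulas (illustrative only):

```python
# Enumerate the FP4 (E2M1) magnitudes, then apply an MXFP4 power-of-two block scale.
def fp4_value(sign: int, exp: int, man: int) -> float:
    if exp == 0:                       # subnormal: ±0.5 * M
        v = 0.5 * man
    else:                              # normal: ±2^(E-1) * (1 + 0.5 * M)
        v = 2 ** (exp - 1) * (1 + 0.5 * man)
    return -v if sign else v

print(sorted({fp4_value(0, e, m) for e in range(4) for m in range(2)}))
# -> [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

scale_bits = 130                       # E8M0 field: X = 2^(130 - 127) = 8
X = 2.0 ** (scale_bits - 127)
print(X * fp4_value(0, 1, 1))          # v_i = X * P_i = 8 * 1.5 = 12.0
```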

Comparison table

Key differences at a glance:

| Aspect | FP4 | MXFP4 |
| --- | --- | --- |
| Bit width | 4 bits (E2M1) | 4-bit element (E2M1) + shared 8-bit scale (E8M0) |
| Block size | None (each element stands alone) | 32 elements share one scale |
| Scaling | No built-in hardware scaling; relies on software | Hardware-accelerated power-of-two shared scaling |
| Dynamic range | Limited (±0.5 to ±6.0) | Extended (scale exponent spans -127 to 127) |
| Precision risk | Higher quantization error; suited to simple cases | Lower error thanks to block scaling, though the shared per-block scale can cause local error |
| Conversion rules | Supports clamping and roundTiesToEven | Inherits FP4 rules; behavior beyond the Float32 range is implementation-defined |
| NaN support | No dedicated encoding | Scale supports NaN; the whole block can be marked NaN |

Applications and advantages

  • FP4: fine for basic low-precision compute, but the small range and lack of hardware scaling cause larger accuracy loss, which can be a problem for large models. Its advantage is simplicity, though hardware vendors such as NVIDIA note it is less accurate than formats with scaling.
  • MXFP4: designed for AI quantization (e.g., LLM training/inference). Block scaling enables efficient matrix operations (e.g., GEMM) with lower memory use and higher speed. Advantages include hardware support and near-lossless accuracy, and it has been adopted by OpenAI GPT-OSS, llama.cpp, and other tools. In real deployments MXFP4 is more efficient than FP4, although its relatively large block size (32) is coarser than finer-grained formats such as NVFP4 (block size 16).

Overall, MXFP4 is an "upgraded" FP4: the microscaling mechanism resolves FP4's dynamic-range limitation and makes it better suited to modern AI workloads. For a finer-grained comparison, see the OCP MX specification or NVIDIA's low-precision format documentation.

@weedge
Collaborator Author

weedge commented Aug 11, 2025

Differences between MXFP4 and NVFP4

MXFP4 (Microscaling FP4) and NVFP4 (NVIDIA FP4) are both low-precision 4-bit floating-point formats for AI model quantization (training and inference), aimed at cutting memory footprint and compute cost while preserving accuracy. Both rely on a microscaling mechanism with a block floating-point structure in which the elements of a block share a scale factor. MXFP4 is an open standard from the Open Compute Project (OCP), while NVFP4 is NVIDIA's proprietary format, optimized mainly for the Blackwell architecture. NVFP4 can be viewed as a variant of MXFP4 with a finer-grained design for better accuracy.

Basic structure

Both use FP4 (E2M1) as the element data type: 4 bits per element, split into 1 sign bit, 2 exponent bits, and 1 mantissa bit. The exponent bias is 1, giving a value range of roughly ±0.5 to ±6.0 (normals plus subnormals), with no Inf or NaN encodings.

NVFP4, however, refines the scaling mechanism with two levels of scaling (per-block and per-tensor), whereas MXFP4 uses a single block-level scale.

Key differences

  • Block size: the most visible difference. MXFP4 shares one scale factor across 32 elements, while NVFP4 shrinks the block to 16 elements. The finer-grained scale adapts better to local dynamic range and reduces quantization error, at the cost of some extra compute overhead.
  • Scale factor
    • MXFP4: uses E8M0 (8 bits, unsigned, exponent-only, bias 127, range -127 to 127, one NaN encoding). Scales are restricted to powers of two, which makes quantization coarser (MSE about 0.72).
    • NVFP4: uses E4M3 (an FP8 format with 4 exponent bits and 3 mantissa bits, adding fractional precision and balancing range against precision). On top of that, a per-tensor FP32 (E8M23) global scale normalizes the whole tensor. This lowers quantization error (MSE about 0.08) and improves accuracy.
  • Value representation
    • MXFP4: values within a block are computed as $v_i = X \times P_i$, where $X$ is the E8M0 scale factor and $P_i$ is the FP4 element. If $X = \text{NaN}$, the whole block is NaN.
    • NVFP4: values are computed as $x = x_q \times s$, where $s$ is the E4M3 scale factor (chosen dynamically to minimize per-block error), combined with the per-tensor FP32 scale.
  • Hardware acceleration and compatibility: both support hardware-accelerated scaling (unlike plain FP4). NVFP4 is built for NVIDIA Blackwell and offers higher resolution and accuracy, while MXFP4 is an open standard and easier to support across hardware.

Comparison table

The table below summarizes the key differences:

| Aspect | MXFP4 (OCP standard) | NVFP4 (NVIDIA proprietary) |
| --- | --- | --- |
| Block size | 32 elements | 16 elements (finer-grained) |
| Scale factor | E8M0 (8 bits, power-of-two, no fractional precision) | E4M3 (FP8, 4 exponent + 3 mantissa bits, fractional precision) + per-tensor FP32 |
| Scaling | Single-level block scaling | Two-level scaling (per-block + per-tensor) |
| Quantization error (MSE) | Higher (~0.72) | Lower (~0.08) |
| Accuracy | Higher risk of accuracy loss (vs. FP8) | Better, close to FP8 (<1% loss) |
| Compute cost | Lower | Higher (but more accurate) |
| Memory efficiency | ~4.25 bits/value (32×4 + 8-bit scale) | ~4.5 bits/value (16×4 + 8-bit scale + per-tensor FP32), about 3.5x smaller than FP16 |
| Main advantage | Lower cost; good for compute-heavy workloads | Higher accuracy; good for large-model inference and training |
| Hardware support | Open standard, cross-platform | Optimized for NVIDIA Blackwell |
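
The memory-efficiency row can be sanity-checked directly (scale bits amortized over the block; the per-tensor FP32 scale is negligible per element):

```python
mxfp4 = (32 * 4 + 8) / 32     # 4.25 bits per value
nvfp4 = (16 * 4 + 8) / 16     # 4.5 bits per value
print(mxfp4, nvfp4, 16 / nvfp4)   # 4.25 4.5 ~3.5x smaller than FP16
```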

Applications and advantages

  • Accuracy trade-off: NVFP4's smaller blocks and fractional scales significantly reduce quantization error. On models such as DeepSeek-R1-0528, experiments show <1% accuracy loss relative to FP8 (e.g., MMLU-PRO: 85% vs. 84%) and even a ~2% gain on some tasks. MXFP4 is efficient but carries a higher accuracy risk, especially for large models.
  • Training and inference: NVFP4 supports full FP4 training (e.g., a Llama2 7B-parameter model) with a training loss close to BF16, and downstream task performance is comparable after quantization-aware fine-tuning (QAF). MXFP4 is better suited to partial quantization; experiments show the 32-element block size is slightly less stable.
  • Performance: in inference, NVFP4 reduces memory by about 1.8x relative to FP8, with energy efficiency reaching up to 50x on Blackwell Ultra. MXFP4 emphasizes cost reduction, while NVFP4 strikes a better balance between accuracy and efficiency in practice.

Overall, NVFP4 is a refinement of MXFP4 that prioritizes accuracy over minimum cost, making it a good fit for high-accuracy workloads in the NVIDIA ecosystem. For deployment on specific hardware, consult the NVIDIA documentation or the OCP specification for further evaluation.

@weedge
Collaborator Author

weedge commented Aug 11, 2025

Open Compute Project (OCP) MXFP4:

