Monarch + Lightning AI: Unlocking New Possibilities in Distributed Training (Blog)
Introduction: Empowering the Next Generation of AI Builders. We are excited to announce a partnership…
PyTorch Team at Meta: Alireza Shamsoshoara, Lucas Pasqualin, Peng Zhang, Hamid Shojanazeri, Ahmad Sharif, Kiuk Chung; Lightning AI: Luca Antiga | October 22, 2025
torchcomms: a modern PyTorch communications API (Blog)
torchcomms is a new experimental, lightweight communication API intended for use with PyTorch Distributed…
Team torchcomms at Meta | October 22, 2025
Helion: A High-Level DSL for Performant and Portable ML Kernels (Blog)
In modern machine learning, the demand for high-performance computation has led to…
PyTorch Team at Meta | October 22, 2025
Introducing ExecuTorch 1.0: Powering the next generation of edge AI (Blog)
TL;DR: ExecuTorch enables seamless, production-ready deployment of PyTorch models directly to edge devices (mobile, embedded,…
PyTorch Team at Meta | October 22, 2025
Introducing PyTorch Monarch (Blog)
We now live in a world where ML workflows (pre-training, post-training, etc.) are heterogeneous,…
The PyTorch Team at Meta | October 22, 2025
Introducing torchforge – a PyTorch native library for scalable RL post-training and agentic development (Blog)
In this post, we announce torchforge: a PyTorch-native agentic RL library that lets you focus…
The PyTorch Team at Meta | October 22, 2025
Enabling vLLM V1 on AMD GPUs With Triton (Blog)
What is vLLM V1? In January 2025, the vLLM team announced the alpha release of…
vLLM Team at IBM Research, vLLM Team at Red Hat, and vLLM Team at AMD | October 21, 2025
PyTorch 2.9 Release Blog (Blog)
We are excited to announce the release of PyTorch® 2.9 (release notes)! This release features:…
PyTorch Foundation | October 15, 2025
SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips (Blog)
TL;DR: Efficient full-parameter fine-tuning of GPT-OSS-20B & Qwen3-14B models on a single NVIDIA GH200 and…
Xinyu Lian, Minjia Zhang (SSAIL Lab, University of Illinois Urbana-Champaign), Masahiro Tanaka (Anyscale), Olatunji Ruwase (Snowflake) | October 9, 2025
When Quantization Isn’t Enough: Why 2:4 Sparsity Matters (Blog, Community)
TL;DR: Combining 2:4 sparsity with quantization offers a powerful approach to compress large language models…
Mohammad Mozaffari, Jesse Cai, Supriya Rao | October 6, 2025
TorchAO Quantized Models and Quantization Recipes Now Available on HuggingFace Hub (Blog)
PyTorch now offers native quantized variants of Phi4-mini-instruct, Qwen3, SmolLM3-3B and gemma-3-270m-it through a collaboration…
Meta: Jerry Zhang, Scott Roy, Mergen Nachin, Kimish Patel, Supriya Rao, Jack Zhang, Guang Yang; Unsloth AI: Daniel Han | September 19, 2025
Experience in Reducing PT2 Compilation Time for Meta Internal Workloads (Blog)
The Challenge of PyTorch 2.0 Compilation. Since the release of PyTorch 2.0 (PT2) and its…
Mingming Ding, James Wu, Oguz Ulgen, Sam Larsen, Bob Ren, Laith Sakka, Pian Pawakapan, Animesh Jain, Edward Yang, Yuzhen Huang, Ruilin Chen, Daohang Shi, Shuai Yang, Menglu Yu, Chunzhi Yang, Jade Nie | September 18, 2025
High-performance quantized LLM inference on Intel CPUs with native PyTorch (Blog)
PyTorch 2.8 has just been released with a set of exciting new features, including a…
Intel PyTorch Team | September 17, 2025
PyTorch 2.8 Brings Native XCCL Support to Intel GPUs: Case Studies from Argonne National Laboratory (Blog)
Intel announces a major enhancement for distributed training in PyTorch 2.8: the native integration of…
Intel PyTorch Team, Argonne National Laboratory | September 12, 2025
Disaggregated Inference at Scale with PyTorch & vLLM (Blog, Community)
Key takeaways: PyTorch and vLLM have been organically integrated to accelerate cutting-edge generative AI applications,…
Hongyi Jia, Jinghui Zhang, Lu Fang, Stephen Chen, Yan Cui, Ye (Charlotte) Qi, Zijing Liu | September 12, 2025
Distributed Checkpoint: Efficient checkpointing in large-scale jobs (Blog)
As training jobs become larger, the likelihood of failures such as preemptions, crashes, or infrastructure…
Meta: Saurabh Mishra, Meet Vadakkanchery, Pradeep Fernando, Saiteja Samudrala; Google: Gerson Kroiz, Jingxin Ye, Viacheslav Kovalevskyi | September 11, 2025
Yellow Teaming on Arm: A look inside our responsible AI workshop (Blog, Community)
A few months back, I traveled to Berlin to attend the WeAreDevelopers World Congress. During…
Annie Tallund | September 5, 2025
Fast 2-Simplicial Attention: Hardware-Efficient Kernels in TLX (Blog)
In this blog post, we explore the kernel design details presented in the paper Fast…
Sijia Chen, Timothy Chou, Aurko Roy†, Hongtao Yu, Yuanwei (Kevin) Fang, Xiaodong Wang, Jiecao Yu, Tony CW Liu†, Chuanhao Zhuge, Josh Fromm, Ying Zhang†, Rohan Anil†, Ajit Mathews | September 5, 2025
PyTorch 2.8+TorchAO: Unlock Efficient LLM Inference on Intel® AI PCs (Blog)
Large Language Models (LLMs) have transformed tasks across numerous industries, including drafting emails, generating code,…
Intel PyTorch Team | September 3, 2025
Accelerating 2K scale pre-training up to 1.28x with TorchAO, MXFP8 and TorchTitan on Crusoe B200 Cluster (Blog)
TL;DR: 1.22x–1.28x training acceleration with MXFP8, with equivalent convergence compared to BF16. We recently…
Less Wright, Vasiliy Kuznetsov, Daniel Vega-Myhre, Driss Guessous, Hamid Shojanazeri, Elias Ellison, Martin Cala, Ethan Petersen | September 3, 2025