Parallax: Efficient Distributed Large Language Model
Inference
Chris Tong Gufeng Chen Tianyi Zhao Xialie Zhuang
Sibian Lu Rymon Yu Eric Yang Lynn Ai
Gradient
Abstract
The exponential growth in large language model (LLM) parameters has created
significant barriers to accessible inference, with state-of-the-art models requiring
expensive centralized GPU clusters. We present PARALLAX, a comprehensive
distributed inference framework that enables efficient execution of large language
models across heterogeneous, decentralized compute resources, spanning from
data-center GPUs to consumer-grade devices like Apple Silicon Macs.
The core contribution is a novel distributed inference algorithm centered on pipeline
parallelism, designed with optimized communication patterns to minimize overhead
across distributed devices.
Through extensive evaluation on the Qwen2.5-72B-Instruct model with GPTQ-Int4
quantization, the results demonstrate that PARALLAX achieves significant perfor-
mance improvements over existing distributed inference systems: 3.1× reduction
in end-to-end latency, 5.3× improvement in inter-token latency, and 3.1× higher
throughput compared to state-of-the-art baselines. The framework successfully
enables accessible, high-performance LLM inference across distributed clusters of
heterogeneous devices, including both GPU nodes and consumer Macs.
1 Introduction
Large Language Models (LLMs) have fundamentally transformed artificial intelligence capabilities
across diverse domains [1, 2, 3, 4], demonstrating unprecedented performance in natural language
understanding, generation, and reasoning tasks. However, their massive parameter counts—frequently
exceeding 100 billion parameters—impose significant computational requirements that necessitate
clusters of high-end GPUs for efficient inference. This substantial hardware barrier creates ac-
cessibility challenges for researchers and developers, leaving vast computational resources—from
geographically distributed research GPUs to the powerful unified memory architecture of modern
consumer devices like Apple Silicon Macs—largely untapped for collaborative inference.
Current approaches to address these limitations suffer from critical trade-offs. Traditional single-node
optimizations, while improving local efficiency, cannot overcome fundamental hardware constraints
when model requirements exceed available resources. Conversely, cloud-based API services, though
accessible, restrict model introspection, customization, and fine-grained control over the inference
process—capabilities essential for research and specialized applications.
1.1 Motivation and Challenges
The increasing scale of large language models presents significant challenges for inference deploy-
ment. Modern LLMs such as Llama2-70B and GPT-4 require substantial computational resources that
often exceed the capacity of individual devices. Traditional approaches rely on expensive centralized
GPU clusters, limiting accessibility and creating bottlenecks in serving diverse user populations.
Three critical challenges motivate this work: resource accessibility (most researchers and developers
cannot afford large GPU clusters), untapped consumer hardware (powerful devices like Macs
are widespread but unutilized for large-scale inference), and computational efficiency (existing
distributed inference systems suffer from communication overhead and poor scaling).
1.2 Contributions
This work presents PARALLAX, a comprehensive distributed LLM inference framework that addresses
these challenges through novel algorithms and system design. The key contributions include:
1. P2P-Based Pipeline Parallelism: A novel distributed inference algorithm where pipeline
stages are mapped directly to nodes in a peer-to-peer network. This enables large models to
be partitioned across geographically distributed consumer devices, like Macs, communicat-
ing intermediate hidden states directly without a central coordinator.
2. Orchestration of Heterogeneous Hardware: The first framework to successfully orches-
trate a network of heterogeneous devices—spanning from distributed GPUs to consumer-
grade Apple Silicon Macs—into a cohesive, decentralized cluster for large-scale LLM
inference. It leverages SGLang with CUDA for GPU execution and the MLX framework for
Apple Silicon.
3. Comprehensive Performance Evaluation: Substantial performance improvements over
state-of-the-art baselines, achieving 3.1× lower latency and 5.3× better inter-token latency
compared to existing distributed inference systems on large-scale models.
2 Related Work
Our work builds on significant advancements in LLM serving, distributed systems, and parallel
computing. We position PARALLAX by analyzing state-of-the-art inference engines and distributed
frameworks, highlighting the architectural gaps that motivate our design for heterogeneous, geo-
distributed inference.
2.1 High-Performance Inference Engines
The efficiency of modern LLM serving is largely defined by systems optimized for single-node or
tightly-coupled datacenter environments. Frameworks like vLLM [5] introduced key optimizations
such as PAGEDATTENTION, which resolves KV-cache memory fragmentation by managing the cache
in fixed-size, non-contiguous blocks, and continuous batching, which dynamically admits and retires
requests to maximize GPU utilization; kernel-level work such as FLASHATTENTION [6] further
accelerates the attention computation itself. For distributed deployments within datacenters, these
systems rely on high-bandwidth, low-latency interconnects (e.g., NVLink, InfiniBand) and collective
communication libraries like NCCL [7] to achieve efficient tensor parallelism. Similarly, SGLANG
provides a flexible front-end language for complex generation tasks, backed by a highly optimized
GPU runtime with efficient CUDA kernels. These systems set a high bar for performance but are
fundamentally designed for centralized deployments with high-speed interconnects.
2.2 Parallelism Strategies for Distributed Inference
The choice of parallelism strategy is critical for distributed systems. Tensor parallelism [8, 9],
which partitions individual operations across devices, is highly effective but requires frequent, high-
bandwidth communication, making it suitable only for tightly-coupled GPUs in a datacenter.
In contrast, pipeline parallelism [10], which partitions model layers across different nodes, forms
the natural foundation for geographically distributed inference. Each node in the pipeline executes
a larger, more independent chunk of computation, and the activations exchanged between stages are
both less frequent and far smaller in volume than the traffic tensor parallelism requires.
PARALLAX adopts pipeline parallelism as its base strategy, enabling it to efficiently shard models
across heterogeneous, consumer-grade machines over standard internet connections.
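To make the contrast concrete, consider a rough back-of-the-envelope estimate (our notation, not the paper's): let h be the hidden size, s the token positions processed per step, b the bytes per activation element, L the number of transformer layers, n the tensor-parallel degree, and S the number of pipeline stages. Megatron-style tensor parallelism performs roughly two all-reduces per layer on the activation tensor, so each device moves on the order of

    V_TP ≈ 2L · (2(n − 1)/n) · s·h·b

bytes per step, with every all-reduce adding a synchronization round trip. A pipeline instead sends one activation tensor across each stage boundary, for a total of

    V_PP ≈ (S − 1) · s·h·b

bytes per step, independent of L, which is why stage-to-stage hand-offs tolerate commodity internet links far better than tensor-parallel collectives do.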
Figure 1: Overview of the PARALLAX infrastructure, showing the layered design from the request
interface through scheduling and the executor down to the model runner. The model runner is
hardware-specific and supports both GPU workers (via PyTorch/CUDA) and Apple Silicon workers
(via MLX/Metal kernels).
2.3 Decentralized Inference Systems
On the decentralized end of the spectrum, PETALS [11] pioneers collaborative inference across
the internet using the Hivemind library [12] for P2P communication. While innovative, its design
prioritizes model sharing over real-time performance, leading to several critical limitations. First, it
suffers from poor GPU utilization as it lacks optimized CUDA kernels for core operations. Second,
its architecture places a significant burden on the client, which is responsible for tokenization and
processing embeddings, including the computationally heavy lm_head. This creates a bottleneck
and wastes network bandwidth. Third, PETALS employs naive scheduling heuristics without fine-grained
request routing and lacks essential server-side optimizations such as PAGEDATTENTION and continuous
batching, making it unsuitable for high-throughput, low-latency interactive applications.
PARALLAX addresses these gaps by combining a high-performance, server-side execution core
inspired by vLLM and SGLang with a decentralized architecture that offloads all heavy computation
from the client.
3 Distributed Inference Infrastructure
PARALLAX’s architecture is built on a P2P foundation that enables decentralized execution across
heterogeneous devices. The system employs a layered architecture that separates hardware-agnostic
orchestration from hardware-specific execution, enabling seamless operation across both GPU clusters
and Apple Silicon Macs.
As shown in Figure 1, the PARALLAX infrastructure consists of two main layers: (1) Scheduling for
distributed model allocation and request routing across devices, and (2) Execution for per-device
orchestration, runtime management, and hardware-specific inference.
3.1 Scheduling: Model Sharding Allocator + Request Router (Across Devices,
Hardware-Agnostic)
The top layer implements a hardware-agnostic scheduling system that manages model sharding allo-
cation and request routing across the distributed swarm. This layer employs a two-phase scheduling
approach:
Phase 1 - Layer Allocation: Uses a greedy heuristic to partition model layers across the available
devices. The allocation weighs device capabilities, memory constraints, and network topology to
minimize communication overhead while maximizing resource utilization.
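The paper does not give the allocator's exact scoring rule, but the following sketch illustrates the general shape of such a greedy heuristic: devices are ranked by a capability score and handed contiguous layer ranges in decreasing order of capability, capped by each device's memory. The Device fields, the score function, and the example numbers are assumptions for illustration, not PARALLAX's implementation.

from dataclasses import dataclass

@dataclass
class Device:
    name: str           # hypothetical fields, for illustration only
    free_mem_gb: float  # memory available for weights and KV cache
    tflops: float       # rough compute capability

def greedy_layer_allocation(devices, num_layers, layer_mem_gb):
    """Assign contiguous layer ranges to devices, proportional to capability."""
    def score(d):
        # Capability = compute, weighted by how many layers fit in memory.
        return d.tflops * min(d.free_mem_gb // layer_mem_gb, num_layers)

    ranked = sorted(devices, key=score, reverse=True)
    total = sum(score(d) for d in ranked) or 1.0

    allocation, start = [], 0
    for i, d in enumerate(ranked):
        remaining = num_layers - start
        if remaining <= 0:
            break
        fit = int(d.free_mem_gb // layer_mem_gb)  # hard memory cap
        if fit == 0:
            continue
        # The last device takes whatever is left; the others take a share
        # proportional to their score, capped by what fits in their memory.
        want = remaining if i == len(ranked) - 1 else round(num_layers * score(d) / total)
        take = max(1, min(want, fit, remaining))
        allocation.append((d.name, start, start + take))
        start += take

    if start < num_layers:
        raise RuntimeError("not enough aggregate memory for all layers")
    return allocation

# Example: one RTX 5090-class GPU and one Mac sharing an 80-layer model.
devices = [Device("rtx5090", 32.0, 200.0), Device("mac_m4_pro", 48.0, 30.0)]
print(greedy_layer_allocation(devices, num_layers=80, layer_mem_gb=0.5))
# -> [('rtx5090', 0, 64), ('mac_m4_pro', 64, 80)]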
Phase 2 - Request Routing: Implements dynamic programming-based request routing that efficiently
distributes incoming requests across different pipeline replicas. The router maintains real-time load
balancing based on running batch sizes and KV pool status for overall system efficiency, adapting to
changing network conditions and device availability.
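The dynamic program itself is not spelled out in the paper; one natural formulation, sketched below with hypothetical cost estimates, treats routing as a shortest-path problem over pipeline stages, where each stage offers several candidate replicas and a node's cost grows with its running batch size and KV-pool occupancy.

def route_request(stages, link_cost, node_cost):
    """Pick one node per pipeline stage minimizing total estimated cost.

    stages:    list of lists; stages[s] holds candidate node ids for stage s.
    link_cost: dict (u, v) -> estimated ms to ship hidden states from u to v.
    node_cost: dict v -> estimated ms per step on v (e.g. derived from its
               running batch size and KV-pool occupancy).
    """
    # best[s][v]: cheapest cost of reaching node v at stage s.
    best = [{v: node_cost[v] for v in stages[0]}]
    back = [dict.fromkeys(stages[0])]

    for s in range(1, len(stages)):
        cur, prev = {}, {}
        for v in stages[s]:
            choices = [(best[s - 1][u] + link_cost[(u, v)] + node_cost[v], u)
                       for u in stages[s - 1]]
            cost, u = min(choices, key=lambda c: c[0])
            cur[v], prev[v] = cost, u
        best.append(cur)
        back.append(prev)

    # Walk back-pointers from the cheapest final-stage node.
    v = min(best[-1], key=best[-1].get)
    path = [v]
    for s in range(len(stages) - 1, 0, -1):
        v = back[s][v]
        path.append(v)
    return list(reversed(path))

Re-running this plan as fresh load reports arrive lets the router adapt to changing batch sizes and KV-pool pressure, as described above.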
This scheduling layer is completely hardware-agnostic, enabling it to orchestrate both GPU clusters
and Apple Silicon Macs seamlessly within the same distributed inference pipeline.
3.2 Executor (Per-Device)
3.2.1 Orchestrator (Hardware-Agnostic)
The Orchestrator serves as the hardware-agnostic wrapper for all per-device operations, managing
the complete lifecycle of inference requests on each node. Its responsibilities include:
Model Sharding and Loading: Each rank hosts a specific range of model layers based on the
allocation from the scheduling layer. The initial rank additionally hosts the tokenizer and embedding
layer, while the final rank hosts the language model head (lm_head). This distribution minimizes
redundant computation and optimizes memory usage across the pipeline.
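As a minimal illustration (module names follow common Hugging Face conventions and are assumptions, not PARALLAX's actual identifiers), a rank's loading decision might look like this:

def modules_for_rank(rank, world_size, layer_range):
    """Which sub-modules a pipeline rank materializes (illustrative).

    Every rank loads only its assigned transformer layers; the first rank
    additionally owns the tokenizer and embedding table, and the last rank
    owns the final norm and lm_head so logits never leave the server side.
    """
    start, end = layer_range
    modules = [f"model.layers.{i}" for i in range(start, end)]
    if rank == 0:
        modules = ["tokenizer", "model.embed_tokens"] + modules
    if rank == world_size - 1:
        modules += ["model.norm", "lm_head"]
    return modules

# A 3-rank pipeline over an 80-layer model:
print(modules_for_rank(0, 3, (0, 27)))   # tokenizer + embeddings + layers 0-26
print(modules_for_rank(2, 3, (54, 80)))  # layers 54-79 + final norm + lm_head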
Request Processing: Handles incoming requests by building hidden states and metadata from raw
request formats. The Orchestrator prepares batches by managing prefill operations, decode phases,
and eviction strategies from running batches. It implements micro-batching based on the number of
participants in the pipeline to optimize throughput.
Model Execution Coordination: Orchestrates the interaction between the runtime level components
and the hardware-specific model runner, ensuring seamless data flow through the inference pipeline.
3.2.2 Runtime (Hardware-Agnostic)
The runtime level provides hardware-agnostic abstractions for continuous batching and inter-device
communication.
Batching Scheduler:
The batching scheduler implements continuous batching with fine-grained control over prefill and
decode preferences. It dynamically manages the request pool, accepting new requests and forming
optimal batches based on:
• Micro-batching Strategy: Adapts batch sizes based on the number of participants in the
pipeline to minimize pipeline bubbles and maximize throughput.
• Prefill/Decode Optimization: Intelligently balances prefill and decode operations to opti-
mize for either latency or throughput based on system requirements.
• Dynamic Request Management: Continuously monitors request queues and adjusts batch-
ing strategies in real-time to maintain optimal performance.
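The sketch below condenses these ideas into one illustrative scheduler iteration. The request attributes, the token budget, and the rule of one micro-batch per pipeline participant are assumptions for exposition, not PARALLAX's actual policy.

def scheduler_step(pending, running, num_participants, max_batch_tokens, run_step):
    """One illustrative iteration of a continuous-batching scheduler.

    pending:          deque of waiting requests (each still needs a prefill pass)
    running:          list of requests currently in the decode phase
    num_participants: pipeline depth, used to pick the micro-batch count so
                      that every stage stays busy (fewer pipeline bubbles)
    run_step:         callable executing one pipeline step on a micro-batch
    """
    # Admit new requests while the token budget allows it; a prefill-heavy
    # preference would drain `pending` more aggressively, a decode-heavy
    # (latency-oriented) preference less so.
    budget = max_batch_tokens - sum(r.num_tokens for r in running)
    while pending and pending[0].num_tokens <= budget:
        req = pending.popleft()
        budget -= req.num_tokens
        running.append(req)

    # Split the running batch into one micro-batch per pipeline participant so
    # stage k can process micro-batch i while stage k+1 processes i-1.
    micro = max(1, num_participants)
    micro_batches = [running[i::micro] for i in range(micro)]

    for mb in filter(None, micro_batches):
        run_step(mb)  # prefill for newly admitted requests, decode otherwise

    # Retire finished requests; eviction of preempted requests would also
    # happen here when memory runs short.
    running[:] = [r for r in running if not r.finished]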
Communication Abstraction:
The communication layer provides a unified interface for inter-device communication across hetero-
geneous hardware. Built on DHT and Hivemind protocols, it handles:
• Cross-Platform Communication: Seamless data exchange between GPU clusters and
Apple Silicon Macs using protocol buffers for efficient serialization.
• Hidden State Transmission: Optimized protocols for passing hidden states and metadata
(end tokens, sequence positions) between pipeline stages.
• Network Adaptation: Dynamic adjustment of communication patterns based on network
topology and device capabilities.
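Functionally, a stage-to-stage hand-off reduces to serializing an activation tensor together with a small metadata record. The sketch below stands in for the protocol-buffer schema with NumPy and a plain dictionary, since the paper does not publish the exact message fields; the half-precision cast is an illustrative bandwidth-saving choice, not a documented PARALLAX setting.

import io
import numpy as np

def pack_hidden_states(hidden, request_ids, positions, finished):
    """Serialize one pipeline hand-off (illustrative stand-in for the
    protobuf message described in the text)."""
    buf = io.BytesIO()
    # fp16 keeps the per-token payload to hidden_size * 2 bytes, which
    # matters over commodity internet links.
    np.save(buf, hidden.astype(np.float16), allow_pickle=False)
    meta = {
        "request_ids": list(request_ids),  # which requests are in this batch
        "positions": list(positions),      # next sequence position per request
        "finished": list(finished),        # end-of-sequence flag per request
    }
    return buf.getvalue(), meta

def unpack_hidden_states(payload, meta):
    hidden = np.load(io.BytesIO(payload), allow_pickle=False).astype(np.float32)
    return hidden, meta["request_ids"], meta["positions"], meta["finished"]

# A batch of 4 requests with hidden size 8192:
h = np.random.randn(4, 8192).astype(np.float32)
blob, meta = pack_hidden_states(h, ["r1", "r2", "r3", "r4"],
                                [17, 5, 5, 903], [False] * 4)
h2, ids, pos, done = unpack_hidden_states(blob, meta)
assert h2.shape == (4, 8192)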
3.2.3 Model Runner (Hardware-Specific)
The Model Runner represents the hardware-specific execution layer, optimized for each target
platform.
KV-Cache Manager:
The KV-cache manager handles efficient memory management for attention mechanisms [13, 14],
implementing:
• Memory Optimization: Efficient allocation and deallocation of key-value cache memory
based on sequence length and batch size.
• Cache Eviction: Intelligent eviction strategies to maximize cache hit rates while managing
memory constraints.
• Platform-Specific Optimization: Tailored memory-management strategies for dedicated GPU
device memory and Apple Silicon’s unified memory architecture.
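A minimal sketch of such a manager follows, assuming a block-based pool in the spirit of PagedAttention; the block size, the data structures, and the least-recently-used eviction policy are illustrative choices rather than PARALLAX's documented behavior.

import collections

class KVCacheManager:
    """Illustrative block-based KV-cache pool.

    The cache is carved into fixed-size blocks of `block_tokens` positions;
    a sequence owns a list of block ids, so memory grows block by block
    instead of being pre-reserved for the maximum sequence length.
    """

    def __init__(self, num_blocks, block_tokens=16):
        self.block_tokens = block_tokens
        self.free = collections.deque(range(num_blocks))
        self.blocks = collections.defaultdict(list)  # seq_id -> [block ids]
        self.lru = collections.OrderedDict()         # seq_id -> None, oldest first

    def append_token(self, seq_id, seq_len):
        """Ensure `seq_id` has a block covering position `seq_len` (0-based)."""
        self.lru.pop(seq_id, None)
        self.lru[seq_id] = None                       # mark most recently used
        needed = seq_len // self.block_tokens + 1
        while len(self.blocks[seq_id]) < needed:
            while not self.free:
                self._evict(exclude=seq_id)
            self.blocks[seq_id].append(self.free.popleft())

    def _evict(self, exclude):
        # Simplistic policy: drop the least-recently-used sequence's blocks.
        # A real server would prefer finished or preemptible requests.
        for victim in self.lru:
            if victim != exclude:
                break
        else:
            raise RuntimeError("KV pool exhausted")
        self.free.extend(self.blocks.pop(victim))
        del self.lru[victim]

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.blocks.pop(seq_id, []))
        self.lru.pop(seq_id, None)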
Hardware-Specific Execution:
The Model Runner supports two execution backends:
GPU Execution (SGLang): Leverages SGLang’s optimized CUDA kernels for high-performance
inference on NVIDIA GPUs. This backend provides efficient matrix operations, optimized attention
mechanisms, and seamless integration with the distributed pipeline.
Apple Silicon Execution (MLX): Utilizes the MLX framework [15, 16], which runs on Apple’s Metal
GPU backend, for optimized inference on Apple Silicon Macs. This backend takes advantage of Apple
Silicon’s unified memory architecture for efficient model execution.
Both backends maintain identical interfaces to the runtime layer, ensuring seamless operation within
the distributed pipeline while leveraging platform-specific optimizations.
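Conceptually, the runtime layer programs against a single abstract runner; the class and method names below are illustrative, not PARALLAX's actual API, and each concrete backend would delegate to SGLang/CUDA or MLX internally.

from abc import ABC, abstractmethod

class ModelRunner(ABC):
    """Hardware-agnostic contract the runtime layer could program against."""

    @abstractmethod
    def load_shard(self, model_id: str, layer_range: tuple) -> None:
        """Load only the assigned layers (plus embeddings or lm_head on the
        first and last ranks, respectively)."""

    @abstractmethod
    def forward(self, hidden_states, batch_meta):
        """Run the local layer range; return hidden states (or logits on the
        last rank) for the communication layer to forward."""

class SGLangRunner(ModelRunner):
    """CUDA path: would wrap SGLang's optimized GPU runtime."""
    def load_shard(self, model_id, layer_range): ...
    def forward(self, hidden_states, batch_meta): ...

class MLXRunner(ModelRunner):
    """Apple Silicon path: would wrap an MLX model running on Metal."""
    def load_shard(self, model_id, layer_range): ...
    def forward(self, hidden_states, batch_meta): ...

def make_runner(hardware: str) -> ModelRunner:
    return MLXRunner() if hardware == "apple_silicon" else SGLangRunner()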
4 Experimental Evaluation
This section presents comprehensive experiments to evaluate PARALLAX performance and compare
it with baseline distributed inference systems. The evaluation focuses on latency, throughput, and
scalability using real-world workloads.
4.1 Experimental Setup
4.1.1 Hardware Configuration
The evaluation is conducted on distributed two-node networks built from consumer-grade hardware:
a pair of nodes each equipped with an NVIDIA RTX 5090 GPU, and a heterogeneous pairing of an
RTX 5090 node with an Apple Mac M4 Pro 64G. While PARALLAX also supports data-center GPU
deployments, these configurations are chosen to specifically validate its performance on consumer-grade
hardware, which represents a key and challenging use case for decentralized inference.
4.1.2 Models and Workloads
The evaluation uses two models to assess scalability across parameter counts: Qwen2.5-72B-Instruct
and the larger Qwen3-235B-A22B [17], both with GPTQ-Int4 quantization [18], under various
input/output configurations:
• Single request configurations: 1×1K, 1×4K, 1×8K, 1×16K tokens input
• Multi-request configurations: 4×1K, 8×1K tokens input
• Fixed output length: 1024 tokens for all configurations
4.1.3 Baseline Systems
The comparison baseline is Petals, a state-of-the-art decentralized collaborative inference framework
that provides distributed LLM serving capabilities similar to the proposed system.
Table 1: Performance comparison of PARALLAX vs. Petals on the Qwen2.5-72B model.

Framework                          Input Config   E2E Lat. (s)   TTFT (s)   ITL (ms)   Input TP (tok/s)   Output TP (tok/s)
PARALLAX (RTX 5090)                1×4K           46.6           5.0        40.7       87.9               22.0
PARALLAX (RTX 5090)                1×8K           52.7           9.9        41.8       155.5              19.4
PARALLAX (RTX 5090)                1×16K          64.6           20.6       43.0       255.0              15.8
PARALLAX (RTX 5090)                4×1K           46.8           3.4        42.5       87.5               87.5
PARALLAX (RTX 5090)                8×1K           62.4           7.9        53.3       131.3              131.3
PARALLAX (RTX 5090 + Mac M4 Pro)   1×1K           175.2          14.4       157.2      5.8                5.8
PARALLAX (RTX 5090 + Mac M4 Pro)   1×4K           242.4          64.9       173.6      16.9               4.2
PARALLAX (RTX 5090 + Mac M4 Pro)   4×1K           544.5          65.1       468.7      7.5                7.5
Petals                             1×4K           143.5          14.4       216.5      28.6               7.1
Table 2: Performance evaluation of PARALLAX on the Qwen3-235B model.

Framework                          Input Config   E2E Lat. (s)   TTFT (s)   ITL (ms)   Input TP (tok/s)   Output TP (tok/s)
PARALLAX (2×RTX 5090)              1×1K           65.5           2.9        61.2       15.6               15.6
PARALLAX (2×RTX 5090)              1×4K           75.1           13.4       60.3       54.5               13.6
PARALLAX (2×RTX 5090)              4×1K           99.3           8.7        88.6       41.2               41.2
PARALLAX (RTX 5090 + Mac M4 Pro)   1×1K           104.9          8.1        94.6       9.8                9.8
PARALLAX (RTX 5090 + Mac M4 Pro)   1×4K           150.0          34.2       113.2      27.3               6.8
PARALLAX (RTX 5090 + Mac M4 Pro)   4×1K           320.4          30.2       283.6      12.8               12.8
4.2 Performance Evaluation
4.2.1 Latency and Throughput Analysis
Table 1 presents detailed performance metrics comparing PARALLAX with the Petals baseline across
different input configurations. All experiments use the Qwen2.5-72B-Instruct-GPTQ-Int4 model with
1024 output tokens, testing both the RTX 5090 GPU cluster and the heterogeneous RTX 5090 + Mac M4
Pro 64G distributed inference configuration.
Table 2 presents performance results for the larger Qwen3-235B-A22B-GPTQ-Int4 model, demon-
strating PARALLAX’s capability to scale to larger model sizes. All experiments use 1024 output
tokens and test scaling performance across both dual RTX 5090 GPU and heterogeneous RTX 5090 +
Mac M4 Pro 64G setups.
Key Findings:
• 72B Model Performance: PARALLAX achieves 3.1× lower end-to-end latency compared to
Petals (46.6s vs 143.5s for 1×4K configuration), with 5.3× better inter-token latency (40.7ms
vs 216.5ms)
• 235B Model Scaling: Successfully demonstrates scalability to larger models, with dual RTX
5090 achieving 75.1s end-to-end latency for 1×4K input on the 235B model, maintaining
consistent inter-token latency (60.3ms)
• Heterogeneous Hardware Performance: Both models show effective cross-platform
execution, with the 235B model achieving 150.0s end-to-end latency on heterogeneous RTX
5090 + Mac M4 Pro 64G setup
• Multi-Request Handling: Demonstrates strong concurrent processing capabilities, with
4×1K requests achieving 99.3s total latency on dual GPUs for the 235B model
• Hardware Utilization: Results validate PARALLAX’s ability to effectively utilize both
homogeneous GPU clusters and heterogeneous consumer hardware for large-scale LLM
inference across different model sizes
4.3 Scalability Analysis
The evaluation demonstrates that PARALLAX maintains consistent performance across different batch
sizes, input lengths, and model sizes. The system shows excellent scalability characteristics:
Model Size Scaling: PARALLAX successfully scales from 72B to 235B parameters, demonstrat-
ing its capability to handle increasingly large models while maintaining reasonable performance
characteristics.
Input Length Scaling: For the 72B model, performance remains stable as input length increases
from 4K to 16K tokens, with inter-token latency staying within a narrow range (40.7-53.3ms). The
235B model shows similar consistency with inter-token latency of 60.3-61.2ms across different input
configurations.
Concurrent Processing: Multi-request scenarios demonstrate effective resource utilization, with the
235B model achieving 99.3s total latency for 4×1K concurrent requests on dual RTX 5090 setup.
Hardware Heterogeneity: The system maintains performance across heterogeneous hardware
configurations, successfully orchestrating both GPU clusters and mixed GPU+Mac setups for models
of different sizes.
The experimental results demonstrate that PARALLAX successfully addresses distributed LLM
inference challenges across multiple dimensions of scale, achieving superior performance compared
to existing frameworks while maintaining flexibility in hardware deployment.
5 Conclusion
This paper presents PARALLAX, a distributed LLM inference framework that harnesses the untapped
potential of consumer hardware for large-scale AI. By implementing a novel P2P-based pipeline-
parallelism strategy, PARALLAX orchestrates a network of heterogeneous consumer devices, including
NVIDIA GPUs and Apple Silicon Macs, into a decentralized inference cluster. The experimental results demonstrate
3.1× lower end-to-end latency, 5.3× better inter-token latency, and 3.1× higher throughput compared
to existing decentralized systems.
The key contributions include: (1) a P2P architecture that maps pipeline stages to individual network
nodes, enabling direct hidden-state exchange; and (2) the first successful demonstration of large-scale
LLM inference on a heterogeneous cluster that includes consumer-grade Apple Silicon Macs, leveraging
MLX for on-device performance.
PARALLAX marks a significant step towards democratizing access to large language models, proving
that accessible, high-performance LLM inference is achievable beyond centralized data centers and
on the hardware people already own.
References
[1] BigScience Workshop et al. “BLOOM: A 176B-parameter open-access multilingual language
model”. In: arXiv preprint arXiv:2211.05100 (2022).
[2] Susan Zhang et al. “OPT: Open pre-trained transformer language models”. In: arXiv preprint
arXiv:2205.01068 (2022).
[3] Hugo Touvron et al. “Llama 2: Open foundation and fine-tuned chat models”. In: arXiv preprint
arXiv:2307.09288 (2023).
[4] Josh Achiam et al. “GPT-4 technical report”. In: arXiv preprint arXiv:2303.08774 (2023).
[5] Woosuk Kwon et al. “Efficient memory management for large language model serving with
PagedAttention”. In: Proceedings of the 29th Symposium on Operating Systems Principles. 2023,
pp. 611–626.
[6] Tri Dao et al. “FlashAttention: Fast and memory-efficient exact attention with IO-awareness”.
In: Advances in Neural Information Processing Systems 35 (2022), pp. 16344–16359.
[7] NVIDIA. NCCL: NVIDIA Collective Communications Library. https://developer.nvidia.com/nccl.
NVIDIA Developer Documentation. 2023.
[8] Mohammad Shoeybi et al. “Megatron-LM: Training multi-billion parameter language models
using model parallelism”. In: arXiv preprint arXiv:1909.08053 (2019).
[9] Deepak Narayanan et al. “Efficient Large-Scale Language Model Training on GPU Clusters
Using Megatron-LM”. In: arXiv preprint arXiv:2104.04473 (2021).
[10] Yanping Huang et al. “GPipe: Efficient Training of Giant Neural Networks using Pipeline
Parallelism”. In: Advances in Neural Information Processing Systems. Vol. 32. 2019.
[11] Alexander Borzunov et al. “Petals: Collaborative inference and fine-tuning of large models”.
In: arXiv preprint arXiv:2209.01188 (2022).
[12] Max Ryabinin et al. Hivemind: Decentralized Deep Learning in PyTorch. Online, Apr. 2020.
URL: https://github.com/learning-at-home/hivemind.
[13] Benjamin Lefaudeux et al. xFormers: A modular and hackable Transformer modelling library.
https://github.com/facebookresearch/xformers. 2022.
[14] Woosuk Kwon et al. “Efficient Memory Management for Large Language Model Serving with
PagedAttention”. In: arXiv preprint arXiv:2309.06180 (2023).
[15] Awni Hannun et al. MLX: An array framework for machine learning on Apple silicon.
https://github.com/ml-explore/mlx. Apple Machine Learning Research. 2023.
[16] Apple. MLX: An array framework for machine learning on Apple silicon.
https://github.com/ml-explore/mlx. Apple Machine Learning Research, Updated version. 2024.
[17] Jinze Bai et al. “Qwen Technical Report”. In: arXiv preprint arXiv:2309.16609 (2023).
[18] Elias Frantar et al. “GPTQ: Accurate post-training quantization for generative pre-trained
transformers”. In: arXiv preprint arXiv:2210.17323 (2022).