Parallax: Efficient Distributed Large Language Model
Inference
Chris Tong Gufeng Chen Tianyi Zhao Xialie Zhuang
Sibian Lu Rymon Yu Eric Yang Lynn Ai
Gradient
Abstract
The exponential growth in large language model (LLM) parameters has created
significant barriers to accessible inference, with state-of-the-art models requiring
expensive centralized GPU clusters. We present PARALLAX, a comprehensive
distributed inference framework that enables efficient execution of large language
models across heterogeneous, decentralized compute resources, spanning from
data-center GPUs to consumer-grade devices like Apple Silicon Macs.
The core contribution is a novel distributed inference algorithm centered on pipeline
parallelism, designed with optimized communication patterns to minimize overhead
across distributed devices.
Through extensive evaluation on the Qwen2.5-72B-Instruct model with GPTQ-Int4
quantization, the results demonstrate that PARALLAX achieves significant perfor-
mance improvements over existing distributed inference systems: 3.1× reduction
in end-to-end latency, 5.3× improvement in inter-token latency, and 3.1× higher
throughput compared to state-of-the-art baselines. The framework successfully
enables accessible, high-performance LLM inference across distributed clusters of
heterogeneous devices, including both GPU nodes and consumer Macs.
1 Introduction
Large Language Models (LLMs) have fundamentally transformed artificial intelligence capabilities
across diverse domains [1, 2, 3, 4], demonstrating unprecedented performance in natural language
understanding, generation, and reasoning tasks. However, their massive parameter counts—frequently
exceeding 100 billion parameters—impose significant computational requirements that necessitate
clusters of high-end GPUs for efficient inference. This substantial hardware barrier creates ac-
cessibility challenges for researchers and developers, leaving vast computational resources—from
geographically distributed research GPUs to the powerful unified memory architecture of modern
consumer devices like Apple Silicon Macs—largely untapped for collaborative inference.
Current approaches to address these limitations suffer from critical trade-offs. Traditional single-node
optimizations, while improving local efficiency, cannot overcome fundamental hardware constraints
when model requirements exceed available resources. Conversely, cloud-based API services, though
accessible, restrict model introspection, customization, and fine-grained control over the inference
process—capabilities essential for research and specialized applications.
1.1 Motivation and Challenges
The increasing scale of large language models presents significant challenges for inference deploy-
ment. Modern LLMs such as Llama2-70B and GPT-4 require substantial computational resources that
often exceed the capacity of individual devices. Traditional approaches rely on expensive centralized
GPU clusters, limiting accessibility and creating bottlenecks in serving diverse user populations.
Three critical challenges motivate this work: resource accessibility (most researchers and developers
cannot afford large GPU clusters), untapped consumer hardware (powerful devices like Macs
are widespread but unutilized for large-scale inference), and computational efficiency (existing
distributed inference systems suffer from communication overhead and poor scaling).
1.2 Contributions
This work presents PARALLAX, a comprehensive distributed LLM inference framework that addresses
these challenges through novel algorithms and system design. The key contributions include:
1. P2P-Based Pipeline Parallelism: A novel distributed inference algorithm where pipeline
stages are mapped directly to nodes in a peer-to-peer network. This enables large models to
be partitioned across geographically distributed consumer devices, like Macs, communicat-
ing intermediate hidden states directly without a central coordinator.
2. Orchestration of Heterogeneous Hardware: The first framework to successfully orches-
trate a network of heterogeneous devices—spanning from distributed GPUs to consumer-
grade Apple Silicon Macs—into a cohesive, decentralized cluster for large-scale LLM
inference. It leverages SGLang with CUDA for GPU execution and the MLX framework for
Apple Silicon.
3. Comprehensive Performance Evaluation: Substantial performance improvements over
state-of-the-art baselines, achieving 3.1× lower latency and 5.3× better inter-token latency
compared to existing distributed inference systems on large-scale models.
2 Related Work
Our work builds on significant advancements in LLM serving, distributed systems, and parallel
computing. We position PARALLAX by analyzing state-of-the-art inference engines and distributed
frameworks, highlighting the architectural gaps that motivate our design for heterogeneous, geo-
distributed inference.
2.1 High-Performance Inference Engines
The efficiency of modern LLM serving is largely defined by systems optimized for single-node or
tightly-coupled datacenter environments. Frameworks like vLLM [5] introduced key optimizations
such as PAGEDATTENTION, which resolves KV-cache memory fragmentation by managing the cache
in fixed-size, non-contiguous blocks, and continuous batching, which dynamically admits and retires
requests to maximize GPU utilization; kernel-level work such as FLASHATTENTION [6] further
accelerates the attention computation itself. For distributed deployments within datacenters, these
systems rely on high-bandwidth, low-latency interconnects (e.g., NVLink, InfiniBand) and collective
communication libraries like NCCL [7] to achieve efficient tensor parallelism. Similarly, SGLANG
provides a flexible front-end language for complex generation tasks, backed by a highly optimized
GPU runtime with efficient CUDA kernels. These systems set a high bar for performance but are
fundamentally designed for centralized deployments with high-speed interconnects.
2.2 Parallelism Strategies for Distributed Inference
The choice of parallelism strategy is critical for distributed systems. Tensor parallelism [8, 9],
which partitions individual operations across devices, is highly effective but requires frequent, high-
bandwidth communication, making it suitable only for tightly-coupled GPUs in a datacenter.
In contrast, pipeline parallelism [10], which partitions model layers across different nodes, forms
the natural foundation for geographically distributed inference. Each node in the pipeline executes
a larger, more independent chunk of computation, and the activations exchanged between stages are
both less frequent and far smaller in volume than the traffic tensor parallelism requires.
PARALLAX adopts pipeline parallelism as its base strategy, enabling it to efficiently shard models
across heterogeneous, consumer-grade machines over standard internet connections.
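To make the contrast concrete, consider a rough back-of-the-envelope estimate (our notation, not the paper's): let h be the hidden size, s the token positions processed per step, b the bytes per activation element, L the number of transformer layers, n the tensor-parallel degree, and S the number of pipeline stages. Megatron-style tensor parallelism performs roughly two all-reduces per layer on the activation tensor, so each device moves on the order of

    V_TP ≈ 2L · (2(n − 1)/n) · s·h·b

bytes per step, with every all-reduce adding a synchronization round trip. A pipeline instead sends one activation tensor across each stage boundary, for a total of

    V_PP ≈ (S − 1) · s·h·b

bytes per step, independent of L, which is why stage-to-stage hand-offs tolerate commodity internet links far better than tensor-parallel collectives do.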
Figure 1: Overview of the PARALLAX infrastructure, showing the layered design from the request
interface through scheduling and the executor down to the model runner. The model runner is
hardware-specific and supports both GPU workers (via PyTorch/CUDA) and Apple Silicon workers
(via MLX/Metal kernels).
2.3 Decentralized Inference Systems
On the decentralized end of the spectrum, PETALS [11] pioneers collaborative inference across
the internet using the Hivemind library [12] for P2P communication. While innovative, its design
prioritizes model sharing over real-time performance, leading to several critical limitations. First, it
suffers from poor GPU utilization as it lacks optimized CUDA kernels for core operations. Second,
its architecture places a significant burden on the client, which is responsible for tokenization and
processing embeddings, including the computationally heavy lm_head. This creates a bottleneck
and wastes network bandwidth. Third, PETALS employs naive scheduling heuristics without fine-grained
request routing and lacks essential server-side optimizations such as PAGEDATTENTION and continuous
batching, making it unsuitable for high-throughput, low-latency interactive applications.
PARALLAX addresses these gaps by combining a high-performance, server-side execution core
inspired by vLLM and SGLang with a decentralized architecture that offloads all heavy computation
from the client.
3 Distributed Inference Infrastructure
PARALLAX’s architecture is built on a P2P foundation that enables decentralized execution across
heterogeneous devices. The system employs a layered architecture that separates hardware-agnostic
orchestration from hardware-specific execution, enabling seamless operation across both GPU clusters
and Apple Silicon Macs.
As shown in Figure 1, the PARALLAX infrastructure consists of two main layers: (1) Scheduling for
distributed model allocation and request routing across devices, and (2) Execution for per-device
orchestration, runtime management, and hardware-specific inference.
3.1 Scheduling: Model Sharding Allocator + Request Router (Across Devices,
Hardware-Agnostic)
The top layer implements a hardware-agnostic scheduling system that manages model sharding allo-
cation and request routing across the distributed swarm. This layer employs a two-phase scheduling
approach:
Phase 1 - Layer Allocation: Uses a greedy heuristic to partition model layers across the available
devices. The allocation weighs device capabilities, memory constraints, and network topology to
minimize communication overhead while maximizing resource utilization.
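The paper does not give the allocator's exact scoring rule, but the following sketch illustrates the general shape of such a greedy heuristic: devices are ranked by a capability score and handed contiguous layer ranges in decreasing order of capability, capped by each device's memory. The Device fields, the score function, and the example numbers are assumptions for illustration, not PARALLAX's implementation.

from dataclasses import dataclass

@dataclass
class Device:
    name: str           # hypothetical fields, for illustration only
    free_mem_gb: float  # memory available for weights and KV cache
    tflops: float       # rough compute capability

def greedy_layer_allocation(devices, num_layers, layer_mem_gb):
    """Assign contiguous layer ranges to devices, proportional to capability."""
    def score(d):
        # Capability = compute, weighted by how many layers fit in memory.
        return d.tflops * min(d.free_mem_gb // layer_mem_gb, num_layers)

    ranked = sorted(devices, key=score, reverse=True)
    total = sum(score(d) for d in ranked) or 1.0

    allocation, start = [], 0
    for i, d in enumerate(ranked):
        remaining = num_layers - start
        if remaining <= 0:
            break
        fit = int(d.free_mem_gb // layer_mem_gb)  # hard memory cap
        if fit == 0:
            continue
        # The last device takes whatever is left; the others take a share
        # proportional to their score, capped by what fits in their memory.
        want = remaining if i == len(ranked) - 1 else round(num_layers * score(d) / total)
        take = max(1, min(want, fit, remaining))
        allocation.append((d.name, start, start + take))
        start += take

    if start < num_layers:
        raise RuntimeError("not enough aggregate memory for all layers")
    return allocation

# Example: one RTX 5090-class GPU and one Mac sharing an 80-layer model.
devices = [Device("rtx5090", 32.0, 200.0), Device("mac_m4_pro", 48.0, 30.0)]
print(greedy_layer_allocation(devices, num_layers=80, layer_mem_gb=0.5))
# -> [('rtx5090', 0, 64), ('mac_m4_pro', 64, 80)]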
Phase 2 - Request Routing: Implements dynamic programming-based request routing that efficiently
distributes incoming requests across different pipeline replicas. The router maintains real-time load
balancing based on running batch sizes and KV pool status for overall system efficiency, adapting to
changing network conditions and device availability.
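The dynamic program itself is not spelled out in the paper; one natural formulation, sketched below with hypothetical cost estimates, treats routing as a shortest-path problem over pipeline stages, where each stage offers several candidate replicas and a node's cost grows with its running batch size and KV-pool occupancy.

def route_request(stages, link_cost, node_cost):
    """Pick one node per pipeline stage minimizing total estimated cost.

    stages:    list of lists; stages[s] holds candidate node ids for stage s.
    link_cost: dict (u, v) -> estimated ms to ship hidden states from u to v.
    node_cost: dict v -> estimated ms per step on v (e.g. derived from its
               running batch size and KV-pool occupancy).
    """
    # best[s][v]: cheapest cost of reaching node v at stage s.
    best = [{v: node_cost[v] for v in stages[0]}]
    back = [dict.fromkeys(stages[0])]

    for s in range(1, len(stages)):
        cur, prev = {}, {}
        for v in stages[s]:
            choices = [(best[s - 1][u] + link_cost[(u, v)] + node_cost[v], u)
                       for u in stages[s - 1]]
            cost, u = min(choices, key=lambda c: c[0])
            cur[v], prev[v] = cost, u
        best.append(cur)
        back.append(prev)

    # Walk back-pointers from the cheapest final-stage node.
    v = min(best[-1], key=best[-1].get)
    path = [v]
    for s in range(len(stages) - 1, 0, -1):
        v = back[s][v]
        path.append(v)
    return list(reversed(path))

Re-running this plan as fresh load reports arrive lets the router adapt to changing batch sizes and KV-pool pressure, as described above.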
This scheduling layer is completely hardware-agnostic, enabling it to orchestrate both GPU clusters
and Apple Silicon Macs seamlessly within the same distributed inference pipeline.
3.2 Executor (Per-Device)
3.2.1 Orchestrator (Hardware-Agnostic)
The Orchestrator serves as the hardware-agnostic wrapper for all per-device operations, managing
the complete lifecycle of inference requests on each node. Its responsibilities include:
Model Sharding and Loading: Each rank hosts a specific range of model layers based on the
allocation from the scheduling layer. The initial rank additionally hosts the tokenizer and embedding
layer, while the final rank hosts the language model head (lm_head). This distribution minimizes
redundant computation and optimizes memory usage across the pipeline.
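As a minimal illustration (module names follow common Hugging Face conventions and are assumptions, not PARALLAX's actual identifiers), a rank's loading decision might look like this:

def modules_for_rank(rank, world_size, layer_range):
    """Which sub-modules a pipeline rank materializes (illustrative).

    Every rank loads only its assigned transformer layers; the first rank
    additionally owns the tokenizer and embedding table, and the last rank
    owns the final norm and lm_head so logits never leave the server side.
    """
    start, end = layer_range
    modules = [f"model.layers.{i}" for i in range(start, end)]
    if rank == 0:
        modules = ["tokenizer", "model.embed_tokens"] + modules
    if rank == world_size - 1:
        modules += ["model.norm", "lm_head"]
    return modules

# A 3-rank pipeline over an 80-layer model:
print(modules_for_rank(0, 3, (0, 27)))   # tokenizer + embeddings + layers 0-26
print(modules_for_rank(2, 3, (54, 80)))  # layers 54-79 + final norm + lm_head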
Request Processing: Handles incoming requests by building hidden states and metadata from raw
request formats. The Orchestrator prepares batches by managing prefill operations, decode phases,
and eviction strategies from running batches. It implements micro-batching based on the number of
participants in the pipeline to optimize throughput.
Model Execution Coordination: Orchestrates the interaction between the runtime level components
and the hardware-specific model runner, ensuring seamless data flow through the inference pipeline.
3.2.2 Runtime (Hardware-Agnostic)
The runtime level provides hardware-agnostic abstractions for continuous batching and inter-device
communication.
Batching Scheduler:
The batching scheduler implements continuous batching with fine-grained control over prefill and
decode preferences. It dynamically manages the request pool, accepting new requests and forming
optimal batches based on:
• Micro-batching Strategy: Adapts batch sizes based on the number of participants in the
pipeline to minimize pipeline bubbles and maximize throughput.
• Prefill/Decode Optimization: Intelligently balances prefill and decode operations to opti-
mize for either latency or throughput based on system requirements.
• Dynamic Request Management: Continuously monitors request queues and adjusts batch-
ing strategies in real-time to maintain optimal performance.
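The sketch below condenses these ideas into one illustrative scheduler iteration. The request attributes, the token budget, and the rule of one micro-batch per pipeline participant are assumptions for exposition, not PARALLAX's actual policy.

def scheduler_step(pending, running, num_participants, max_batch_tokens, run_step):
    """One illustrative iteration of a continuous-batching scheduler.

    pending:          deque of waiting requests (each still needs a prefill pass)
    running:          list of requests currently in the decode phase
    num_participants: pipeline depth, used to pick the micro-batch count so
                      that every stage stays busy (fewer pipeline bubbles)
    run_step:         callable executing one pipeline step on a micro-batch
    """
    # Admit new requests while the token budget allows it; a prefill-heavy
    # preference would drain `pending` more aggressively, a decode-heavy
    # (latency-oriented) preference less so.
    budget = max_batch_tokens - sum(r.num_tokens for r in running)
    while pending and pending[0].num_tokens <= budget:
        req = pending.popleft()
        budget -= req.num_tokens
        running.append(req)

    # Split the running batch into one micro-batch per pipeline participant so
    # stage k can process micro-batch i while stage k+1 processes i-1.
    micro = max(1, num_participants)
    micro_batches = [running[i::micro] for i in range(micro)]

    for mb in filter(None, micro_batches):
        run_step(mb)  # prefill for newly admitted requests, decode otherwise

    # Retire finished requests; eviction of preempted requests would also
    # happen here when memory runs short.
    running[:] = [r for r in running if not r.finished]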
Communication Abstraction:
The communication layer provides a unified interface for inter-device communication across hetero-
geneous hardware. Built on DHT and Hivemind protocols, it handles:
• Cross-Platform Communication: Seamless data exchange between GPU clusters and
Apple Silicon Macs using protocol buffers for efficient serialization.
• Hidden State Transmission: Optimized protocols for passing hidden states and metadata
(end tokens, sequence positions) between pipeline stages.
• Network Adaptation: Dynamic adjustment of communication patterns based on network
topology and device capabilities.
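Functionally, a stage-to-stage hand-off reduces to serializing an activation tensor together with a small metadata record. The sketch below stands in for the protocol-buffer schema with NumPy and a plain dictionary, since the paper does not publish the exact message fields; the half-precision cast is an illustrative bandwidth-saving choice, not a documented PARALLAX setting.

import io
import numpy as np

def pack_hidden_states(hidden, request_ids, positions, finished):
    """Serialize one pipeline hand-off (illustrative stand-in for the
    protobuf message described in the text)."""
    buf = io.BytesIO()
    # fp16 keeps the per-token payload to hidden_size * 2 bytes, which
    # matters over commodity internet links.
    np.save(buf, hidden.astype(np.float16), allow_pickle=False)
    meta = {
        "request_ids": list(request_ids),  # which requests are in this batch
        "positions": list(positions),      # next sequence position per request
        "finished": list(finished),        # end-of-sequence flag per request
    }
    return buf.getvalue(), meta

def unpack_hidden_states(payload, meta):
    hidden = np.load(io.BytesIO(payload), allow_pickle=False).astype(np.float32)
    return hidden, meta["request_ids"], meta["positions"], meta["finished"]

# A batch of 4 requests with hidden size 8192:
h = np.random.randn(4, 8192).astype(np.float32)
blob, meta = pack_hidden_states(h, ["r1", "r2", "r3", "r4"],
                                [17, 5, 5, 903], [False] * 4)
h2, ids, pos, done = unpack_hidden_states(blob, meta)
assert h2.shape == (4, 8192)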
3.2.3 Model Runner (Hardware-Specific)
The Model Runner represents the hardware-specific execution layer, optimized for each target
platform.
KV-Cache Manager:
The KV-cache manager handles efficient memory management for attention mechanisms [13, 14],
implementing:
• Memory Optimization: Efficient allocation and deallocation of key-value cache memory
based on sequence length and batch size.
• Cache Eviction: Intelligent eviction strategies to maximize cache hit rates while managing
memory constraints.
• Platform-Specific Optimization: Tailored memory-management strategies for dedicated GPU
device memory and Apple Silicon’s unified memory architecture.
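A minimal sketch of such a manager follows, assuming a block-based pool in the spirit of PagedAttention; the block size, the data structures, and the least-recently-used eviction policy are illustrative choices rather than PARALLAX's documented behavior.

import collections

class KVCacheManager:
    """Illustrative block-based KV-cache pool.

    The cache is carved into fixed-size blocks of `block_tokens` positions;
    a sequence owns a list of block ids, so memory grows block by block
    instead of being pre-reserved for the maximum sequence length.
    """

    def __init__(self, num_blocks, block_tokens=16):
        self.block_tokens = block_tokens
        self.free = collections.deque(range(num_blocks))
        self.blocks = collections.defaultdict(list)  # seq_id -> [block ids]
        self.lru = collections.OrderedDict()         # seq_id -> None, oldest first

    def append_token(self, seq_id, seq_len):
        """Ensure `seq_id` has a block covering position `seq_len` (0-based)."""
        self.lru.pop(seq_id, None)
        self.lru[seq_id] = None                       # mark most recently used
        needed = seq_len // self.block_tokens + 1
        while len(self.blocks[seq_id]) < needed:
            while not self.free:
                self._evict(exclude=seq_id)
            self.blocks[seq_id].append(self.free.popleft())

    def _evict(self, exclude):
        # Simplistic policy: drop the least-recently-used sequence's blocks.
        # A real server would prefer finished or preemptible requests.
        for victim in self.lru:
            if victim != exclude:
                break
        else:
            raise RuntimeError("KV pool exhausted")
        self.free.extend(self.blocks.pop(victim))
        del self.lru[victim]

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.blocks.pop(seq_id, []))
        self.lru.pop(seq_id, None)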
Hardware-Specific Execution:
The Model Runner supports two execution backends:
GPU Execution (SGLang): Leverages SGLang’s optimized CUDA kernels for high-performance
inference on NVIDIA GPUs. This backend provides efficient matrix operations, optimized attention
mechanisms, and seamless integration with the distributed pipeline.
Apple Silicon Execution (MLX): Utilizes the MLX framework [15, 16], which runs on Apple’s Metal
GPU backend, for optimized inference on Apple Silicon Macs. This backend takes advantage of Apple
Silicon’s unified memory architecture for efficient model execution.
Both backends maintain identical interfaces to the runtime layer, ensuring seamless operation within
the distributed pipeline while leveraging platform-specific optimizations.
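Conceptually, the runtime layer programs against a single abstract runner; the class and method names below are illustrative, not PARALLAX's actual API, and each concrete backend would delegate to SGLang/CUDA or MLX internally.

from abc import ABC, abstractmethod

class ModelRunner(ABC):
    """Hardware-agnostic contract the runtime layer could program against."""

    @abstractmethod
    def load_shard(self, model_id: str, layer_range: tuple) -> None:
        """Load only the assigned layers (plus embeddings or lm_head on the
        first and last ranks, respectively)."""

    @abstractmethod
    def forward(self, hidden_states, batch_meta):
        """Run the local layer range; return hidden states (or logits on the
        last rank) for the communication layer to forward."""

class SGLangRunner(ModelRunner):
    """CUDA path: would wrap SGLang's optimized GPU runtime."""
    def load_shard(self, model_id, layer_range): ...
    def forward(self, hidden_states, batch_meta): ...

class MLXRunner(ModelRunner):
    """Apple Silicon path: would wrap an MLX model running on Metal."""
    def load_shard(self, model_id, layer_range): ...
    def forward(self, hidden_states, batch_meta): ...

def make_runner(hardware: str) -> ModelRunner:
    return MLXRunner() if hardware == "apple_silicon" else SGLangRunner()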
4 Experimental Evaluation
This section presents comprehensive experiments to evaluate PARALLAX performance and compare
it with baseline distributed inference systems. The evaluation focuses on latency, throughput, and
scalability using real-world workloads.
4.1 Experimental Setup
4.1.1 Hardware Configuration
The evaluation is conducted on distributed two-node networks built from consumer-grade hardware:
a pair of nodes each equipped with an NVIDIA RTX 5090 GPU, and a heterogeneous pairing of an
RTX 5090 node with an Apple Mac M4 Pro 64G. While PARALLAX also supports data-center GPU
deployments, these configurations are chosen to specifically validate its performance on consumer-grade
hardware, which represents a key and challenging use case for decentralized inference.
4.1.2 Models and Workloads
The evaluation uses two models to assess scalability across parameter counts: Qwen2.5-72B-Instruct
and the larger Qwen3-235B-A22B [17], both with GPTQ-Int4 quantization [18], under various
input/output configurations:
• Single request configurations: 1×1K, 1×4K, 1×8K, 1×16K tokens input
• Multi-request configurations: 4×1K, 8×1K tokens input
• Fixed output length: 1024 tokens for all configurations
4.1.3 Baseline Systems
The comparison baseline is Petals, a state-of-the-art decentralized collaborative inference framework
that provides distributed LLM serving capabilities similar to the proposed system.
Table 1: Performance comparison of PARALLAX vs. Petals on the Qwen2.5-72B model.

Framework                          Input Config   E2E Lat. (s)   TTFT (s)   ITL (ms)   Input TP (tok/s)   Output TP (tok/s)
PARALLAX (RTX 5090)                1×4K           46.6           5.0        40.7       87.9               22.0
PARALLAX (RTX 5090)                1×8K           52.7           9.9        41.8       155.5              19.4
PARALLAX (RTX 5090)                1×16K          64.6           20.6       43.0       255.0              15.8
PARALLAX (RTX 5090)                4×1K           46.8           3.4        42.5       87.5               87.5
PARALLAX (RTX 5090)                8×1K           62.4           7.9        53.3       131.3              131.3
PARALLAX (RTX 5090 + Mac M4 Pro)   1×1K           175.2          14.4       157.2      5.8                5.8
PARALLAX (RTX 5090 + Mac M4 Pro)   1×4K           242.4          64.9       173.6      16.9               4.2
PARALLAX (RTX 5090 + Mac M4 Pro)   4×1K           544.5          65.1       468.7      7.5                7.5
Petals                             1×4K           143.5          14.4       216.5      28.6               7.1
Table 2: Performance evaluation of PARALLAX on the Qwen3-235B model.

Framework                          Input Config   E2E Lat. (s)   TTFT (s)   ITL (ms)   Input TP (tok/s)   Output TP (tok/s)
PARALLAX (2×RTX 5090)              1×1K           65.5           2.9        61.2       15.6               15.6
PARALLAX (2×RTX 5090)              1×4K           75.1           13.4       60.3       54.5               13.6
PARALLAX (2×RTX 5090)              4×1K           99.3           8.7        88.6       41.2               41.2
PARALLAX (RTX 5090 + Mac M4 Pro)   1×1K           104.9          8.1        94.6       9.8                9.8
PARALLAX (RTX 5090 + Mac M4 Pro)   1×4K           150.0          34.2       113.2      27.3               6.8
PARALLAX (RTX 5090 + Mac M4 Pro)   4×1K           320.4          30.2       283.6      12.8               12.8
4.2 Performance Evaluation
4.2.1 Latency and Throughput Analysis
Table 1 presents detailed performance metrics comparing PARALLAX with the Petals baseline across
different input configurations. All experiments use the Qwen2.5-72B-Instruct-GPTQ-Int4 model with
1024 output tokens, testing both the RTX 5090 GPU cluster and the heterogeneous RTX 5090 + Mac M4
Pro 64G distributed inference configuration.
Table 2 presents performance results for the larger Qwen3-235B-A22B-GPTQ-Int4 model, demon-
strating PARALLAX’s capability to scale to larger model sizes. All experiments use 1024 output
tokens and test scaling performance across both dual RTX 5090 GPU and heterogeneous RTX 5090 +
Mac M4 Pro 64G setups.
Key Findings:
• 72B Model Performance: PARALLAX achieves 3.1× lower end-to-end latency compared to
Petals (46.6s vs 143.5s for 1×4K configuration), with 5.3× better inter-token latency (40.7ms
vs 216.5ms)
• 235B Model Scaling: Successfully demonstrates scalability to larger models, with dual RTX
5090 achieving 75.1s end-to-end latency for 1×4K input on the 235B model, maintaining
consistent inter-token latency (60.3ms)
• Heterogeneous Hardware Performance: Both models show effective cross-platform
execution, with the 235B model achieving 150.0s end-to-end latency on heterogeneous RTX
5090 + Mac M4 Pro 64G setup
• Multi-Request Handling: Demonstrates strong concurrent processing capabilities, with
4×1K requests achieving 99.3s total latency on dual GPUs for the 235B model
• Hardware Utilization: Results validate PARALLAX’s ability to effectively utilize both
homogeneous GPU clusters and heterogeneous consumer hardware for large-scale LLM
inference across different model sizes
4.3 Scalability Analysis
The evaluation demonstrates that PARALLAX maintains consistent performance across different batch
sizes, input lengths, and model sizes. The system shows excellent scalability characteristics:
Model Size Scaling: PARALLAX successfully scales from 72B to 235B parameters, demonstrat-
ing its capability to handle increasingly large models while maintaining reasonable performance
characteristics.
Input Length Scaling: For the 72B model, performance remains stable as input length increases
from 4K to 16K tokens, with inter-token latency staying within a narrow range (40.7-53.3ms). The
235B model shows similar consistency with inter-token latency of 60.3-61.2ms across different input
configurations.
Concurrent Processing: Multi-request scenarios demonstrate effective resource utilization, with the
235B model achieving 99.3s total latency for 4×1K concurrent requests on dual RTX 5090 setup.
Hardware Heterogeneity: The system maintains performance across heterogeneous hardware
configurations, successfully orchestrating both GPU clusters and mixed GPU+Mac setups for models
of different sizes.
The experimental results demonstrate that PARALLAX successfully addresses distributed LLM
inference challenges across multiple dimensions of scale, achieving superior performance compared
to existing frameworks while maintaining flexibility in hardware deployment.
5 Conclusion
This paper presents PARALLAX, a distributed LLM inference framework that harnesses the untapped
potential of consumer hardware for large-scale AI. By implementing a novel P2P-based pipeline-
parallelism strategy, PARALLAX orchestrates a network of heterogeneous consumer devices, including
NVIDIA GPUs and Apple Silicon Macs, into a decentralized inference cluster. The experimental results demonstrate
3.1× lower end-to-end latency, 5.3× better inter-token latency, and 3.1× higher throughput compared
to existing decentralized systems.
The key contributions include: (1) a P2P architecture that maps pipeline stages to individual network
nodes, enabling direct hidden-state exchange; and (2) the first successful demonstration of large-scale
LLM inference on a heterogeneous cluster that includes consumer-grade Apple Silicon Macs, leveraging
MLX for on-device performance.
PARALLAX marks a significant step towards democratizing access to large language models, proving
that accessible, high-performance LLM inference is achievable beyond centralized data centers and
on the hardware people already own.
References
[1] BigScience Workshop et al. “BLOOM: A 176B-parameter open-access multilingual language
model”. In: arXiv preprint arXiv:2211.05100 (2022).
[2] Susan Zhang et al. “OPT: Open pre-trained transformer language models”. In: arXiv preprint
arXiv:2205.01068 (2022).
[3] Hugo Touvron et al. “Llama 2: Open foundation and fine-tuned chat models”. In: arXiv preprint
arXiv:2307.09288 (2023).
[4] Josh Achiam et al. “GPT-4 technical report”. In: arXiv preprint arXiv:2303.08774 (2023).
[5] Woosuk Kwon et al. “Efficient memory management for large language model serving with
PagedAttention”. In: Proceedings of the 29th Symposium on Operating Systems Principles. 2023,
pp. 611–626.
[6] Tri Dao et al. “FlashAttention: Fast and memory-efficient exact attention with IO-awareness”.
In: Advances in Neural Information Processing Systems 35 (2022), pp. 16344–16359.
[7] NVIDIA. NCCL: NVIDIA Collective Communications Library. https://developer.nvidia.com/nccl.
NVIDIA Developer Documentation. 2023.
[8] Mohammad Shoeybi et al. “Megatron-LM: Training multi-billion parameter language models
using model parallelism”. In: arXiv preprint arXiv:1909.08053 (2019).
[9] Deepak Narayanan et al. “Efficient Large-Scale Language Model Training on GPU Clusters
Using Megatron-LM”. In: arXiv preprint arXiv:2104.04473 (2021).
[10] Yanping Huang et al. “GPipe: Efficient Training of Giant Neural Networks using Pipeline
Parallelism”. In: Advances in Neural Information Processing Systems. Vol. 32. 2019.
[11] Alexander Borzunov et al. “Petals: Collaborative inference and fine-tuning of large models”.
In: arXiv preprint arXiv:2209.01188 (2022).
[12] Max Ryabinin et al. Hivemind: Decentralized Deep Learning in PyTorch. Online, Apr. 2020.
URL: https://github.com/learning-at-home/hivemind.
[13] Benjamin Lefaudeux et al. xFormers: A modular and hackable Transformer modelling library.
https://github.com/facebookresearch/xformers. 2022.
[14] Woosuk Kwon et al. “Efficient Memory Management for Large Language Model Serving with
PagedAttention”. In: arXiv preprint arXiv:2309.06180 (2023).
[15] Awni Hannun et al. MLX: An array framework for machine learning on Apple silicon.
https://github.com/ml-explore/mlx. Apple Machine Learning Research. 2023.
[16] Apple. MLX: An array framework for machine learning on Apple silicon.
https://github.com/ml-explore/mlx. Apple Machine Learning Research, Updated version. 2024.
[17] Jinze Bai et al. “Qwen Technical Report”. In: arXiv preprint arXiv:2309.16609 (2023).
[18] Elias Frantar et al. “GPTQ: Accurate post-training quantization for generative pre-trained
transformers”. In: arXiv preprint arXiv:2210.17323 (2022).