Evaluating Bufferless Flow Control For On-Chip Networks
   Abstract—With the emergence of on-chip networks, the power consumed by router buffers has become a primary concern. Bufferless flow control addresses this issue by removing router buffers, and handles contention by dropping or deflecting flits. This work compares virtual-channel (buffered) and deflection (packet-switched bufferless) flow control. Our evaluation includes optimizations for both schemes: buffered networks use custom SRAM-based buffers and empty buffer bypassing for energy efficiency, while bufferless networks feature a novel routing scheme that reduces average latency by 5%. Results show that unless process constraints lead to excessively costly buffers, the performance, cost and increased complexity of deflection flow control outweigh its potential gains: bufferless designs are only marginally (up to 1.5%) more energy efficient at very light loads, and buffered networks provide lower latency and higher throughput per unit power under most conditions.

                        I. INTRODUCTION

   Continued improvements in VLSI technology enable integration of an increasing number of logic blocks on a single chip. Scalable packet-switched networks-on-chip (NoCs) [7] have been developed to serve the communication needs of such large systems. As system size increases, these interconnects become a crucial factor in the performance and cost of the chip.
   Compared to off-chip networks, on-chip wires are cheaper and buffer cost is more significant [7]. Router buffers are used to queue packets or flits that cannot be routed immediately due to contention [6]. Several proposals eliminate router buffers to reduce NoC cost. In these bufferless schemes, contending packets or flits are either dropped and retransmitted by their source [9] or deflected [2], [18] to a free output port. Frequent retransmissions or deflections degrade network performance. However, under light load, dropping or deflecting may occur infrequently enough to have a small impact on performance.
   Bufferless flow control proposals often report large area and power savings compared to conventional buffered networks (e.g. 60% area and up to 39% energy savings in a conventional CMP network [18]). However, previous work has aimed to reduce the cost of router buffers. For example, by using custom SRAM-based implementations, buffers can consume as little as 6% of total network area and 15.5% of total network power in a flattened butterfly (FBFly) network [14], [17]. Furthermore, flits can bypass empty buffers in the absence of contention [24], reducing dynamic power consumption in lightly-loaded networks. These optimizations may reduce buffering overhead up to a point where the extra complexity and performance issues of bufferless flow control outweigh potential cost savings.
   In this paper, we compare a state-of-the-art packet-switched bufferless network with deflecting flow control, BLESS [18], and the currently-dominant virtual channel (VC) buffered flow control [6]. To perform an equitable comparison, we optimize both networks. In particular, VC networks feature efficient custom SRAM buffers and empty buffer bypassing. We also propose a novel routing scheme for BLESS, where flits bid for all outputs that would reduce their distance to their destinations regardless of dimension order constraints. In an 8×8 2D mesh with 5×5 routers and uniform traffic, this routing scheme reduces latency by 5% over dimension-ordered routing (DOR).
   We find that bufferless flow control provides a minimal advantage at best: in a lightly-loaded 8×8 2D mesh, bufferless flow control reduces power consumption by only 1.5%, mostly due to buffer leakage power, when using a high-performance, high-leakage process. However, at medium or high loads the buffered network offers significantly better performance and higher power efficiency with 21% more throughput per unit power, as well as a 17% lower average latency at a 20% flit injection rate. The buffered network becomes more energy efficient at flit injection rates of 7% (11% with low-swing channels). Buffer optimizations play a crucial role: at a flit injection rate of 20%, buffers without bypassing consume 8.5× the dynamic power of buffers with bypassing. Finally, the age-based allocator required to prevent livelocks in BLESS is 81% slower than an input-first separable switch allocator used in VC flow control.
   The rest of this paper is organized as follows: Section II provides the necessary background on bufferless interconnects. Section III discusses our evaluation methodology. Section IV presents our novel routing scheme for BLESS. Section V examines the implications for router microarchitecture. In Section VI we present our evaluation and results. Section VII discusses further design parameters that can affect our comparisons. Finally, Section VIII concludes this paper.

                        II. BACKGROUND

   Deflection flow control was first proposed as “hot-potato” routing in off-chip networks [2]. Recent work has found that network topology is the most important factor affecting performance, and that global or history-related deflection criteria are beneficial [16]. Furthermore, dynamic routing can be used to provide an upper bound for delivery time [4].
   In this paper, we consider BLESS, a state-of-the-art bufferless deflection flow control proposal for NoCs [18]. In BLESS, flits bid for their preferred output. If the allocator is unable to grant that output, the flit is deflected to any free output. This requires that routers have at least as many outputs as inputs. Flits bid for a single output port, following deterministic DOR. To avoid livelock, older flits are given priority. Finally, injecting flits to a router requires a free output port to avoid deflecting flits to ejection ports.
   Two BLESS variants were evaluated in [18], FLIT-BLESS and WORM-BLESS. In FLIT-BLESS, every flit of a packet can be routed independently. Thus, all flits need to contain routing information, imposing overhead compared to buffered networks, where only head flits contain routing information. To reduce this overhead, WORM-BLESS tries to avoid splitting worms by providing subsequent flits in a packet with higher priority for allocating the same output as the previous flit. However, worms
may still have to be split under congestion, and WORM-BLESS still needs to be able to route all flits independently.
   Due to the lack of VCs, traffic classes need to be separated with extra mechanisms or by duplicating physical channels. Additionally, BLESS requires extra logic and buffering at network destinations to reorder flits of the same packet arriving out of order. The number of packets that can arrive interleaved is practically unbounded, unlike VC networks where destinations just need a FIFO buffer per VC.
   While most bufferless proposals either drop or deflect flits, both mechanisms can be combined [8]. Other techniques also eliminate buffers: circuit-switching relies on establishing end-to-end circuits in which flits never contend [15]. Finally, elastic buffer flow control [17] uses channels as distributed FIFOs in place of router buffers.

                      III. METHODOLOGY

   We use a modified version of Booksim [5] for cycle-accurate microarchitecture-level network simulation. To estimate area and power we use ITRS predictions for a 32nm high-performance process [12], operating at 70°C. Modeling buffer costs accurately is fundamental in our study. Orion [23] is the standard modeling tool in NoC studies, but a recent study shows that it can lead to large errors [13], and the update fixing these issues was not available at the time of this work. Instead, we use the models from Balfour and Dally [1], which are derived from basic principles, and validate SRAM models using HSPICE.
   We assume a clock frequency of 2GHz and 512-bit packets. We model channel wires as being routed above other logic and include only repeater and flip-flop area in channel area. The number and size of repeaters per wire segment are chosen to minimize energy. Our conservative low-swing model has 30% of the full-swing repeated wire traversal power and twice the channel area [11]. Router area is estimated using detailed floorplans. VC buffers use efficient custom SRAM-based buffers. We do not use area and power models for the allocators, but perform a detailed comparison by synthesizing them. Synthesis is performed using Synopsys Design Compiler and a low-power commercial 45nm library under worst-case conditions. Place and route is done using Cadence Silicon Encounter. Local clock gating is enabled.
   We choose FLIT-BLESS for our comparisons. FLIT-BLESS performs better than WORM-BLESS [18], but incurs extra overhead because all flits contain routing information. However, in our evaluation we do not model this overhead, giving BLESS a small advantage over buffered flow control.
   We use two topologies for a single physical network with 64 terminals. The first is an 8×8 2D mesh with single-cycle channels. Routers are 5×5 and have one terminal connected to them. The second is a 2D FBFly [14] with four terminals connected to each router. Therefore, there are 16 10×10 routers laid out on a 4×4 grid. Short, medium and long channels are two, four and six clock cycles long, respectively. Injection and ejection channels are a single cycle long. For both topologies, one clock cycle corresponds to a physical length of 2 mm. These channel lengths are chosen so that both networks cover an area of about 200 mm², a typical die size in modern processes [22].
   We assume a two-stage router design. The VC network features input-first separable round-robin allocators, speculative switch allocation [21] and input buffer bypassing [24]. No communication protocol is assumed. The VC network uses DOR for the mesh and FBFly. The deflection network uses multidimensional routing, explained in Section IV. We do not assume adaptive routing for the VC network since such a comparison would require adaptive routing for the bufferless network as well. We choose the number of VCs and buffer slots to maximize throughput per unit power. While this penalizes the VC network in area efficiency, power is usually the primary constraint.
   We generate results for either uniform random traffic or we average over a set of traffic patterns: uniform random, random permutation, shuffle, bit complement, tornado and neighbor [5]. This set is extended for the FBFly to include transpose and a traffic pattern that illustrates the effects of adversarial traffic for networks with a concentration factor. Averaging among traffic patterns makes our results less sensitive to effects caused by specific traffic patterns.

                 IV. ROUTING IN BUFFERLESS NETWORKS

   BLESS networks use DOR [18]. In VC networks, DOR prevents cyclic network dependencies without extra VCs. However, in bufferless networks flits never block waiting for buffers, so there can be no network deadlocks, making DOR unnecessary.
   We propose two oblivious routing algorithms that decrease deflection probability. We observe that a flit often has several productive outputs (i.e. outputs that would get the flit closer to its destination). For example, in a 2D mesh, those are the two outputs shown in Figure 1(a), unless the flit is already at one of the axes of its final destination. Our first routing algorithm, multi-dimensional routing (MDR), exploits choice by having flits request all of their productive outputs. If both outputs are available, the switch allocator assigns one pseudorandomly.
   With MDR, there is one productive output in each dimension with remaining hops. If a flit exhausts all hops in a dimension, it will have one less productive output, increasing its deflection probability. We can improve MDR by prioritizing the dimension that has the most remaining hops, which increases the number of productive outputs at subsequent hops. We call this scheme prioritized MDR (PMDR). Figure 1(b) shows an example path with PMDR in a 2D mesh. Due to PMDR, all the hops except the last one have two productive outputs. In an FBFly with minimal routing, flits only take one hop in each dimension, so PMDR is equivalent to MDR. However, PMDR increases allocator complexity: since a BLESS allocator already needs to prioritize flits by age, PMDR requires either prioritizing output ports or two allocation iterations.
   Figure 2 compares DOR, MDR and PMDR in mesh and FBFly bufferless networks. In the mesh, MDR offers 5% lower average latency than DOR and equal maximum throughput. In the FBFly, MDR achieves 2% lower average latency and 3% higher maximum throughput. Under a sample 20% flit injection rate, 13% more flits were only able to choose a single output with DOR compared to MDR. Also, PMDR achieves only marginal improvements over MDR (0.5% lower average latency in the mesh). Given its higher allocator complexity, we use MDR for the rest of the evaluation.

                  V. ROUTER MICROARCHITECTURE

   This section explores router microarchitecture issues pertinent to our study of bufferless networks.
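As background for the allocator discussion, the BLESS allocation rule described in Section II (oldest flit first, deflect when no requested output is free) can be sketched in a few lines of Python. This is only an illustrative model, not the synthesized allocator: the function name `allocate`, the tuple layout of a flit, the port labels and the pseudorandom choice of the deflection output are all assumptions of this sketch, and ejection and injection ports are not modeled.

```python
import random

def allocate(flits, outputs, rng=random):
    """Age-based allocation with deflection (illustrative sketch).

    flits   -- list of (flit_id, age, requested_outputs) tuples (hypothetical layout)
    outputs -- list of output port names; must be at least as long as flits
    Returns {flit_id: (granted_output, was_deflected)}.
    """
    # Deflection only works if every arriving flit can leave on some output.
    assert len(flits) <= len(outputs)
    free = list(outputs)
    grants = {}
    # Older flits are served first, which is the livelock-avoidance rule.
    for flit_id, age, requests in sorted(flits, key=lambda f: -f[1]):
        candidates = [o for o in requests if o in free]
        if candidates:
            # With several free requested outputs (as under MDR), pick one pseudorandomly.
            out, deflected = rng.choice(candidates), False
        else:
            # No requested output is free: deflect to any free output.
            out, deflected = rng.choice(free), True
        free.remove(out)
        grants[flit_id] = (out, deflected)
    return grants

if __name__ == "__main__":
    demo = [("a", 5, ["E"]), ("b", 9, ["E"]), ("c", 1, ["E", "N"])]
    print(allocate(demo, ["N", "E", "S", "W"], random.Random(0)))
```

In the demo, flit `b` is the oldest and therefore wins the contested output `E`, while flit `a` is always deflected. The sequential loop over flits sorted by age hints at why an age-based allocator is harder to make fast than a separable one.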
[Figure: two 2D mesh panels with X and Y axes, each showing a path from source S to destination D. (a) MDR requests all productive outputs at each hop. (b) PMDR prioritizes the output with the most hops remaining in that dimension.]
Figure 1. MDR and PMDR routing algorithms for deflection bufferless networks.
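The request rules illustrated in Figure 1 can be written down compactly for a 2D mesh. The sketch below is a hedged illustration rather than the simulator code: the coordinate convention `(x, y)`, the port names `E`/`W`/`N`/`S` (with `N` meaning increasing `y`), and the function names are assumptions introduced here. DOR is modeled as X-first, and ejection at the destination is left out.

```python
def productive_outputs(cur, dst):
    """Outputs that move a flit one hop closer to dst in a 2D mesh."""
    (x, y), (dx, dy) = cur, dst
    outs = []
    if dx != x:
        outs.append("E" if dx > x else "W")
    if dy != y:
        outs.append("N" if dy > y else "S")
    return outs

def dor_request(cur, dst):
    """DOR: request a single output, X dimension first."""
    return productive_outputs(cur, dst)[:1]

def mdr_request(cur, dst):
    """MDR: request every productive output, with no preference among them."""
    return productive_outputs(cur, dst)

def pmdr_request(cur, dst):
    """PMDR: productive outputs ordered by remaining hops, most hops first."""
    (x, y), (dx, dy) = cur, dst
    hops = {"E": abs(dx - x), "W": abs(dx - x), "N": abs(dy - y), "S": abs(dy - y)}
    # sorted() is stable, so equal hop counts keep the X output first.
    return sorted(productive_outputs(cur, dst), key=lambda o: -hops[o])

if __name__ == "__main__":
    print(dor_request((0, 0), (1, 3)))   # ['E']
    print(mdr_request((0, 0), (1, 3)))   # ['E', 'N']
    print(pmdr_request((0, 0), (1, 3)))  # ['N', 'E']
```

For a flit at (0, 0) headed to (1, 3), PMDR prefers the Y output because three hops remain in Y and only one in X. Spending the X hop first would leave a single productive output for the rest of the path, which is the situation PMDR tries to postpone.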
[Figure: routing comparison plots. Panels: "2D mesh. Routing comparison. Width 64 bits. Uniform traffic" (series: DOR, MDR, PMDR) and "2D FBFly. Routing comparison. Width 64 bits. Uniform traffic" (series: DOR, MDR = PMDR). X axis: injection rate (flits/cycle * 100).]
[Figure: allocator diagram; recoverable labels: "Grant signals", "Local request (lowest prio)".]
Table I. ALLOCATOR SYNTHESIS COMPARISON.
[Figure: four plots comparing VC-buffered and Deflection networks. Panels: two titled "8x8 2D mesh. Width 64 bits. Uniform traffic" and two titled "4x4 2D FBFly. Width 64 bits. Uniform traffic". Left-hand panels plot average latency (clock cycles); X axis in all panels: injection rate (flits/cycle * 100).]
[Figure: "8x8 2D mesh. Uniform traffic. 20% flit injection rate". Series: VC-buffered, Deflection. Y axis: percentage of occurrences.]
power to achieve equal throughput compared to BLESS. The deflection network provides 5% more throughput per unit area due to the buffers occupying 30% of the area, as explained in Section VI-B. Consequently, the deflection network requires
[Figure: four plots of average maximum throughput (packets/cycle * 100). Panels: "8x8 2D mesh. Average of the 6 traffic patterns" against power consumption (W) and against area (square mm); "4x4 2D FBFly. Average of the 8 traffic patterns" against power consumption (W) and against area (square mm).]
(1% more) throughput per unit area for the mesh, and 6% more for the FBFly. This increase in area efficiency for the VC network is due to differential signaling, which doubles the channel area, thus reducing the percentage of the total area occupied by the buffers to 19%. Moreover, the deflection network consumes less power for flit injection rates smaller than 11% for the mesh, and 8% for the FBFly. However, compared to the VC network, the power consumed by the deflection network is never less than 98.5% for the mesh and 99% for the FBFly. Figure 9 illustrates the results for the mesh.

D. Deadlock and Endpoint Buffers
   In a network with a request-reply protocol, destinations might be waiting for replies to their own requests before being able to serve other requests [10]. Those replies might be sent from a distant source or might face heavy contention. Therefore, arriving requests might find the destination’s ejection buffers to be full, without a mechanism to prevent or handle this scenario.
   Preventing this requires ejection buffers able to cover for all possible sources and their maximum outstanding requests. As an example, in a system with 64 processors where each node can have 4 outstanding requests to each of four cache banks (16 requests total), each processor and cache bank needs to buffer 256 requests. This requires a total buffer space of 128KB, whereas an 8×8 2D mesh with 2 VCs, each having 8 64-bit buffer slots, needs 20KB. Note that 20KB is only a small fraction of the system-wide SRAM memory in many designs (e.g. CMPs with multi-megabyte caches). Alternatively, flits that cannot be buffered can be dropped or deflected back to the router, or extra complexity can be added, such as feedback to the router so that flits are sent to the ejection port only if there is buffer space for them. This issue becomes more severe with more complex protocols.

E. Flit Injection
   Injection in deflecting flow control requires feedback from the router because at least one output port must be free [18]. However, acquiring this information is problematic, especially if the round-trip distance between the router and the source is more than one clock cycle. Alternatively, flits can be deflected back to the source if there is no free output. However, this causes contention with ejecting flits and costs extra energy. In any case, the injection buffer size may need to be increased to prevent the logic block (e.g. the CPU) from blocking.

F. Process Technology
   Our evaluation uses a 32nm high-performance ITRS-based process as a worst case due to its high leakage current. To illustrate the other extreme, we use the commercial 45nm low-power library used for synthesis in Section V-A. Its leakage current is negligible. With empty buffer bypassing, the deflection network never consumes less energy than the VC network. Both
[Figure 8 panels: Channel power breakdown (traversal, clock, flip-flop, leakage) for VC and BLESS, Power (W); Router power breakdown excluding buffers (Xbar traversal, Xbar ctrl, Xbar leakage, output FF, output leakage) for VC and BLESS, Power (W); Buffer power breakdown in VC networks (dynamic, leakage) with and without bypass, Power (W); Area breakdown (channel, crossbar, output, SRAM buffers), Area (mm2).]
Figure 8. Power and area breakdowns for the 2D mesh under a 20% flit injection rate with full-swing channels.
[Figure 9 panels: 8x8 2D mesh, low-swing channels, uniform traffic, x-axis: Injection rate (flits/cycle * 100); 8x8 2D mesh, low-swing channels, average of the 6 patterns, x-axis: Power consumption (W).]
Figure 9. Power consumption with varying injection rate and throughput-power Pareto-optimal curves with low-swing channels.
consume approximately the same amount of power even for very low injection rates. Furthermore, the VC mesh described in Section VI-A provides 21% more throughput per unit power and 10% more throughput per unit area. Therefore, there are no design points that would make the deflection network more efficient in this process.
   Changing process technologies affects the ratio of buffer power to overall network power. Extremely costly buffer implementations would increase this ratio in favor of the bufferless network. In such processes, the bufferless network might be the most efficient choice. However, even the 32nm high-leakage process we used does not fall in this category. In any case, design effort should first be spent on implementing the buffers more efficiently before considering bufferless networks.

                              VII. DISCUSSION

   Our quantitative evaluation tries to cover the design parameters that are most likely to affect the tradeoffs between buffered and bufferless networks. However, it is infeasible to characterize the full design space quantitatively. In this section we qualitatively discuss the effect of varying additional parameters.
   Traffic classes: Systems requiring a large number of traffic classes or VCs may have allocators slower than the age-based allocator of Section V-A. However, more traffic classes also increase the demand for endpoint buffering discussed in Section VI-D. For a single traffic class, we have shown that at least a buffered network with 2 VCs is more efficient than a deflection network.
   Network size: While network size affects the relevant tradeoffs, smaller networks provide fewer deflection paths. The deflection and buffering probabilities are similarly affected by size. Thus, neither network is clearly favored by varying network size.
   Sub-networks: A deflection network design could be divided into sub-networks to make it more efficient, but the same is true for the VC network. For each sub-network of the deflection network, we can apply our findings to design a similar and more efficient buffered network.
   Dropping flow control: Dropping flow control faces different challenges. For example, its allocators are not constrained to produce a complete matching. However, dropping flow control requires buffering at the sources. Dropping, like deflecting, causes flits to traverse extra hops, which translates to energy cost and increased latency. Therefore, the fundamental tradeoff between buffer and extra hop costs remains. However, the number of extra hops in dropping networks is affected by topology and routing more than in deflection networks. In general, dropping flow control may be more or less efficient than deflection flow control, depending on the particular network design.
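   The tradeoff between buffer cost and extra-hop cost can be illustrated with a deliberately simplified per-flit energy model. The sketch below is not the power model used in this evaluation; the function names, the per-hop and per-buffer energy parameters, and the assumption that a buffered flit pays one buffer access at every hop (no bypassing, no leakage) are hypothetical and serve only to show where the break-even point lies.

```python
def buffered_energy(min_hops, e_hop, e_buffer):
    """Per-flit energy when every hop pays a traversal cost plus a buffer cost.

    Assumption: one buffer write/read per hop, no bypassing, no leakage.
    """
    return min_hops * (e_hop + e_buffer)


def bufferless_energy(min_hops, extra_hops, e_hop):
    """Per-flit energy with no buffers but extra hops.

    extra_hops is the average number of additional hops caused by
    deflections (or by retransmissions in a dropping network).
    """
    return (min_hops + extra_hops) * e_hop


def break_even_extra_hops(min_hops, e_hop, e_buffer):
    """Average extra hops at which both schemes cost the same energy per flit."""
    return min_hops * e_buffer / e_hop


if __name__ == "__main__":
    # Hypothetical values: 6 minimal hops, buffer cost is 25% of a hop's cost.
    hops, e_hop, e_buf = 6, 1.0, 0.25
    limit = break_even_extra_hops(hops, e_hop, e_buf)
    print(f"bufferless wins only below {limit} extra hops per flit")
```

   Under these assumptions, the bufferless scheme is more energy efficient only while the average number of extra hops stays below the minimal hop count scaled by the ratio of buffer energy to hop energy. Cheaper buffers (for example, through bypassing) lower that threshold.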
   Self-throttling sources: In our evaluation, traffic sources do not block under any condition (e.g., if a maximum number of outstanding requests is reached). Self-throttling sources are more likely to be blocked when using a deflection network due to its latency distribution, as discussed in Section VI-A. Blocking the sources hides the performance inefficiencies of the network by controlling the network load. This favors network-level metrics, but penalizes system performance. For example, in a CMP, blocking the CPUs increases execution time, which is the performance measurable by end users. Complete system implementations are likely to use self-throttling sources. Therefore, performing an equitable comparison requires taking the number of cycles that sources are blocked into account.

                              VIII. CONCLUSIONS

   We have compared state-of-the-art buffered (VC) and deflection (BLESS) flow control schemes. We improve the bufferless network by proposing MDR to reduce deflections. This reduces average latency by 5% in an 8×8 2D mesh, compared to DOR. We also assume efficient SRAM-based buffers that are bypassed if they are empty and there is no contention. The deflection network with MDR consumes less power up to a flit injection rate of 7%. However, it never consumes less than 98.7% of the power of the VC network. Networks constantly operating at low injection rates are likely overdesigned, as they do not need such large datapaths.
   In the same 8×8 2D mesh, VC flow control provides a 12% smaller average latency compared to deflection flow control. At a flit injection rate of 20%, the average VC network blocking flit latency is 0.75 cycles with a standard deviation of 1.18, while for the deflection network the average deflection latency is 4.87 cycles with a standard deviation of 8.09. The VC network achieves a 21% higher throughput per unit power. Furthermore, the BLESS allocator has an 81% larger cycle time than a separable input-first round-robin speculative switch allocator. Finally, bufferless flow control needs large buffering or extra complexity at network destinations in the presence of a communication protocol.
   Our work extends previous research on deflection flow control by performing a comprehensive comparison with buffered flow control. Our main contribution is providing insight and improving the understanding of the issues faced by deflection flow control.
   Our results show that unless process constraints lead to excessively costly buffers, the performance, cost and complexity penalties outweigh the potential gains from removing the router buffers. Even for the limited operation range where the bufferless network consumes less energy, the savings are negligible (up to 1.5%) and are accompanied by the shortcomings presented in this paper. Therefore, we believe that design effort should be spent on more efficient buffers before considering bufferless flow control.

                              ACKNOWLEDGEMENTS

   We sincerely thank Daniel Becker, Nathan Binkert, Jung Ho Ahn and the anonymous reviewers for their useful comments. This work was supported by the National Science Foundation under Grant CCF-0702341, the National Security Agency under Contract H98230-08-C-0272, the Stanford Pervasive Parallelism Lab, and the Gigascale Systems Research Center (FCRP/GSRC). George Michelogiannakis was supported by a Robert Bosch Stanford Graduate Fellowship. Daniel Sanchez was supported by a Fundacion Caja Madrid Fellowship and a Hewlett-Packard Stanford School of Engineering Fellowship.

                              REFERENCES

 [1] J. Balfour and W. J. Dally, “Design tradeoffs for tiled CMP on-chip networks,” in Proc. of the 20th annual Intl. Conf. on Supercomputing, 2006.
 [2] P. Baran, “On distributed communication networks,” in IEEE Trans. on communication systems, 1964.
 [3] D. U. Becker and W. J. Dally, “Allocator implementations for network-on-chip routers,” in Proc. of the Conf. on High Performance Computing Networking, Storage and Analysis, 2009.
 [4] C. Busch, M. Herlihy, and R. Wattenhofer, “Routing without flow control,” in Proc. of the 13th annual ACM Symp. on Parallel Algorithms and Architectures, 2001.
 [5] W. J. Dally and B. Towles, Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers, 2003.
 [6] W. J. Dally, “Virtual-channel flow control,” IEEE Trans. on Parallel and Distributed Systems, vol. 3, no. 2, 1992.
 [7] W. J. Dally and B. Towles, “Route packets, not wires: On-chip interconnection networks,” in Proc. of the 38th annual Design Automation Conf., 2001.
 [8] C. Gomez, M. Gomez, P. Lopez, and J. Duato, “BPS: A bufferless switching technique for NoCs,” in Workshop on Interconnection Network Architectures, 2008.
 [9] C. Gómez, M. E. Gómez, P. López, and J. Duato, “Reducing packet dropping in a bufferless NoC,” in Proc. of the 14th intl. Euro-Par conf. on Parallel Processing, 2008.
[10] A. Hansson, K. Goossens, and A. Rădulescu, “Avoiding message-dependent deadlock in network-based systems on chip,” VLSI Design, 2007.
[11] R. Ho, K. Mai, and M. Horowitz, “Efficient on-chip global interconnects,” in Symp. on VLSI Circuits, 2003.
[12] International Technology Roadmap for Semiconductors, 2007 Edition, www.itrs.net.
[13] A. Kahng, B. Li, L.-S. Peh, and K. Samadi, “Orion 2.0: A fast and accurate NoC power and area model for early-stage design space exploration,” in Proc. of the conf. on Design, Automation and Test in Europe, 2009.
[14] J. Kim, W. J. Dally, and D. Abts, “Flattened butterfly: a cost-efficient topology for high-radix networks,” in Proc. of the 34th annual Intl. Symp. on Computer Architecture, 2007.
[15] J. Liu, L.-R. Zheng, and H. Tenhunen, “A guaranteed-throughput switch for network-on-chip,” in Proc. of the Intl. Symp. on System-on-Chip, 2003.
[16] Z. Lu, M. Zhong, and A. Jantsch, “Evaluation of on-chip networks using deflection routing,” in Proc. of the 16th ACM Great Lakes symp. on VLSI, 2006.
[17] G. Michelogiannakis, J. Balfour, and W. J. Dally, “Elastic buffer flow control for on-chip networks,” in Proc. of the 15th Intl. Symp. on High-Performance Computer Architecture, 2009.
[18] T. Moscibroda and O. Mutlu, “A case for bufferless routing in on-chip networks,” in Proc. of the 36th annual Intl. Symp. on Computer Architecture, 2009.
[19] C. Nicopoulos, D. Park, J. Kim, N. Vijaykrishnan, M. S. Yousif, and C. R. Das, “ViChaR: A dynamic virtual channel regulator for network-on-chip routers,” in Proc. of the 39th annual Intl. Symp. on Microarchitecture, 2006.
[20] C. Nicopoulos, A. Yanamandra, S. Srinivasan, V. Narayanan, and M. J. Irwin, “Variation-aware low-power buffer design,” in Proc. of The Asilomar Conf. on Signals, Systems, and Computers, 2007.
[21] L.-S. Peh and W. J. Dally, “A delay model and speculative architecture for pipelined routers,” in Proc. of the 7th Intl. Symp. on High-Performance Computer Architecture, 2001.
[22] D. Sanchez, G. Michelogiannakis, and C. Kozyrakis, “An analysis of interconnection networks for large scale chip-multiprocessors,” ACM Trans. on Arch. and Code Opt., vol. 7, no. 1, 2010.
[23] H.-S. Wang, X. Zhu, L.-S. Peh, and S. Malik, “Orion: a power-performance simulator for interconnection networks,” in Proc. of the 35th annual ACM/IEEE Intl. Symp. on Microarchitecture, 2002.
[24] H. Wang, L.-S. Peh, and S. Malik, “Power-driven design of router microarchitectures in on-chip networks,” in Proc. of the 36th annual IEEE/ACM Intl. Symp. on Microarchitecture, 2003.