Computer Network Concise
A Concise Introduction
Synthesis Lectures on
Communication Networks
Editor
Jean Walrand, University of California, Berkeley
Synthesis Lectures on Communication Networks is an ongoing series of 80- to 160-page publications
on topics in the design, implementation, and management of communication networks. Each
lecture is a self-contained presentation of one topic by a leading expert. The topics range from
algorithms to hardware implementations and cover a broad spectrum of issues from security to
multiple-access protocols. The series addresses technologies from sensor networks to
reconfigurable optical networks.
The series is designed to:
Provide the best available presentations of important aspects of communication networks.
Help engineers and advanced students keep up with recent developments in a rapidly
evolving technology.
Facilitate the development of courses in this field.
Communication Networks: A Concise Introduction
Jean Walrand and Shyam Parekh
2010
Path Problems in Networks
John S. Baras and George Theodorakopolous
2010
Performance Modeling, Loss Networks, and Statistical Multiplexing
Ravi R. Mazumdar
2009
Network Simulation
Richard M. Fujimoto, Kalyan S. Perumalla, and George F. Riley
2006
Copyright 2010 by Morgan & Claypool
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means (electronic, mechanical, photocopy, recording, or any other) except for brief quotations in
printed reviews, without the prior permission of the publisher.
Communication Networks: A Concise Introduction
Jean Walrand and Shyam Parekh
www.morganclaypool.com
ISBN: 9781608450947 paperback
ISBN: 9781608450954 ebook
DOI 10.2200/S00241ED1V01Y201002CNT004
A Publication in the Morgan & Claypool Publishers series
SYNTHESIS LECTURES ON COMMUNICATION NETWORKS
Lecture #4
Series Editor: Jean Walrand, University of California, Berkeley
Series ISSN
Synthesis Lectures on Communication Networks
Print 1935-4185 Electronic 1935-4193
Communication Networks
A Concise Introduction
Jean Walrand
University of California, Berkeley
Shyam Parekh
Bell Labs, Alcatel-Lucent
SYNTHESIS LECTURES ON COMMUNICATION NETWORKS #4
Morgan & Claypool Publishers
ABSTRACT
This book results from many years of teaching an upper division course on communication networks
in the EECS department at the University of California, Berkeley. It is motivated by the perceived
need for an easily accessible textbook that puts emphasis on the core concepts behind current and
next generation networks. After an overview of how today's Internet works and a discussion of the
main principles behind its architecture, we discuss the key ideas behind Ethernet, WiFi networks,
routing, internetworking and TCP. To make the book as self-contained as possible, brief discussions
of probability and Markov chain concepts are included in the appendices. This is followed by a brief
discussion of mathematical models that provide insight into the operations of network protocols.
Next, the main ideas behind the new generation of wireless networks based on WiMAX and LTE, and
the notion of QoS are presented. A concise discussion of the physical layer technologies underlying
various networks is also included. Finally, a sampling of topics is presented that may have significant
influence on the future evolution of networks including overlay networks like content delivery and
peer-to-peer networks, sensor networks, distributed algorithms, Byzantine agreement and source
compression.
KEYWORDS
Internet, Ethernet, WiFi, Routing, Bellman-Ford algorithm, Dijkstra algorithm, TCP,
Congestion Control, Flow Control, QoS, WiMAX, LTE, Peer-to-Peer Networks
Contents
Preface
1 The Internet
1.1 Basic Operations
1.1.1 Hosts, Routers, Links
1.1.2 Packet Switching
1.1.3 Addressing
1.1.4 Routing
1.1.5 Error Detection
1.1.6 Retransmission of Erroneous Packets
1.1.7 Congestion Control
1.1.8 Flow Control
1.2 DNS, HTTP & WWW
1.2.1 DNS
1.2.2 HTTP & WWW
1.3 Summary
1.4 Problems
1.5 References
2 Principles
2.1 Sharing
2.2 Metrics
2.2.1 Link Rate
2.2.2 Link Bandwidth and Capacity
2.2.3 Throughput
2.2.4 Delay
2.2.5 Delay Jitter
2.2.6 M/M/1 Queue
2.2.7 Little's Result
2.2.8 Fairness
2.3 Scalability
2.3.1 Location-Based Addressing
2.3.2 Two-Level Routing
2.3.3 Best Effort Service
2.3.4 End-to-End Principle and Stateless Routers
2.3.5 Hierarchical Naming
2.4 Application and Technology Independence
2.4.1 Layers
2.5 Application Topology
2.5.1 Client/Server
2.5.2 P2P
2.5.3 Cloud Computing
2.5.4 Content Distribution
2.5.5 Multicast/Anycast
2.5.6 Push/Pull
2.5.7 Discovery
2.6 Summary
2.7 Problems
2.8 References
3 Ethernet
3.1 Typical Installation
3.2 History of Ethernet
3.2.1 Aloha Network
3.2.2 Cable Ethernet
3.2.3 Hub Ethernet
3.2.4 Switched Ethernet
3.3 Addresses
3.4 Frame
3.5 Physical Layer
3.6 Switched Ethernet
3.6.1 Example
3.6.2 Learning
3.6.3 Spanning Tree Protocol
3.7 Aloha
3.7.1 Time-Slotted Version
3.8 Non-Slotted Aloha
3.9 Hub Ethernet
3.9.1 Maximum Collision Detection Time
3.10 Appendix: Probability
3.10.1 Probability
3.10.2 Additivity for Exclusive Events
3.10.3 Independent Events
3.10.4 Slotted Aloha
3.10.5 Non-Slotted Aloha
3.10.6 Waiting for Success
3.10.7 Hub Ethernet
3.11 Summary
3.12 Problems
3.13 References
4 WiFi
4.1 Basic Operations
4.2 Medium Access Control (MAC)
4.2.1 MAC Protocol
4.2.2 Enhancements for Medium Access
4.2.3 MAC Addresses
4.3 Physical Layer
4.4 Efficiency Analysis of MAC Protocol
4.4.1 Single Device
4.4.2 Multiple Devices
4.5 Appendix: Markov Chains
4.6 Summary
4.7 Problems
4.8 References
5 Routing
5.1 Domains and Two-Level Routing
5.1.1 Scalability
5.1.2 Transit and Peering
5.2 Inter-Domain Routing
5.2.1 Path Vector Algorithm
5.2.2 Possible Oscillations
5.2.3 Multi-Exit Discriminators
5.3 Intra-Domain Shortest Path Routing
5.3.1 Dijkstra's Algorithm and Link State
5.3.2 Bellman-Ford and Distance Vector
5.4 Anycast, Multicast
5.4.1 Anycast
5.4.2 Multicast
5.4.3 Forward Error Correction
5.4.4 Network Coding
5.5 Ad Hoc Networks
5.5.1 AODV
5.5.2 OLSR
5.5.3 Ant Routing
5.5.4 Geographic Routing
5.5.5 Backpressure Routing
5.6 Summary
5.7 Problems
5.8 References
6 Internetworking
6.1 Objective
6.2 Basic Components: Mask, Gateway, ARP
6.2.1 Addresses and Subnets
6.2.2 Gateway
6.2.3 DNS Server
6.2.4 ARP
6.2.5 Configuration
6.3 Examples
6.3.1 Same Subnet
6.3.2 Different Subnets
6.3.3 Finding IP Addresses
6.3.4 Fragmentation
6.4 DHCP
6.5 NAT
6.6 Summary
6.7 Problems
6.8 References
7 Transport
7.1 Transport Services
7.2 Transport Header
7.3 TCP States
7.4 Error Control
7.4.1 Stop-and-Wait
7.4.2 Go Back N
7.4.3 Selective Acknowledgments
7.4.4 Timers
7.5 Congestion Control
7.5.1 AIMD
7.5.2 Refinements: Fast Retransmit and Fast Recovery
7.5.3 Adjusting the Rate
7.5.4 TCP Window Size
7.5.5 Terminology
7.6 Flow Control
7.7 Summary
7.8 Problems
7.9 References
8 Models
8.1 The Role of Layers
8.2 Congestion Control
8.2.1 Fairness vs. Throughput
8.2.2 Distributed Congestion Control
8.3 Dynamic Routing and Congestion Control
8.4 Appendix: Justification for Primal-Dual Theorem
8.5 Summary
8.6 Problems
8.7 References
9 WiMAX & LTE
9.1 Technology Evolution
9.2 Key Aspects of WiMAX
9.2.1 OFDMA
9.2.2 Quality of Service (QoS) Classes
9.2.3 Schedulers
9.2.4 Handovers
9.2.5 Miscellaneous WiMAX Features
9.3 Key Aspects of LTE
9.3.1 LTE Architecture
9.3.2 Physical Layer
9.3.3 QoS Support
9.4 Summary
9.5 Problems
9.6 References
10 QoS
10.1 Overview
10.2 Traffic Shaping
10.2.1 Leaky Buckets
10.2.2 Delay Bounds
10.3 Scheduling
10.3.1 GPS
10.3.2 WFQ
10.4 Regulated Flows and WFQ
10.5 End-to-End QoS
10.6 End-to-End Admission Control
10.7 Net Neutrality
10.8 Summary
10.9 Problems
10.10 References
11 Physical Layer
11.1 How to Transport Bits?
11.2 Link Characteristics
11.3 Wired and Wireless Links
11.3.1 Modulation Schemes: BPSK, QPSK, QAM
11.3.2 Inter-Cell Interference and OFDM
11.4 Optical Links
11.4.1 Operation of Fiber
11.4.2 OOK Modulation
11.4.3 Wavelength Division Multiplexing
11.4.4 Optical Switching
11.4.5 Passive Optical Network
11.5 Summary
11.6 References
12 Additional Topics
12.1 Overlay Networks
12.1.1 Applications: CDN and P2P
12.1.2 Routing in Overlay Networks
12.2 How Popular P2P Protocols Work
12.2.1 1st Generation: Server-Client based
12.2.2 2nd Generation: Centralized Directory based
12.2.3 3rd Generation: Purely Distributed
12.2.4 Advent of Hierarchical Overlay - Super Nodes
12.2.5 Advanced Distributed File Sharing: BitTorrent
12.3 Sensor Networks
12.3.1 Design Issues
12.4 Distributed Applications
12.4.1 Bellman-Ford Routing Algorithm
12.4.2 TCP
12.4.3 Power Adjustment
12.5 Byzantine Agreement
12.5.1 Agreeing over an Unreliable Channel
12.5.2 Consensus in the Presence of Adversaries
12.6 Source Compression
12.7 Summary
12.8 References
Bibliography
Authors' Biographies
Index
Preface
These lecture notes are based on an upper division course on communication networks that
the authors have taught in the Department of Electrical Engineering and Computer Sciences of the
University of California at Berkeley.
Over the thirty years that we have taught this course, networks have evolved from the early
Arpanet and experimental versions of Ethernet to a global Internet with broadband wireless access
and new applications from social networks to sensor networks.
We have used many textbooks over these years. The goal of the notes is to be more faithful
to the actual material we present. In a one semester course, it is not possible to cover an 800-page
book. Instead, in the course and in these notes we focus on the key principles that we believe the
students should understand. We want the course to show the forest as much as the trees. Networking
technology keeps on evolving. Our students will not be asked to re-invent TCP/IP. They need a
conceptual understanding to continue inventing the future.
Many colleagues take turns teaching the Berkeley course. This rotation keeps the material
fresh and broad in its coverage. It is a pleasure to acknowledge the important contributions to the
material presented here of Kevin Fall, Randy Katz, Steve McCanne, Abhay Parekh, Vern Paxson,
Scott Shenker, Ion Stoica, David Tse, and Adam Wolicz. We also thank the many teaching assistants
who helped us over the years and the inquisitive Berkeley students who always keep us honest.
We are grateful to reviewers of early drafts of this material. In particular, Assane Gueye, Libin
Jiang, Jiwoong Lee, Steven Low, John Musacchio, Jennifer Rexford, and Nikhil Shetty provided useful
constructive comments. We thank Mike Morgan of Morgan & Claypool for his encouragement
and his help in getting this text reviewed and published.
The first author was supported in part by NSF and by a MURI grant from ARO during the
writing of this book. The second author is thankful to his management at Bell Labs, Alcatel-Lucent
for their encouragement.
Most importantly, as always, we are deeply grateful to our families for their unwavering
support.
Jean Walrand and Shyam Parekh
Berkeley, February 2010
CHAPTER 1
The Internet
The Internet grew from a small experiment in the late 1960s to a network that connects a billion users
and has become society's main communication system. This phenomenal growth is rooted in the
architecture of the Internet that makes it scalable, flexible, and extensible, and provides remarkable
economies of scale. In this chapter, we explain how the Internet works.
1.1 BASIC OPERATIONS
The Internet delivers information by first arranging it into packets. This section explains how packets
get to their destination, and how the network corrects errors and controls congestion.
1.1.1 HOSTS, ROUTERS, LINKS
The Internet consists of hosts and routers attached to each other with links. The hosts are sources
or sinks of information. The name "hosts" indicates that these devices host the applications that
generate or use the information they exchange over the network. Hosts include computers, printers,
servers, web cams, etc. The routers receive information and forward it to the next router or host. A
link transports bits between two routers or hosts. Links are either optical, wired (including cable),
or wireless. Some links are more complex and involve switches, as we study later. Figure 1.1 shows
a few hosts and routers attached with links. The clouds represent other sets of routers and links.
1.1.2 PACKET SWITCHING
The original motivation for the Internet was to build a network that would be robust against attacks
on some of its parts. The initial idea was that, should part of the network get disabled, routers
would reroute information automatically along alternate paths. This flexible routing is based on
packet switching. Using packet switching, the network transports bits grouped in packets. A packet is
a string of bits arranged according to a specified format. An Internet packet contains its source and
destination addresses. Figure 1.1 shows a packet with its source address A and destination address
B. Switching refers to the selection of the set of links that a packet follows from its source to its
destination. Packet switching means that the routers make this selection individually for each packet.
In contrast, the telephone network uses circuit switching where it selects the set of links only once
for a complete telephone conversation.
[Figure 1.1 diagram: hosts A = 169.229.60.32 (coeus.eecs.berkeley.edu), B = 110.27.36.91, 18.7.25.81 (sloan.mit.edu), 18.128.33.11, and 128.132.154.46 are attached through routers R1, R2, and R3. Clouds and router outputs are labeled with the address prefixes 128, 169, 18, 18.64, 18.128, and 64. Router R2 reaches prefix 18.64 via link L1 and prefix 18.128 via link L2. The packet from A to B has the format [A|B| ... |CRC].]
Figure 1.1: Hosts, routers, and links. Each host has a distinct location-based 32-bit IP address. The
packet header contains the source and destination addresses and an error checksum. The routers
maintain routing tables that specify the output for the longest prefix match of the destination
address.
1.1.3 ADDRESSING
Every computer or other host attached to the Internet has a unique address specified by a 32-bit
string called its IP address, for Internet Protocol Address. The addresses are conventionally written in
the form a.b.c.d where a, b, c, d are the decimal values of the four bytes. For instance, 169.229.60.32
corresponds to the four bytes 10101001.11100101.00111100.00100000.
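The conversion between the dotted-decimal notation and the underlying 32 bits can be sketched in a few lines of Python. This is only an illustration; the function name ip_to_bits is an assumption introduced here and is not part of any standard.

```python
def ip_to_bits(address: str) -> str:
    """Return the four bytes of a dotted-decimal IPv4 address in binary.

    Illustrative helper (hypothetical name): each of the four decimal
    values a, b, c, d is written as an 8-bit string.
    """
    parts = address.split(".")
    if len(parts) != 4:
        raise ValueError("expected four dot-separated decimal values")
    values = [int(p) for p in parts]
    if any(v < 0 or v > 255 for v in values):
        raise ValueError("each value must fit in one byte (0 to 255)")
    # Format every byte as 8 binary digits, keeping the dots for readability.
    return ".".join(f"{v:08b}" for v in values)


# The example address from the text.
print(ip_to_bits("169.229.60.32"))  # 10101001.11100101.00111100.00100000
```

Removing the dots from the result gives the 32-bit string that routers compare against prefixes.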
1.1.4 ROUTING
Each router determines the next hop for the packet from the destination address. While advancing
towards the destination, within a network under the control of a common administrator, the packets
essentially follow the shortest path.¹ The routers regularly compute these shortest paths and record
them as routing tables.² A routing table specifies the next hop for each destination address, as
sketched in Figure 1.2.
¹ The packets typically go through a set of networks that belong to different organizations. The routers select this set of networks
according to rules that we discuss in the Routing chapter.
² More precisely, the router consults a forwarding table that indicates the output port of the packet. However, this distinction is not
essential.
[Figure 1.2 diagram: a router with input ports 1 to 32 and output ports 1 to 32. An arriving packet carries a header [S | D | ...] with its source S and destination D. The routing table maps destination D to port 1, destination F to port 2, and destination K to port 16.]
Figure 1.2: The figure sketches a router with 32 input ports (link attachments) and 32 output ports.
Packets contain their source and destination addresses. A routing table specifies, for each destination,
the corresponding output port for the packet.
To simplify the routing tables, the network administrators assign IP addresses to hosts based
on their location. For instance, router R1 in Figure 1.1 sends all the packets with a destination
address whose first byte has decimal value 18 to router R2 and all the packets with a destination
address whose first byte has decimal value 64 to router R3. Instead of having one entry for every
possible destination address, a router has one entry for a set of addresses with a common initial bit
string, or prefix. If one could assign addresses so that all the destinations that share the same initial
five bits were reachable from the same port of a 32-port router, then the routing table of the router
would need only 32 entries of 5 bits: each entry would specify the initial five bits that correspond
to each port. In practice, the assignment of addresses is not perfectly regular, but it nevertheless
considerably reduces the size of routing tables. This arrangement is quite similar to the organization
of telephone numbers into [country code, area code, zone, number]. For instance, the number 1 510
642 1529 corresponds to a telephone set in the US (1), in Berkeley (510), in the zone of the Berkeley
campus (642).
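The idealized 32-port router described above can be sketched as follows. The assignment of prefixes to ports (the 5-bit value k goes to port k + 1) and the names first_bits, TABLE, and output_port are assumptions made purely for illustration.

```python
def first_bits(address: str, n: int) -> str:
    """Return the first n bits of a dotted-decimal IPv4 address."""
    bits = "".join(f"{int(part):08b}" for part in address.split("."))
    return bits[:n]


# Hypothetical perfectly regular assignment: the 5-bit prefix with
# value k (0..31) is reachable through output port k + 1 (1..32).
TABLE = {f"{k:05b}": k + 1 for k in range(32)}


def output_port(address: str) -> int:
    """Look up the output port from the first five bits only."""
    return TABLE[first_bits(address, 5)]


# 169 = 10101001, so the first five bits are 10101 (decimal 21).
print(len(TABLE), output_port("169.229.60.32"))
```

The table has 32 entries regardless of how many hosts exist, which is the point of location-based addressing: the table grows with the number of prefixes, not with the number of destinations.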
The general approach to exploit location-based addressing is to find the longest common
initial string of bits (called prefix) in the addresses that are reached through the same next hop.
This scheme, called Classless Interdomain Routing (or CIDR), makes it possible to arrange the addresses into
subgroups identified by prefixes in a flexible way. The main difference with the telephone numbering
scheme is that, in CIDR, the length of the prefix is not predetermined, thus providing more flexibility.
As an illustration of longest prefix match routing, consider how router R2 in Figure 1.1 selects
where to send packets. A destination address that starts with the bits 000100101 matches the first 9
bits of the prefix 18.128 = 0001001010000000 of output link L2 but only the first 8 bits of the prefix
18.64 = 0001001001000000 of output link L1. Consequently, a packet with destination address
18.128.33.11 leaves R2 via link L2. Similarly, a packet with destination address 18.7.25.81, which
starts with the bits 000100100, matches 9 bits of the prefix 18.64 but only 8 bits of the prefix 18.128,
so it leaves R2 via link L1. Summarizing, the router finds the prefix in its routing table that has the longest
match with the initial bits of the destination address of a packet. That prefix determines the output
port of the packet.
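The selection made by router R2 can be reproduced with a short sketch. It follows the simplified description in the text, counting how many leading bits the destination shares with each prefix; routers implementing CIDR additionally require that the entire prefix, of its declared length, match. The table R2_TABLE and the function names are assumptions introduced for this illustration.

```python
def to_bits(dotted: str) -> str:
    """Concatenate the bytes of a dotted-decimal string as binary digits."""
    return "".join(f"{int(part):08b}" for part in dotted.split("."))


def shared_leading_bits(a: str, b: str) -> int:
    """Count the leading positions where two bit strings agree."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


# Hypothetical routing table of R2, taken from the prefixes in Figure 1.1.
R2_TABLE = {"18.128": "L2", "18.64": "L1"}


def select_link(destination: str, table=R2_TABLE) -> str:
    """Return the output link whose prefix has the longest match."""
    dest_bits = to_bits(destination)
    best = max(table, key=lambda p: shared_leading_bits(dest_bits, to_bits(p)))
    return table[best]


print(select_link("18.128.33.11"))  # L2
print(select_link("18.7.25.81"))    # L1
```

For 18.7.25.81, the destination shares 9 leading bits with 18.64 and 8 with 18.128, which is why the sketch returns L1, in agreement with the text.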
1.1.5 ERRORDETECTION
A node sends the bits of a packet to the next node by first converting them into electrical or optical
signals. The receiving node converts the signals back into bits. This process is subject to errors caused
by random fluctuations in the signals. Thus, it occasionally happens that some bits in a packet get
corrupted, which corresponds to a transmission error.
A simple scheme to detect errors is for the source to add one bit, called the parity bit, to the
packet so that the number of ones is even. For instance, if the packet is 00100101, the sending
node adds a parity bit equal to 1 so that the packet becomes 001001011 and has an even number
of ones. If the receiver gets a packet with an odd number of ones, say 001101011, it knows that a
transmission error occurred. This simple parity bit scheme cannot detect errors that modify two, or any
even number, of bits during the transmission. This is why the Internet uses a similar error detection
code with more bits for the packet headers, referred to as the header checksum. When using the header checksum,
the sending node calculates the checksum bits (typically 16 bits) from the other fields in the header.
The receiving node performs the same calculation and compares the result with the error detection
bits in the packet; if they differ, the receiving node knows that some error occurred in the packet
header, and it discards the corrupted packet. (See footnote 3.)
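The parity example above, and its blind spot for an even number of bit errors, can be checked with a short Python sketch. The function names are ours. The 16-bit checksum shown is the one's-complement sum used for the IPv4 header; the text does not describe the calculation, so treat that part as an assumption about which checksum is meant.

```python
def add_parity(bits: str) -> str:
    """Append one bit so that the total number of ones is even."""
    return bits + str(bits.count("1") % 2)


def parity_ok(bits: str) -> bool:
    """Accept a received packet only if it has an even number of ones."""
    return bits.count("1") % 2 == 0


def checksum16(data: bytes) -> int:
    """16-bit one's-complement checksum (assumed IPv4-style calculation)."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole number of words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF


sent = add_parity("00100101")                # the example packet of the text
print(sent, parity_ok(sent))                 # correct packet is accepted
print(parity_ok("001101011"))                # one flipped bit is detected
print(parity_ok("111001011"))                # two flipped bits go unnoticed
```

With this checksum, the receiver can recompute the checksum over the header including the checksum field; the result is zero when no error is detected.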
1.1.6 RETRANSMISSION OF ERRONEOUS PACKETS
In addition to dropping packets whose header is corrupted by transmission errors, a router along
the path may discard arriving packets when it runs out of memory to temporarily store them before
forwarding. This event occurs when packets momentarily arrive at a router faster than it can forward
them to their next hop. Such packet losses are said to be due to congestion, as opposed to transmission
errors.
Footnote 3: We explain in the Transport chapter that the packet may also contain a checksum that the source calculates on the complete
packet and that the destination checks to make sure that it does not ignore errors in the rest of the packet.
To implement a reliable delivery, the source and destination use a mechanism that guarantees
that the source retransmits the packets that do not reach the destination without errors. Such a
scheme is called an automatic retransmission request or ARQ. The basic idea of this mechanism is that
the destination acknowledges all the correct packets it receives, and the source retransmits packets
that the destination did not acknowledge within a specific amount of time. Say that the source sends
the packets 1, 2, 3, 4 and that the destination does not get packet 2. After a while, the source notices
that the destination did not acknowledge packet 2 and retransmits a copy of that packet. We discuss
the specific implementation of this mechanism in the Internet in the chapter on the transport protocol.
Note that the source and destination hosts arrange for retransmissions, not the routers.
1.1.7 CONGESTION CONTROL
Imagine that many hosts send packets that go through a common link in the network. If the hosts send
packets too fast, the link cannot handle them all and the router with that outgoing link must discard
some packets. To prevent an excessive number of discarded packets, the hosts slow down when they
miss acknowledgments. That is, when a host has to retransmit a packet whose acknowledgment
failed to arrive, it assumes that a congestion loss occurred and slows down the rate at which it sends
packets.
Eventually, congestion subsides and losses stop. As long as the hosts get their acknowledgments
in a timely manner, they slowly increase their packet transmission rate to converge to the maximum
rate that can be supported by the prevailing network conditions. This scheme, called congestion
control, automatically adjusts the transmission of packets so that the network links are fully utilized
while limiting the congestion losses.
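The text does not specify by how much a host slows down or speeds up. As a minimal sketch, assume that a host adds a fixed amount to its rate after each acknowledged round and halves its rate after a missed acknowledgment. These rules, the parameter values, and the function names are assumptions made for illustration only.

```python
def adjust_rate(rate, ack_received, increase=1.0, decrease=0.5, min_rate=1.0):
    """Return the new sending rate after one round (assumed rules)."""
    if ack_received:
        return rate + increase                 # slowly probe for more rate
    return max(min_rate, rate * decrease)      # back off after a loss


def simulate(capacity, rounds, start=1.0):
    """Rates used by one host; a round is lost when the rate exceeds capacity."""
    rate, history = start, []
    for _ in range(rounds):
        history.append(rate)
        rate = adjust_rate(rate, ack_received=(rate <= capacity))
    return history


print(simulate(capacity=10.0, rounds=20))
```

In this toy model, the rate climbs to the capacity of the link, overshoots, drops, and climbs again, so that it oscillates below and around the maximum rate that the link supports.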
1.1.8 FLOW CONTROL
If a fast device sends packets very rapidly to a slow device, the latter may be overwhelmed. To prevent
this phenomenon, the receiving device indicates, in each acknowledgment it sends back to the source,
the amount of free buffer space it has to receive additional bits. The source stops transmitting when
this available space is not larger than the number of bits the source has already sent and the receiver
has not yet acknowledged.
The source combines the flow control scheme with the congestion control scheme discussed
earlier. Note that flow control prevents overflowing the destination buffer whereas congestion control
prevents overflowing router buffers.
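The stopping rule of the source can be written as a one-line test. The sketch below uses names of our choosing and states only the rule given above: the source may send as long as the advertised free space exceeds the number of bits that are sent but not yet acknowledged.

```python
def sendable_bits(advertised_free_bits: int, unacked_bits: int) -> int:
    """Bits the source may still send without overflowing the receiver."""
    return max(0, advertised_free_bits - unacked_bits)


def may_send(advertised_free_bits: int, unacked_bits: int) -> bool:
    """The source stops when the free space is not larger than the unacked bits."""
    return sendable_bits(advertised_free_bits, unacked_bits) > 0


print(may_send(80_000, 40_000))   # True: room for 40,000 more bits
print(may_send(80_000, 80_000))   # False: the source must wait
```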
1.2 DNS, HTTP & WWW
1.2.1 DNS
The hosts attached to the Internet have a name in addition to an IP address. The names are easier to
remember (e.g., google.com). To send packets to a host, the source needs to know the IP address
of that host. The Internet has an automatic directory service called the Domain Name System, or DNS,
that translates the name into an IP address. DNS is a distributed directory service. The Internet is
decomposed into zones, and a separate DNS server maintains the addresses of the hosts in each zone.
For instance, the department of EECS at Berkeley maintains the directory server for the hosts in
the eecs.berkeley.edu zone of the network. The DNS server for that zone answers requests
for the IP address of hosts in that zone. Consequently, if one adds a host on the network of our
department, one needs to update only that DNS server.
1.2.2 HTTP & WWW
The World Wide Web is arranged as a collection of hyperlinked resources such as web pages, video
streams, and music files. The resources are identified by a Uniform Resource Locator or URL that
specifies a computer and a file in that computer together with the protocol that should deliver the
file.
For instance, the URL http://www.eecs.berkeley.edu/wlr.html specifies a home
page in a computer with name www.eecs.berkeley.edu and the protocol HTTP.
HTTP, the Hyper Text Transfer Protocol, specifies the request/response rules for getting the
file from a server to a client. Essentially, the protocol sets up a connection between the server and the
client, then requests the specific file, and finally closes the connection when the transfer is complete.
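The three parts of the URL can be extracted with Python's standard `urllib.parse` module, as sketched below. The request text that is built at the end is our illustration of what a simple HTTP/1.0 request for that file could look like; no connection is opened and nothing is sent.

```python
from urllib.parse import urlparse

url = urlparse("http://www.eecs.berkeley.edu/wlr.html")

protocol = url.scheme     # protocol that should deliver the file
host = url.hostname       # name of the computer, to be translated by DNS
path = url.path           # file in that computer

# Illustrative request that a client could send once connected to the host.
request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"

print(protocol, host, path)
print(request)
```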
1.3 SUMMARY
The Internet consists of hosts that send and/or receive information, routers, and links.
Each host has a 32-bit IP address and a name. DNS is a distributed directory service that
translates the name into an IP address.
The hosts arrange the information into packets that are groups of bits with a specified format.
A packet includes its source and destination addresses and error detection bits.
The routers calculate the shortest paths (essentially) to destinations and store them in routing
tables. The IP addresses are based on the location of the hosts to reduce the size of routing
tables using longest prefix match.
A source adjusts its transmissions to avoid overflowing the destination buffer (flow control)
and the buffers of the routers (congestion control).
Hosts remedy transmission and congestion losses by using acknowledgments, timeouts, and
retransmissions.
1.4 PROBLEMS
P1.1 How many hosts can one have on the Internet if each one needs a distinct IP address?
P1.2 If the addresses were allocated arbitrarily, how many entries should a routing table have?
P1.3 Imagine that all routers have 16 ports. In the best allocation of addresses, what is the size of
the routing table required in each router?
P1.4 Assume that a host A in Berkeley sends a stream of packets to a host B in Boston. Assume
also that all links operate at 100Mbps and that it takes 130ms for the first acknowledgment
to come back after A sends the first packet. Say that A sends one packet of 1KByte and then
waits for an acknowledgment before sending the next packet, and so on. What is the long-term
average bit rate of the connection? Assume now that A sends N packets before it waits for the
first acknowledgment, and that A sends the next packet every time an acknowledgment is
received. Express the long-term average bit rate of the connection as a function of N. [Note:
1Mbps = $10^6$ bits per second; 1ms = 1 millisecond = $10^{-3}$ second.]
P1.5 Assume that a host A in Berkeley sends 1-KByte packets with a bit rate of 100Mbps to a host
B in Boston. However, B reads the bits only at 1 Mbps. Assume also that the device in Boston
uses a buffer that can store 10 packets. Explore the flow control mechanism and provide a time
line of the transmissions.
1.5 REFERENCES
Packet switching was independently invented in the early 1960s by Paul Baran (5), Donald Davies,
and Leonard Kleinrock who observed, in his MIT Thesis (25), that packet-switched networks can
be analyzed using queuing theory. Bob Kahn and Vint Cerf invented the basic structure of TCP/IP
in 1973 (26). The congestion control of TCP was corrected by Van Jacobson in 1988 (22) based
on the analysis by Chiu and Jain (11) and (12). Paul Mockapetris invented DNS in 1983. CIDR is
described in (67). Tim Berners-Lee invented the WWW in 1989.
CHAPTER 2
Principles
In networking, connectivity is the name of the game. The Internet connects about one billion
computers across the world, plus associated devices such as printers, servers, and web cams. By
connect, we mean that the Internet enables these hosts to transfer files or bit streams among each
other. To reduce cost, all these hosts share communication links. For instance, many users in the same
neighborhood may share a coaxial cable to connect to the Internet; many communications share the
same optical fibers. With this sharing, the network cost per user is relatively small.
To accommodate its rapid growth, the Internet is organized in a hierarchical way and adding
hosts to it only requires local modifications. Moreover, the network is independent of the applications
that it supports. That is, only the hosts need the application software. For instance, the Internet
supports video conferences even though its protocols were designed before video conferences existed.
In addition, the Internet is compatible with new technologies such as wireless communications or
improvements in optical links.
We comment on these important features in this chapter. In addition, we introduce some
metrics that quantify the performance characteristics of a network.
2.1 SHARING
Consider the roughly 250 million computers in the USA and imagine designing a network to
interconnect them. It is obviously not feasible to connect each pair of computers with a dedicated
link. Instead, one attaches computers that are close to each other with a local network. One then
attaches local networks together with an access network. One attaches the access networks together
in a regional network. Finally, one connects the regional networks together via one or more backbone
networks that go across the country, as shown in Figure 2.1.
This arrangement reduces considerably the total length of wires needed to attach the computers
together, when compared with pairwise links. Many computers share links and routers. For instance,
all the computers in the leftmost access network of the figure share the link L as they send information
to computers in another access network. What makes this sharing possible is the fact that the
computers do not transmit all the time. Accordingly, only a small fraction of users are active at any
given time and share the network links and routers, which reduces the investment required per user.
Figure 2.1: A hierarchy of networks. [Diagram not reproduced: local, access, regional, and backbone networks, with a link labeled L.]
As an example, if an average user sends data into the network for 20 minutes during the
eight-hour business day, then he is active only a fraction $20/(60 \times 8) = 1/24$ of the time. Assume
that link L in Figure 2.1 transmits bits at the rate of C bits per second (bps) and that A computers
in the access networks share that link. Then, on average, about A/24 computers are active and share
the transmission rate of C bps, and each can in principle transmit at the rate of $C/(A/24) = 24C/A$
bps. The factor 24 here is called the multiplexing gain. This factor measures the benefit of sharing a
link among many users.
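The arithmetic of this example is easy to reproduce. In the Python sketch below, the link rate of 100 Mbps and the 2,400 computers are hypothetical values that we chose for illustration; only the 20 minutes and the eight-hour day come from the example.

```python
def per_user_rate(link_rate_bps, computers, active_fraction):
    """Rate available to each active computer, on average."""
    active_computers = computers * active_fraction
    return link_rate_bps / active_computers


active_fraction = 20 / (60 * 8)          # 20 minutes out of 8 hours
multiplexing_gain = 1 / active_fraction  # about 24

# Hypothetical values: C = 100 Mbps shared by A = 2,400 computers.
rate = per_user_rate(100e6, 2_400, active_fraction)

print(round(multiplexing_gain, 6))       # 24.0
print(round(rate))                       # 1000000 bps, i.e., 24C/A
```

Without sharing, each of the 2,400 computers would get only C/A, about 41.7 Kbps, of dedicated rate.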
2.2 METRICS
To clarify the discussion of network characteristics, it is necessary to define precisely some basic
metrics of performance.
2.2.1 LINK RATE
A link is characterized by a rate. As an example, let us consider a DSL connection. It is characterized
by two rates: one uplink rate (from the user's device to the Internet) and one downlink rate (from
the Internet to the user's device). Typical values are 768Kbps for the downlink rate and 256Kbps for
the uplink rate. This means that the DSL link transmits bits at the indicated rate in the uplink and
downlink directions.
If the rate of a link is 1Mbps, then the transmitter can send a packet with 1,000 bits in
1ms. If it can send packets back-to-back without any gap, then the link can send a 1MByte le in
approximately 8 seconds.
A link that connects a user to the Internet is said to be broadband if its rate exceeds 100Kbps.
(This value is somewhat arbitrary and is not universal.) Otherwise, one says that the link is
narrowband.
2.2.2 LINK BANDWIDTH AND CAPACITY
A signal of the form $V(t) = A \sin(2\pi f_0 t)$ makes $f_0$ cycles per second. We say that its frequency is $f_0$
Hz. Here, Hz stands for Hertz and means one cycle per second. For instance, V(t) may be the voltage
at time t as measured across the two terminals of a telephone line. The physics of a transmission
line limits the set of frequencies of signals that it transports. The bandwidth of a communication
link measures the width of that range of frequencies. For instance, if a telephone line can transmit
signals over a range of frequencies from 300Hz to 1MHz ($= 10^6$ Hz), we say that its bandwidth is
about 1MHz.
The rate of a link is related to its bandwidth. Intuitively, if a link has a large bandwidth, it can
carry voltages that change very fast and it should be able to carry many bits per second, as different
bits are represented by different voltage values. The maximum rate of a link depends also on the
amount of noise on the link. If the link is very noisy, the transmitter should send bits more slowly.
This is similar to having to articulate more clearly when talking in a noisy room.
An elegant formula due to Claude Shannon indicates the relationship between the maximum
reliable link rate C, also called the Shannon Capacity of the link, its bandwidth W and the noise. That
formula is
$$C = W \log_2(1 + \mathrm{SNR}).$$
In this expression, SNR, the signal-to-noise ratio, is the ratio of the power of the signal at the
receiver divided by the power of the noise, also at the receiver. For instance, if $\mathrm{SNR} = 10^6 \approx 2^{20}$
and W = 1MHz, then we find that
$$C = 10^6 \log_2(1 + 10^6) \text{ bps} \approx 10^6 \log_2(2^{20}) \text{ bps} = 10^6 \times 20 \text{ bps} = 20\text{Mbps}.$$
This value is the theoretical limit that could be achieved, using the best possible
technique for representing bits by signals and the best possible method to avoid or correct errors.
An actual link never quite achieves the theoretical limit, but it may come close.
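The numerical example can be verified directly. The following Python sketch evaluates Shannon's formula; the function name is ours, and the second call, with a signal-to-noise ratio 100 times smaller, is an additional illustration that is not in the text.

```python
import math


def shannon_capacity(bandwidth_hz: float, snr: float) -> float:
    """Maximum reliable rate C = W log2(1 + SNR), in bits per second."""
    return bandwidth_hz * math.log2(1 + snr)


c = shannon_capacity(1e6, 1e6)        # W = 1 MHz, SNR = 10^6
print(round(c / 1e6, 2), "Mbps")      # close to 20 Mbps

# A weaker received signal (SNR divided by 100) lowers the capacity.
c_weak = shannon_capacity(1e6, 1e4)
print(round(c_weak / 1e6, 2), "Mbps")
```

The exact value for SNR = 10^6 is slightly below 20 Mbps because 10^6 is slightly smaller than 2^20.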
The formula confirms the intuitive fact that if a link is longer, then its capacity is smaller.
Indeed, the power of the signal at the receiver decreases with the length of the link (for a fixed
transmitter power). For instance, a DSL line that is very long has a smaller rate than a shorter line.
The formula also explains why a coaxial cable can have a larger rate than a telephone line if one knows
that the bandwidth of the former is wider.
In a more subtle way, the formula shows that if the transmitter has a given power, it should
allocate more power to the frequencies within its spectrum that are less noisy. Over a cello, one hears
a soprano better than a basso profundo. A DSL transmitter divides its power into small frequency
bands, and it allocates more power to the less noisy portions of the spectrum.
2.2.3 THROUGHPUT
Say that you download a large file from a server using the Internet. Assume that the file size is B
bits and that it takes T seconds to download the file. In that case, we say that the throughput of the
transfer is B/T bps. For instance, you may download an MP3 music file of 3MBytes = 24Mbits in
2 minutes, i.e., 120 seconds. The throughput is then 24Mbits/120s = 0.2Mbps or 200Kbps. (For
convenience, we approximate 1MByte by $10^6$ bytes even though it is actually $2^{20} = 1{,}048{,}576$
bytes. Recall also that one defines 1Mbps = $10^6$ bps and 1Kbps = $10^3$ bps.)
The link rate is not the same as the throughput. For instance, in the example above, your MP3
download had a throughput of 200Kbps. You might have been using a DSL link with a downlink
rate of 768Kbps. The difference comes from the fact that the download corresponds to a sequence
of packets and there are gaps between the packets.
Figure 2.2 shows two typical situations where the throughput is less than the link rate. The
left part of the figure shows a source S that sends packets to a destination D. A packet is said to be
outstanding if the sender has transmitted it but has not received its acknowledgment. Assume that
the sender can have up to three outstanding packets. The figure shows that the sender sends three
packets, then waits for the acknowledgment of the first one before it can send the fourth packet,
and so on. In the figure, the transmitter has to wait because the time to get an acknowledgment
(RTT, for round-trip time) is longer than the time the transmitter takes to send three packets.
Because of that waiting time, during which the transmitter does not send packets, the throughput of
the connection is less than the rate R of the link. Note that, in this example, increasing the allowed
number of outstanding packets increases the throughput until it becomes equal to the link rate. The
maximum allowed number of outstanding bytes is called the window size. Thus, the throughput is
limited by the window size divided by the round-trip time.
The right part of the gure shows devices A, C, D attached to a router B. Device A sends
packets at rate R. Half the packets go to C and the other half go to D. The throughput of the
connection from A to C is R/2 where R is the rate of the links. Thus, the two connections (from A
to C and from A to D) share the rate R of the link from A to B. This link is the bottleneck of the
system: it is the rate of that link that limits the throughput of the connections. Increasing the rate
of that link would increase the throughput of the connections.
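Figure 2.2: The throughput can be limited by the window size (left) and by a bottleneck link (right). [Diagram not reproduced: on the left, a source S sends packets to a destination D over a link of rate R, with a time axis showing the RTT; on the right, devices A, C, and D are attached to a router B by links of rate R.]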
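The two definitions of this section, the throughput B/T and the limit given by the window size divided by the round-trip time, can be sketched as follows in Python. The window of three 1,000-byte packets, the round-trip time of 100ms, and the link rate of 1Mbps are hypothetical values chosen for illustration.

```python
def throughput(file_bits: float, seconds: float) -> float:
    """Throughput B/T of a transfer, in bits per second."""
    return file_bits / seconds


def window_limited_throughput(window_bits, rtt_seconds, link_rate_bps):
    """The throughput cannot exceed the link rate or window / RTT."""
    return min(link_rate_bps, window_bits / rtt_seconds)


print(throughput(24e6, 120))                        # MP3 example: 200,000 bps

# Hypothetical connection: window of 3 packets of 8,000 bits, RTT of 0.1 s.
print(window_limited_throughput(3 * 8_000, 0.1, 1e6))   # about 240,000 bps
print(window_limited_throughput(100 * 8_000, 0.1, 1e6)) # link rate: 1,000,000 bps
```

In the second call, the window is the limit; in the third, the window is large enough that the link rate becomes the limit.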
2.2.4 DELAY
Delay refers to the elapsed time for a packet to traverse between two points of interest. If the two
points of interest are the two end points (e.g., the pair of communicating hosts), we refer to the
delay between these two points as the end-to-end delay. End-to-end delay typically comprises the
transmission and propagation times over the intervening links, and queuing and processing times
at the intervening nodes (e.g., routers and switches). Transmission time refers to the time it takes
for a packet to be transmitted over a link at the link rate, propagation time refers to the time for
a physical signal to propagate over a link, queuing time refers to the waiting time for a packet at
a node before it can be transmitted, and processing time refers to the time taken for performing
the required operations on a packet at a node. The concepts of queuing and transmission times are
further discussed in Sections 2.2.6 and 2.2.7.
2.2.5 DELAY JITTER
The successive packets that a source sends to a destination do not face exactly the same delay across
the network. One packet may reach the destination in 50ms whereas another one may take 120ms.
These fluctuations in delay are due to the variable amount of congestion in the routers. A packet may
arrive at a router that already has many other packets to transmit. Another packet may be lucky and
rarely have to wait behind other packets. One defines the delay jitter of a connection as the difference
between the longest and shortest delivery time among the packets of that connection. For instance,
if the delivery times of the packets of a connection range from 50ms to 120ms, the delay jitter of
that connection is 70ms.
Many applications such as streaming audio or video and voice-over-IP are sensitive to delays.
Those applications generate a sequence of packets that the network delivers to the destination for
playback. If the delay jitter of the connection is J, the destination should store the first packet that
arrives for at least J seconds before playing it back, so that the destination never runs out of packets
to play back. In practice, the value of the delay jitter is not known in advance. Typically, a streaming
application stores the packets for T seconds, say T = 4, before starting to play them back. If the
playback buffer gets empty, the application increases the value of T , and it buffers packets for T
seconds before playing them back. The initial value of T and the rule for increasing T depend on
the application. A small value of T is important for interactive applications such as voice over IP or
video conferences; it is less critical for one-way streaming such as Internet radio, IPTV, or YouTube.
2.2.6 M/M/1 QUEUE
To appreciate effects such as congestion, jitter, and multiplexing, it is convenient to recall a basic result
about delays in a queue. Imagine that customers arrive at a cash register and queue up until they can
be served. Assume that one customer arrives in the next second with probability $\lambda$, independently of
when previous customers arrived. Assume also that the cashier completes the service of a customer
in the next second with probability $\mu$, independently of how long he has been serving that customer
and of how long it took to serve the previous customers. That is, $\lambda$ customers arrive per second, on
average, and the cashier serves $\mu$ customers per second, on average, when there are customers to be
served. Note that the average service time per customer is $1/\mu$ second since the cashier can serve $\mu$
customers per second, on average. One defines the utilization of the system as the ratio $\rho = \lambda/\mu$.
Thus, the utilization is 80% if the arrival rate is equal to 80% of the service rate.
Such a system is called an M/M/1 queue. In this notation, the first M indicates that the arrival
process is memoryless: the next arrival occurs with probability $\lambda$ in the next second, no matter how
long it has been since the previous arrival. The second M indicates that the service is also memoryless.
The 1 in the notation indicates that there is only one server (the cashier).
Assume that $\lambda < \mu$ so that the cashier can keep up with the arriving customers. The basic
result is that the average time T that a customer spends waiting in line or being served is given by
$$T = \frac{1}{\mu - \lambda}.$$
If $\lambda$ is very small, then the queue is almost always empty when a customer arrives. In that case,
the average time T is approximately $1/\mu$ and is equal to the average time the cashier spends serving a customer.
Consequently, for any given $\lambda < \mu$, the difference $T - 1/\mu$ is the average queuing time that a customer
waits behind other customers before getting to the cashier.
Another useful result is that the average number L of customers in line or with the server is
given by
$$L = \frac{\lambda}{\mu - \lambda}.$$
Note that T and L grow without bound as $\lambda$ increases and approaches $\mu$.
To apply these results to communication systems, one considers that the customers are packets,
the cashier is a transmitter, and the waiting line is a buffer attached to the transmitter. The average
packet service time $1/\mu$ (in seconds) is the average packet length (in bits) divided by the link rate
(in bps). Equivalently, the service rate $\mu$ is the link rate divided by the average length of a packet.
Consider a computer that generates $\lambda$ packets per second, and assume that these packets arrive at a link that
can send $\mu$ packets per second. The formulas above provide the average delay T per packet and the
average number L of packets stored in the link's transmitter queue.
For concreteness, say that the line rate is 10 Mbps, that the packets are 1 Kbyte long, on
average, and that 1000 packets arrive at the transmitter per second, also on average. In this case, one
finds that
$$\mu = \frac{10^7}{8 \times 10^3} = 1{,}250 \text{ packets per second}.$$
Consequently,
$$T = \frac{1}{\mu - \lambda} = \frac{1}{1{,}250 - 1{,}000} = 4\text{ms} \quad \text{and} \quad L = \frac{\lambda}{\mu - \lambda} = \frac{1{,}000}{1{,}250 - 1{,}000} = 4.$$
Thus, an average packet transmission time is $1/\mu = 0.8$ms, the average delay through the system is
4 ms and four packets are in the buffer, on average. A typical packet that arrives at the buffer finds
four other packets already there; that packet waits for the four transmission times of those packets
then for its own transmission time before it is out of the system. The average delay corresponds to
five transmission times of 0.8ms each, so that T = 4ms. This delay T consists of the queuing time
3.2ms and one transmission time 0.8ms.
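These numbers can be reproduced with a short Python sketch of the two M/M/1 formulas. The function name `mm1` is ours.

```python
def mm1(arrival_rate: float, service_rate: float):
    """Average delay T (seconds) and average number L in an M/M/1 queue."""
    if arrival_rate >= service_rate:
        raise ValueError("the arrival rate must be below the service rate")
    delay = 1 / (service_rate - arrival_rate)
    backlog = arrival_rate / (service_rate - arrival_rate)
    return delay, backlog


link_rate_bps = 10e6                  # 10 Mbps
packet_bits = 8 * 1_000               # 1 Kbyte packets
mu = link_rate_bps / packet_bits      # 1,250 packets per second

T, L = mm1(1_000, mu)
print(mu, T, L)                       # 1250.0 0.004 4.0
print(T - 1 / mu)                     # queuing time, about 0.0032 s
```

The function rejects an arrival rate that is not smaller than the service rate, since the formulas do not apply in that case.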
Because of the randomness of the arrivals of packets and of the variability of the transmission
times, not all packets experience the same delay. With the M/M/1 model, it is, in principle, possible
for a packet to arrive at the buffer when it contains a very large number of packets. However, that is
not very likely. One can show that about 5% of the packets experience a delay larger than 12ms and
about 5% will face a delay less than 0.2ms. Thus, most of the packets have a delay between 0.2ms
and 12ms. One can then consider that the delay jitter is approximately 12ms, or three times the
average delay through the buffer.
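The two 5% figures can be checked with a standard property of the M/M/1 queue that is not derived in this text: when packets are served in their order of arrival, the delay T of a packet through the system is exponentially distributed with rate $\mu - \lambda$. With the values of the example, $\mu - \lambda = 250$ per second, and one finds:

```latex
\begin{align*}
P(T > t) &= e^{-(\mu - \lambda) t}, \qquad \mu - \lambda = 1{,}250 - 1{,}000 = 250 \text{ per second}, \\
P(T > 12\,\text{ms}) &= e^{-250 \times 0.012} = e^{-3} \approx 0.050, \\
P(T < 0.2\,\text{ms}) &= 1 - e^{-250 \times 0.0002} = 1 - e^{-0.05} \approx 0.049.
\end{align*}
```

Both probabilities are indeed close to 5%, so that about 90% of the packets have a delay between 0.2ms and 12ms.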
To appreciate the effect of congestion, assume now that the packets arrive at the rate of 1,150
packets per second. One then finds that T = 10ms and L = 11.5. In this situation, the transmission
time is still 0.8ms but the average queuing time is equal to 11.5 transmission times, or 9.2ms. The
delay jitter for this system is now about 30ms. Thus, the average packet delay and the delay jitter
increase quickly as the arrival rate of packets approaches the transmission rate $\mu$.
Now imagine that N computers share a link that can send $N\mu$ packets per second. In this
case, replacing $\lambda$ by $N\lambda$ and $\mu$ by $N\mu$ in the previous formulas, we find that the average delay is now
T/N and that the average number of packets is still L. Thus, by sharing a faster link, the packets
face a smaller delay. This effect, which is not too surprising if we notice that the transmission time
of each packet is now much smaller, is another benefit of having computers share fast links through
switches instead of having slower dedicated links.
Summarizing, the M/M/1 model makes it possible to estimate the delay and backlog at a transmission
line. The average transmission time (in seconds) is the average packet length (in bits) divided by the
link rate (in bps). For a moderate utilization $\rho$ (say 80%), the average delay is a small multiple
of the average transmission time (say 3 to 5 times). Also, the delay jitter can be estimated as about
3 times the average delay through the buffer.
2.2.7 LITTLE'S RESULT
Another basic result helps understand some important aspects of networks: Little's result, discovered
by John D. C. Little in 1961. Imagine a system where packets arrive with an average rate of $\lambda$ packets
per second. Designate by L the average number of packets in the system and by W the average time
that a packet spends in the system. Then, under very weak assumptions on the arrivals and processing
of the packets, the following relation holds:
$$L = \lambda W.$$
Note that this relation, called Little's Result, holds for the M/M/1 queue, as the formulas indicate.
However, this result holds much more generally.
To understand Little's result, one can argue as follows. Imagine that a packet pays one dollar
per second it spends in the system. Accordingly, the average amount that each packet pays is W
dollars. Since packets go through the system at rate $\lambda$, the system gets paid at the average rate of $\lambda W$
dollars per second. Now, this average rate must be equal to the average number L of packets in the system
since they each pay at the rate of one dollar per second. Hence $L = \lambda W$.
As an illustration, say that 1 billion users send packets in the Internet at an average rate of
10 MByte per day each. Assume also that each packet spends 0.1s in the Internet, on average. Measuring
the traffic in bits, the average number of bits in the Internet is then L where
$$L = 10^9 \times \frac{8 \times 10^7}{24 \times 3{,}600} \times 0.1 \approx 9.2 \times 10^{10},$$
i.e., about 92 billion bits.
Some of these bits are in transit in the fibers, the others are mostly in the routers.
As another example, let us try to estimate how many bits are stored in a fiber. Consider a fiber
that transmits bits at the rate of 2.4Gbps and assume that the fiber is used at 20% of its capacity. Assume that
the fiber is 100km long. The propagation time of light in a fiber is about $5\,\mu$s per km. This is equal
to the propagation time in a vacuum multiplied by the refractive index of the fiber. Thus, each bit
spends a time W in the fiber where $W = 5\,\mu\text{s} \times 100 = 5 \times 10^{-4}$s. The average arrival rate of bits
in the fiber is $\lambda = 0.2 \times 2.4\text{Gbps} = 0.48 \times 10^9$ bps. By Little's result, the average number of bits
stored in the fiber is L where
$$L = \lambda W = (0.48 \times 10^9) \times (5 \times 10^{-4}) = 2.4 \times 10^5 \text{ bits} = 30\text{Kbytes}.$$
A router also stores bits. To estimate how many, consider a router with 16 input ports at 1
Gbps. Assume the router is used at 10% of its capacity. Assume also each bit spends 5 ms in the
router. We find that the average arrival rate of bits is $\lambda = 0.1 \times 16 \times 10^9$ bps. The average delay is
$W = 5 \times 10^{-3}$s. Consequently, the average number L of bits in the router is
$$L = \lambda W = (1.6 \times 10^9) \times (5 \times 10^{-3}) = 8 \times 10^6 \text{ bits} = 1\text{MByte}.$$
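The three estimates of this section are direct applications of Little's result, and a few lines of Python suffice to reproduce them. The function name `little` is ours; all the rates are expressed in bits per second, so that each result is a number of bits.

```python
def little(arrival_rate: float, time_in_system: float) -> float:
    """Little's result: average number in the system, L = lambda * W."""
    return arrival_rate * time_in_system


# Internet: 10^9 users, 10 MByte per day each, 0.1 s in the network.
internet_rate_bps = 1e9 * (10e6 * 8) / (24 * 3_600)
internet_bits = little(internet_rate_bps, 0.1)

# Fiber: 2.4 Gbps used at 20%, 100 km at 5 microseconds per km.
fiber_bits = little(0.2 * 2.4e9, 5e-6 * 100)

# Router: 16 ports at 1 Gbps used at 10%, 5 ms per bit.
router_bits = little(0.1 * 16 * 1e9, 5e-3)

print(f"{internet_bits:.3g} bits")        # about 9.26e+10
print(f"{fiber_bits / 8:.0f} bytes")      # 30000
print(f"{router_bits / 8:.0f} bytes")     # 1000000
```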
2.2.8 FAIRNESS
Consider two flows that would like to obtain the throughputs of 1.6 and 0.4Mbps, respectively.
Also suppose that these two flows have to share a bottleneck link with the link rate of 1 Mbps. If
a controller can regulate how the link rate is shared between the two flows, the question of fairness
arises, i.e., what rates should be allocated to the two flows. One possibility is to allocate the rates in
proportion to their desired throughputs, i.e., 0.8 and 0.2Mbps, respectively. This allocation may seem
unfair to the lower throughput flow, which is not able to obtain its desired throughput even though that is
less than half of the link rate. Another possibility is to maximize the allocation to the worse-off flow,
i.e., by allocating 0.6 and 0.4Mbps, respectively. The latter allocation is referred to as the max-min
allocation. Achieving fairness across a network is more challenging than the simple example of the
single link considered here. This can often lead to a trade-off between the network efficiency and
fairness. We will explore this issue further in the chapter on models.
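For a single link, both allocations can be computed with the Python sketch below. The max-min allocation is obtained here by a common procedure that we assume for illustration: the smallest demands are satisfied first, as long as they do not exceed an equal share of what remains, and the remaining flows split the rest equally. The function names are ours.

```python
def proportional(demands, capacity):
    """Allocate the link rate in proportion to the desired throughputs."""
    total = sum(demands)
    if total <= capacity:
        return list(demands)
    return [capacity * d / total for d in demands]


def max_min(demands, capacity):
    """Max-min allocation of one link among flows with given demands."""
    allocation = [0.0] * len(demands)
    remaining = capacity
    order = sorted(range(len(demands)), key=lambda i: demands[i])
    while order:
        share = remaining / len(order)       # equal split of what is left
        smallest = order[0]
        if demands[smallest] <= share:
            allocation[smallest] = demands[smallest]
            remaining -= demands[smallest]
            order.pop(0)
        else:
            for i in order:                  # nobody can be fully satisfied
                allocation[i] = share
            break
    return allocation


print(proportional([1.6, 0.4], 1.0))   # approximately [0.8, 0.2]
print(max_min([1.6, 0.4], 1.0))        # approximately [0.6, 0.4]
```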
2.3 SCALABILITY
For a network to be able to grow to a large size, like the Internet, it is necessary that modifications
have only a limited effect.
In the early days of the Internet, each computer had a complete list of all the computers
attached to the network. Thus, adding one computer required updating the list that each computer
maintained. Each modification had a global effect. Let us assume that half of the 1 billion computers
on the Internet were added during the last ten years. That means that during these last ten years
approximately $0.5 \times 10^9/(10 \times 365) > 10^5$ computers were added each day, on average.
Imagine that the routers in Figure 2.1 have to store the list of all the computers that are
reached through each of their ports. In that case, adding a computer to the network requires modifying a
list in all the routers, clearly not a viable approach.
Consider also a scheme where the network must keep track of which computers are currently
exchanging information. As the network gets larger, the potential number of simultaneous
communications also grows. Such a scheme would require increasingly complex routers.
Needless to say, it would be impossible to update all these lists to keep up with the changes.
Another system had to be devised. We describe some methods that the Internet uses for scalability.
2.3.1 LOCATION-BASED ADDRESSING
We explained in the previous chapter that the IP addresses are based on the location of the devices,
in a scheme similar to the telephone numbers.
One first benefit of this approach is that it reduces the size of the routing tables, as we already
discussed. Figure 2.3 shows $M = N^2$ devices that are arranged in N groups with N devices each.
Each group is attached to one router.
Figure 2.3: A simple illustration of location-based addressing.
In this arrangement, each router 1, . . . , N needs one routing entry for each device in its group
plus one routing entry for each of the other $N - 1$ routers. For instance, consider a packet that goes
from device 1.2 to device N.1. First, the packet goes from device 1.2 to router 1. Router 1 has a
routing entry that specifies the next hop toward router N, say router 5. Similarly, router 5 has a
routing entry for the next hop, router 4, and so on. When the packet gets to router N, the latter
consults its routing table to find the next hop toward device N.1. Thus, each routing table has size
$N + (N - 1) \approx 2\sqrt{M}$.
In contrast, we will explain that the Ethernet network uses addresses that have no relation
to the location of the computers. Accordingly, the Ethernet switches need to maintain a list of the
computers attached to each of their ports. Such an approach works fine for a relatively small number
of computers, say a few thousand, but does not scale well beyond that. If the network has M devices,
each routing table needs M entries.
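To make the comparison concrete, the following sketch tabulates the two table sizes for a few values of N. It assumes the arrangement described above (N groups of N devices); the function names are ours and purely illustrative.

```python
import math

def location_based_entries(n_groups: int) -> int:
    """Entries per router with location-based addresses: N devices + (N - 1) routers."""
    return n_groups + (n_groups - 1)

def flat_entries(n_groups: int) -> int:
    """Entries per table when addresses carry no location: one per device, M = N * N."""
    return n_groups * n_groups

for n in (10, 100, 1000):
    m = n * n
    # Columns: N, M, location-based size, the approximation 2*sqrt(M), flat size.
    print(n, m, location_based_entries(n), round(2 * math.sqrt(m)), flat_entries(n))
```

With N = 1000, for instance, a location-based table holds 1999 entries whereas a flat table holds one million.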
Assume that one day most electric devices will be attached to the Internet, such as light bulbs,
coffee machines, door locks, curtain motors, cars, etc. What is a suitable way to select the addresses
for these devices?
2.3.2 TWO-LEVEL ROUTING
The Internet uses a routing strategy that scales at the cost of a loss of optimality. The mechanism,
whose details we discuss in the chapter on routing, groups nodes into domains.
Each domain uses a shortest-path algorithm where the length of a path is defined as the sum
of the metrics of the links along the path. A faster link has a smaller metric than a slower link.
To calculate the shortest path between two nodes in a domain, the switches must exchange
quite a bit of metric information. For instance, each node could make up a message with the list of
its neighbors and the metrics of the corresponding links. The nodes can then send these messages
to each other. For a simple estimate of the necessary number of messages, say that there are N
routers and that each router sends one message to each of the other $N - 1$ routers. Thus, each
router originates approximately N messages. For simplicity, say that each message goes through
some average number of hops, say 6, that does not depend on N. The total number of messages
that the N routers transmit is then about $6N^2$, or about $6N$ messages per router. This number of
messages becomes impractical when N is very large.
To reduce the complexity of the routing algorithms, the Internet groups the routers into
domains. It then essentially computes a shortest path across domains and each domain computes
shortest paths inside itself. Say that there are N domains with N routers inside each domain. Each
router participates in a shortest path algorithm for its domain, thus sending about 6N messages.
In addition, one representative router in each domain participates in a routing algorithm across
domains, also sending of the order of $6N$ messages. If the routing were done in one level, each
router would transmit $6N^2$ messages, on average, which is N times larger than when using two-level
routing.
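The message counts in this estimate can be written out as a small sketch. It assumes, as in the text, an average of 6 hops per message and N domains of N routers each; the function names are ours.

```python
HOPS = 6  # average number of hops per message, the value assumed in the text

def one_level_per_router(n: int) -> int:
    """Single level with N * N routers: each router sends to about N * N others."""
    return HOPS * n * n

def two_level_per_router(n: int) -> int:
    """Two levels: an ordinary router only exchanges messages inside its domain."""
    return HOPS * n

def two_level_representative(n: int) -> int:
    """A representative router also takes part in the routing across domains."""
    return 2 * HOPS * n

n = 100
print(one_level_per_router(n), two_level_per_router(n), two_level_representative(n))
print(one_level_per_router(n) // two_level_per_router(n))  # the ratio is N
```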
In the chapter on routing, we explain another motivation for a two-level routing approach. It
is rooted in the fact that inter-domain routing may correspond to economic arrangements as the domains
belong to different organizations.
2.3.3 BEST EFFORT SERVICE
A crucial design choice of the Internet is that it does not guarantee any precise property of the
packet delivery such as delay or throughput. Instead, the designers opted for simplicity and decided
that the Internet should provide a best effort service, which means that the network should try to
deliver the packets as well as possible, whatever this means. This design philosophy is in complete
contrast with that of the telephone network where precise bounds are specified for the time to get
a dial tone, to ring a called party, and for the delay of transmission of the voice signal. For instance,
voice conversation becomes very unpleasant if the one-way delay exceeds 250ms. Consequently, how
can one hope to use the Internet for voice conversations if it cannot guarantee that the delay is less
than 250ms? Similarly, a good video requires a throughput of at least 50Kbps. How can one use
the Internet for such applications? The approach of the Internet is that, as its technology improves,
it will become able to support more and more demanding applications. Applications adapt to the
changing quality of the Internet services instead of the other way around.
2.3.4 END-TO-END PRINCIPLE AND STATELESS ROUTERS
Best effort service makes it unnecessary for routers (or the network) to keep track of the number
or types of connections that are set up. Moreover, errors are corrected by the source and destination
that arrange for retransmissions. More generally, the guiding principle is that tasks should not be
performed by routers if they can be performed by the end devices. This is called the end-to-end
principle.
Thus, the routers perform their tasks for individual packets. The router detects if a given
packet is corrupted and discards it in that case. The router looks at the packet's destination address
to determine the output port. The router does not keep a copy of a packet for retransmission in the
event of a transmission or congestion loss on the next hop. The router does not keep track of the
connections, of their average bit rate or of any other requirement.
Accordingly, routers can be stateless: they consider one packet at a time and do not have any
information about connections. This feature simplifies the design of routers, improves robustness,
and makes the Internet scalable.
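The per-packet behavior described above can be illustrated with a toy forwarding function. This is a minimal sketch, not how a real router is implemented: the packet format, the CRC over the payload, the prefix rule, and the table `FORWARDING` are all hypothetical choices made for the example.

```python
import zlib
from typing import Optional

# Hypothetical forwarding table: destination prefix -> output port.
FORWARDING = {"10.1": 1, "10.2": 2}

def make_packet(dst: str, payload: bytes) -> dict:
    """Build a toy packet that carries a CRC32 checksum of its payload."""
    return {"dst": dst, "payload": payload, "crc": zlib.crc32(payload)}

def forward(packet: dict) -> Optional[int]:
    """Return the output port for the packet, or None if it is dropped.

    Only the packet itself and the forwarding table are consulted: no
    per-connection state is read or written, and no copy of the packet
    is kept for retransmission.
    """
    if zlib.crc32(packet["payload"]) != packet["crc"]:
        return None  # corrupted: discard; the end devices arrange retransmission
    prefix = packet["dst"].rsplit(".", 2)[0]  # e.g. "10.1.0.7" -> "10.1"
    return FORWARDING.get(prefix)

print(forward(make_packet("10.2.3.4", b"hello")))  # forwarded on port 2
```

Because `forward` depends only on its argument and on the table, two packets of the same connection are handled independently, which is the sense in which the router is stateless.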
2.3.5 HIERARCHICAL NAMING
The Internet uses an automated directory service called DNS (for Domain Name System). This
service translates the name of a computer into its address. For instance, when you want to connect
to the google server, the service tells your web browser the address that corresponds to the name
www.google.com of the server.
The names of the servers, such as www.google.com, are arranged in a hierarchy as a tree. For
instance, the tree splits up from the root into first-level domains such as .com, .edu, .be, and so
on. Each first-level domain then splits up into second-level domains, etc. The resulting tree is then
partitioned into subtrees or zones. Each zone is maintained by some independent administrative
entity, and it corresponds to some directory server that stores the addresses of the computers with the
corresponding names. For instance, eecs.berkeley.edu, stat.berkeley.edu, and ieor.berkeley.edu are
three such zones.
The point of this hierarchical organization of DNS is that a modification of a zone only
affects the directory server of that zone and not the others. That is, the hierarchy enables a distributed
administration of the directory system in addition to a corresponding decomposition of the database.
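The following toy model illustrates why a modification stays local to one zone. It is only a sketch: the dictionary `ZONES`, the host names under each zone, and the addresses (taken from the documentation range 192.0.2.0/24) are hypothetical, and real DNS resolution proceeds by delegation from the root rather than by the suffix search used here.

```python
from typing import Optional

# Hypothetical zone contents: one directory (dict) per zone.
ZONES = {
    "eecs.berkeley.edu": {"www.eecs.berkeley.edu": "192.0.2.10"},
    "stat.berkeley.edu": {"www.stat.berkeley.edu": "192.0.2.20"},
}

def zone_of(name: str) -> Optional[str]:
    """Return the zone that owns `name`, found by trying ever shorter suffixes."""
    labels = name.split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in ZONES:
            return candidate
    return None

def resolve(name: str) -> Optional[str]:
    """Translate a name into an address by asking the directory of its zone."""
    zone = zone_of(name)
    return ZONES[zone].get(name) if zone is not None else None

def add_host(name: str, address: str) -> None:
    """Register a new host; only the directory of its own zone is modified."""
    zone = zone_of(name)
    if zone is None:
        raise ValueError(f"no zone is responsible for {name}")
    ZONES[zone][name] = address

add_host("mail.eecs.berkeley.edu", "192.0.2.11")
print(resolve("mail.eecs.berkeley.edu"), len(ZONES["stat.berkeley.edu"]))
```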
2.4 APPLICATION AND TECHNOLOGY INDEPENDENCE
The telephone network was designed to carry telephone calls. Although it could also be used for
a few other applications such as teletype and fax, the network design was not flexible enough to
support a wide range of services.
The Internet is quite different from the telephone network. The initial objective was to enable
research groups to exchange files. The network was designed to be able to transport packets with a
specified format including agreed-upon source and destination addresses. A source fragments a
large file into chunks that fit into individual packets. Similarly, a bit stream can be transported by a
sequence of packets. Engineers quickly designed applications that used the packet transport capability
of the Internet. Examples of such applications include email, the world wide web, streaming video,
voice over IP, peer-to-peer, video conferences, and social networks. Fundamentally, information can
be encoded into bits. If a network can transport packets, it can transport any type of information.
Figure 2.4: The five layers of the Internet (Application, Transport, Network, Link, Physical).
2.4.1 LAYERS
To formalize this technology and application independence, the Internet is arranged into five functional
layers as illustrated in Figure 2.4. The Physical Layer is responsible for delivering bits across a
physical medium. The Link Layer uses the bit delivery mechanism of the Physical Layer to deliver
packets across the link. The Network Layer uses the packet delivery across successive links to deliver
the packets from source to destination. The Transport Layer uses the end-to-end packet delivery to
transport individual packets or a byte stream from a process in the source computer to a process in
the destination computer. The transport layer implements error control (through acknowledgments
and retransmissions), congestion control (by adjusting the rate of transmissions to avoid congesting
routers), and flow control (to avoid overflowing the destination buffer). Finally, the Application Layer
implements applications that use the packet or byte stream transport service. A packet from the
application layer may be a message or a file.
This layered decomposition provides compatibility of different implementations. For instance,
one can replace a wired implementation of the Physical and Link Layers by a wireless implementation
without having to change the higher layers and while preserving the connectivity with the rest of
the Internet. Similarly, one can develop a new video conferencing application without changing the
other layers. Moreover, the layered decomposition simplifies the design process by decomposing it
into independent modules. For instance, the designer of the physical layer does not need to consider
the applications that will use it. We explain in the chapter on models that the benefits of layering come
at a cost of loss of performance.
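One way to picture this independence is as nested wrapping: each layer adds its own header and ignores what is inside. The sketch below uses made-up text headers (`TCP|`, `IP|`, `ETH|`, `WIFI|`) purely for illustration; real headers are binary and carry many fields.

```python
def encapsulate(message: bytes, link_header: bytes = b"ETH|") -> bytes:
    """Wrap an application message in toy transport, network, and link headers."""
    segment = b"TCP|" + message   # Transport Layer
    packet = b"IP|" + segment     # Network Layer
    return link_header + packet   # Link Layer (wired by default)

def decapsulate(frame: bytes, link_header: bytes = b"ETH|") -> bytes:
    """Strip the headers in the reverse order and return the application message."""
    for header in (link_header, b"IP|", b"TCP|"):
        if not frame.startswith(header):
            raise ValueError("unexpected header")
        frame = frame[len(header):]
    return frame

wired = encapsulate(b"hello")
wireless = encapsulate(b"hello", link_header=b"WIFI|")
# Replacing the link layer leaves everything above it unchanged.
print(wired[len(b"ETH|"):] == wireless[len(b"WIFI|"):])
print(decapsulate(wireless, link_header=b"WIFI|"))
```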
2.5 APPLICATION TOPOLOGY
Networked applications differ in how hosts exchange information. This section explains the main
possibilities. It is possible for a particular networked application to fit the description of more than
one mode of information exchange described below.
2.5.1 CLIENT/SERVER
Web browsing uses a client/server model. In this application, the user host is the client and it connects
to a server such as google.com. The client asks the server for files and the server transmits them.
Thus, the connection is between two hosts and most of the transfer is from the server to the client.
In web browsing, the transfer is initiated when the user clicks on a hyperlink that specifies
the server name and the files. In some cases, the request contains instructions for the server, such as
keywords in a search or operations to perform such as converting units or doing a calculation.
Popular web sites are hosted by a server farm, which is a collection of servers (from hundreds
to thousands) equipped with a system for balancing the load of requests. That is, the requests arrive
at a device that keeps track of the servers that can handle them and of how busy they are; that device
then dispatches the requests to the servers most likely to serve them fast. Such a device is sometimes
called a layer 7 router. One important challenge is to adjust the number of active servers based on the
load to minimize the energy usage of the server farm while maintaining the quality of its service.
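As an illustration of the dispatching idea, here is a minimal sketch of a device that sends each request to the server with the fewest outstanding requests. The policy and the class `Dispatcher` are our assumptions for the example; actual load balancers use a variety of policies and measurements.

```python
class Dispatcher:
    """Toy load balancer that tracks how busy each server is."""

    def __init__(self, servers):
        # Number of requests currently being served, per server.
        self.active = {name: 0 for name in servers}

    def dispatch(self) -> str:
        """Pick the least busy server (first one in case of a tie)."""
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def done(self, server: str) -> None:
        """Record that a server finished handling one request."""
        self.active[server] -= 1

d = Dispatcher(["s1", "s2"])
first, second = d.dispatch(), d.dispatch()  # one request goes to each server
d.done(second)                              # the second server becomes idle
print(first, second, d.dispatch())          # the next request goes to it
```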
2.5.2 P2P
A P2P (Peer-to-Peer) system stores files in user hosts instead of specialized servers. For instance,
using BitTorrent, a user looking for a file (say a music MP3 file) finds a list of user hosts that have
that file and are ready to provide it. The user can then request the file from those hosts. A number
of hosts can deliver different fractions of that file in parallel.
The P2P structure has one major advantage over the client/server model: a popular file is likely
to be available from many user hosts. Consequently, the service capacity of the system increases
automatically with the popularity of files. In addition, the parallel download overcomes the asymmetry
between upload and download link rates. Indeed, the Internet connection of a residential user is
typically 3 times faster for downloads than for uploads. This asymmetry is justified for client/server
applications. However, it creates a bottleneck when the server belongs to a residential user. The
parallel downloads remove that asymmetry bottleneck.
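A rough calculation shows the effect of parallel downloads. The sketch assumes that every source uploads at the same rate, that the only limits are the upload links of the sources and the download link of the receiver, and it uses example rates of our choosing (1 Mbps up, 3 Mbps down).

```python
def download_time(file_bits: float, upload_bps: float,
                  download_bps: float, sources: int) -> float:
    """Time to fetch a file in parallel from `sources` residential hosts."""
    # The aggregate rate is limited by the sources' uploads or by our download link.
    rate = min(download_bps, sources * upload_bps)
    return file_bits / rate

FILE_BITS = 40e6          # a 5-Mbyte file (example value)
UP, DOWN = 1e6, 3e6       # upload and download rates in bps (example values)

for k in (1, 3, 10):
    print(k, round(download_time(FILE_BITS, UP, DOWN, k), 1))
```

With one source the transfer is limited by that host's upload rate; with three or more sources the receiver's download link becomes the limit, so additional sources no longer shorten the transfer.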
2.5.3 CLOUD COMPUTING
Cloud computing refers to the new paradigm where a user makes use of the computing service hosted
by a collection of computers attached to the network. A number of corporations make such services
available. Instead of having to purchase and install the applications on his server, a user can lease
the services from a cloud computing provider. The user can also upload his software on the cloud
servers. The business driver for Cloud Computing is the sharing of the computing resources among
different users. Provision of sufficient computing resources to serve a burst of requests satisfactorily
is one of the key issues.
The servers in the cloud are equipped with software to run the applications in a distributed
way. Some applications are well suited for such a decomposition, such as indexing of web pages
and searches. Other applications are more challenging to distribute, such as simulations or extensive
computations.
2.5.4 CONTENT DISTRIBUTION
A content distribution system is a set of servers (or server farms) located at various points in the
network to improve the delivery of information to users. When a user requests a file from one server,
that server may redirect the request to a server closer to the user. One possibility is for the server to
return a homepage with links selected based on the location of the requesting IP address. Akamai is
a content distribution system that many companies use.
2.5.5 MULTICAST/ANYCAST
Multicast refers to the delivery of files or streams to a set of hosts. The hosts subscribe to a multicast
and the server sends them the information. The network may implement multicast as a set of one-to-
one (unicast) deliveries. Alternatively, the network may use special devices that replicate packets to
limit the number of duplicate packets that traverse any given link. Twitter is a multicast application.
Anycast refers to the delivery of a file to any one of a set of hosts. For instance, a request for
information might be sent to any one of a set of servers.
2.5.6 PUSH/PULL
When a user browses the web, his host pulls information from a server. In contrast, push applications
send information to user hosts. For instance, a user may subscribe to a daily newspaper that a server
forwards according to a schedule when it finds the user's host available.
2.5.7 DISCOVERY
In most applications, the user specifies the files he is looking for. However, some applications discover
and suggest information for the user. For instance, one application searches for users that are in the
neighborhood and are looking for someone to have dinner with. Another application lists menus of
nearby restaurants.
2.6 SUMMARY
The Internet has the following important features:
Packet switching enables efficient sharing of network resources (statistical multiplexing);
Hierarchical naming, location-based addressing, two-level routing, and stateless routers (possible
because of the best effort service and the end-to-end principle) make the Internet scalable;
Layering simplifies the design, and provides independence with respect to applications and
compatibility with different technologies;
Applications have different information structures that range from client/server to cloud computing
to P2P to discovery.
Some key metrics of performance of the Internet are as follows:
The bandwidth (or bit rate) of a link;
The throughput of a connection;
The average delay and delay jitter of a connection;
Little's result relates the backlog, the average delay, and the throughput.
2.7 PROBLEMS
P2.1 Assume 400 users share a 100Mbps link. Each user is active a fraction 10% of the time, during
a busy hour. If all the active users get an equal fraction of the link rate, what is the average
throughput per active user?
P2.2 In this problem, we refine the estimates of the previous problem. If one flips a coin $n \gg 1$ times
and each coin flip has a probability $p \in (0, 1)$ of yielding Head, the average number of Heads
is $np$. Moreover, the probability that the fraction of Heads is larger than $p + 1.3\sqrt{p(1 - p)/n}$
is about 10%. Use this fact to calculate the rate R such that the probability that the throughput
per active user in the previous problem is less than R is only 10%.
P2.3 Imagine a switch in a network where packets arrive at the rate $\lambda = 10^6$ packets per second.
Assume also that a packet spends $T = 1$ ms $= 10^{-3}$ s in the switch, on average. Calculate the
average number of packets that the switch holds at any given time.
P2.4 Packets with an average length of 1KBytes arrive at a link to be transmitted. The arrival rate
of packets corresponds to 8Mbps. The link rate is 10Mbps. Using the M/M/1 delay formula,
estimate the average delay per packet. What fraction of that delay is due to queuing?
P2.5 Say that a network has N domains with M routers each and K devices are attached to each
router. Assume that the addresses are not geographically based but are assigned randomly
to the devices instead. How many entries are there in the routing table of each router if the
routing uses only one level? What if the network uses a 2-level routing scheme? Now assume
that the addresses are based on location. What is the minimum average size of the routing
table in each router?
P2.6 Consider a router in the backbone of the Internet. Assume that the router has 24 ports, each
attached to a 1Gbps link. Say that each link is used at 15% of its capacity by connections that
have an average throughput of 200Kbps. How many connections go through the router at any
given time? Say that the connections last an average of 1 minute. How many new connections
are set up that go through the router in any given second, on average?
P2.7 We would like to transfer a 20Kbyte file across a network from node A to node F. Packets have a
length of 1Kbyte (neglecting header). The path from node A to node F passes through 5 links,
and 4 intermediate nodes. Each of the links is a 10Km optical fiber with a rate of 10Mbps.
The 4 intermediate nodes are store-and-forward devices, and each intermediate node must
perform a 50μs routing table look up after receiving a packet before it can begin sending it on
the outgoing link. How long does it take to send the entire file across the network?
P2.8 Suppose we would like to transfer a file of K bits from node A to node C. The path from node
A to node C passes through two links and one intermediate node, B, which is a store-and-
forward device. The two links are of rate R bps. The packets contain P bits of data and a 6
byte header. What value of P minimizes the time it takes to transfer the file from A to C? Now
suppose there are N intermediate nodes. What value of P minimizes the transfer time in this
case?
P2.9 Consider the case of GSM cell phones. GSM operates at 270.88Kbps and uses a spectrum
spanning 200KHz. What is the theoretical SNR (in dB) that these phones need for operation?
In reality, the phones use a SNR of 10dB. Use Shannon's theorem to calculate the theoretical
capacity of the channel, under this signal-to-noise ratio. How does the utilized capacity
compare with the theoretical capacity?
2.8 REFERENCES
The theory of channel capacity is due to Claude Shannon (44). A fun tutorial on queuing theory
can be found at (46). Little's result is proved in (31). For a discussion of two-level routing, see (67).
The end-to-end principle was elaborated in (43). The structure of DNS is explained in (65). The
layered structure of the Internet protocols is discussed in (66).
CHAPTER 3
Ethernet
Ethernet is a technology used to connect up to a few hundred computers and devices. The connections
use wires of up to 100m or fibers of up to a few km. The bit rate on these wires and fibers is usually
100Mbps but can go to 1Gbps or even 10Gbps. The vast majority of computers on the Internet are
attached to an Ethernet network or, increasingly, its wireless counterpart WiFi. We discuss WiFi in
a separate chapter.
We first review a typical Ethernet network. We then explore the history of Ethernet. We then
explain the addressing and frame structure. After a brief discussion of the physical layer, we examine
switched Ethernet. We then discuss Aloha and hub-based Ethernet.
3.1 TYPICAL INSTALLATION
Figure 3.1 shows a typical installation. The computers are attached to a hub or to a switch with wires
or fibers. We discuss hubs and switches later in this chapter. We call ports the different attachments
of a switch or a hub (same terminology as for routers). A small switch or hub has 4 ports; a large
switch has 48 or more ports. The hub is used less commonly now because of the superiority of the
switch. The figure also shows a wireless extension of the network.
Figure 3.1: Typical Ethernet installation.
3.2 HISTORY OF ETHERNET
Ethernet started as a wired version of a wireless network, the Aloha network. Interestingly, after two
decades, a wireless version of Ethernet, WiFi, was developed.
3.2.1 ALOHA NETWORK
The Aloha network was invented around 1970 by Norman Abramson and his collaborators. He
considered the scenario where multiple devices distributed over different Hawaiian islands wanted
to communicate with the mainframe computer at the main campus in Honolulu. Figure 3.2 shows
an Aloha network. The devices transmit packets to a central communication node attached to the
mainframe computer on frequency $f_1$. This node acknowledges packets on frequency $f_2$. A device
concludes that its packet is not delivered at the destination if it does not receive an acknowledgement
within a prescribed time interval.
Figure 3.2: Aloha network and the timing of transmissions.
The procedure for transmission is as follows: If a device is not
waiting for an acknowledgment of a previously transmitted packet and has a packet to transmit, it
transmits it immediately. If the device does not get an acknowledgment within a prescribed time
interval, it retransmits the packet after a random time interval.
The system uses two frequencies so that a device can receive acknowledgements regardless of
any transmissions. Since the receiver is overwhelmed by the powerful signal that the device transmits,
it cannot tell if another device is also transmitting at the same time. Consequently, a device transmits
a complete packet even though this packet may collide with another transmission, as shown in the
timing diagram.
The major innovation of the Aloha network is randomized multiple access. Using this mechanism,
the device transmissions are not scheduled ahead of time. Instead, the devices resolve their
conflicts with a distributed and randomized mechanism.
The mechanism of the Aloha network is as follows:
If an acknowledgement is not outstanding, transmit immediately;
If no acknowledgment, repeat after a random delay.
3.2.2 CABLE ETHERNET
Between 1973 and 1976, Robert Metcalfe and his colleagues developed a wired version of the Aloha
network illustrated in Figure 3.3. This version became known as Ethernet.
Figure 3.3: A cable Ethernet.
All the devices share a
common cable. They use a randomized multiple access protocol similar to the Aloha network but
with two key differences. A device waits for the channel to be idle before transmitting, and listens
while it transmits and it aborts its transmission as soon as it detects a collision. These carrier sensing
and collision detection mechanisms reduce the time wasted by collisions.
Thus, the mechanism of this network is as follows:
Wait until channel is idle, wait for a random time;
Transmit when the channel is idle following that random time, but abort if sense collision;
If collide, repeat.
The stations choose their random waiting time (also called backoff time) as a multiple X of one
time slot. A time slot is defined to be 512 bit transmission times for 10Mbps and 100Mbps Ethernet
and 4096 bit times for Gbps Ethernet. A station picks X uniformly in $\{0, 1, \ldots, 2^n - 1\}$ where
$n = \min\{m, 10\}$ and m is the number of collisions the station experienced with the same packet.
When m reaches 16, the station gives up and declares an error. This scheme is called the truncated
binary exponential backoff. Thus, the first time a station attempts to transmit, it does so just when
the channel becomes idle (more precisely, it must wait a small specified interframe gap). After one
collision, the station picks X to be equal to 0 or 1, with probability 0.5 for each possibility. It then
waits for the channel to be idle and for $X \times 512$ bit times (at 10Mbps or 100Mbps). When the
channel is sensed to be idle following this waiting time, the station transmits while listening, and it
aborts when it detects a collision. A collision would occur if another station had selected the same
value for X. It turns out that 512 bit times is longer than 2PROP, so that if stations pick different
values for X, the one with the smallest value transmits and the others hear its transmission before
they attempt to transmit.
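The choice of the backoff time can be sketched in a few lines. This is an illustration of the rule stated above for 10Mbps and 100Mbps Ethernet, not an implementation of a network adapter; the function and constant names are ours.

```python
import random

SLOT_BIT_TIMES = 512   # one time slot at 10Mbps and 100Mbps, as stated in the text
MAX_COLLISIONS = 16    # the station gives up when m reaches this value

def backoff_slots(collisions: int, rng=random) -> int:
    """Pick X uniformly in {0, ..., 2**n - 1} with n = min(collisions, 10)."""
    if collisions >= MAX_COLLISIONS:
        raise RuntimeError("too many collisions: give up and declare an error")
    n = min(collisions, 10)
    return rng.randrange(2 ** n)

def backoff_bit_times(collisions: int, rng=random) -> int:
    """Waiting time, in bit times, after the given number of collisions."""
    return backoff_slots(collisions, rng) * SLOT_BIT_TIMES

# After m collisions, the waiting time is at most (2**min(m, 10) - 1) slots.
for m in (0, 1, 2, 3, 10, 15):
    print(m, backoff_bit_times(m))
```

With zero collisions the only possible value is X = 0, which matches the rule that a first attempt is made as soon as the channel becomes idle.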
The main idea of this procedure is that the waiting time becomes more random after multiple
collisions, which reduces the chances of repeated collisions. Indeed, if two stations are waiting for
the channel to be idle to transmit a packet for the first time, then they collide with probability 1, then
collide again with probability 1/2, which is the probability that they both pick the same value of X
in {0, 1}. The third time, they select X in {0, 1, 2, 3}, so that the probability that they collide again
is 1/4. The fourth time, the probability is only 1/8. Thus, the probability that the stations collide k
times is $(1/2) \times (1/4) \times \cdots \times (1/2^{k-1})$ for $k \le 11$, which becomes small very quickly.
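To see how quickly this probability decreases, one can evaluate the product exactly. The sketch below follows the simplified two-station argument above and is valid only for k up to 11, where the backoff range still doubles after each collision; the function name is ours.

```python
from fractions import Fraction

def prob_collide_k_times(k: int) -> Fraction:
    """Probability that two stations collide k times in a row (1 <= k <= 11)."""
    if not 1 <= k <= 11:
        raise ValueError("the product formula applies for 1 <= k <= 11")
    prob = Fraction(1)  # the first collision happens with probability 1
    for j in range(1, k):
        # After j collisions, both stations pick X in {0, ..., 2**j - 1};
        # they collide again when they pick the same value.
        prob *= Fraction(1, 2 ** j)
    return prob

for k in range(1, 7):
    print(k, prob_collide_k_times(k))
```

The product equals $2^{-k(k-1)/2}$, so five consecutive collisions already have probability 1/1024.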
This scheme has an undesirable (possibly unintended) side effect called capture or winner takes
all. This effect is that a station that is unlucky because it happens to collide tends to have to keep on
waiting as other stations that did not collide transmit. To see this, consider two stations A and B.
Assume that they both have packets to transmit. During the first attempt, they collide. Say that A
picks X = 0 and B picks X = 1. Then station A gets to transmit while B waits for one time slot. At
the end of A's transmission, A is ready with a new packet and B is ready with its packet since it has
decremented its backoff counter from the initial value 1 to 0 while it was waiting. Both stations then
collide again. This was the first collision for the new packet of A, so that station picks X in {0, 1}.
However, this was the second collision for B's packet, so that station picks X in {0, 1, 2, 3}. It is then
likely that A will pick a smaller value and get to transmit again. This situation can repeat multiple
times. This annoying problem never got fixed because Ethernet moved to a switched version.
3.2.3 HUB ETHERNET
For convenience of installation, Rock, Bennett, and Thaler developed the hub-based Ethernet (then
called StarLAN) in the mid 1980s. In such a network, shown in Figure 3.4, the devices are attached
to a hub with a point-to-point cable. When a packet arrives at the hub, it repeats it on all its other
ports. If two or more packets arrive at the hub at the same time, it sends a signal indicating that it
has detected a collision on all its output ports. (The cables are bi-directional.)
Figure 3.4: A hub-based Ethernet. Panels: the hub repeats P on all other ports; the hub signals "collision detected" (cd) on all ports.
The devices use the
same mechanism as in a cable-based Ethernet.
3.2.4 SWITCHED ETHERNET
In 1989, the company Kalpana introduced the Ethernet switch. As Figure 3.5 shows, the switch
sends a packet only to the output port toward the destination of the packet.
Figure 3.5: A switched Ethernet.
The main benefit of a
switch is that multiple packets can go through the switch at the same time. Moreover, if two packets
arrive at the switch at the same time and are destined to the same output port, the switch can store
the packets until it can transmit them. Thus, the switch eliminates the collisions.
3.3 ADDRESSES
Every computer Ethernet attachment is identified by a globally unique 48-bit string called an
Ethernet Address or MAC Address (for Media Access Control). This address is just an identifier since
it does not specify where the device is. You can move from Berkeley to Boston with your laptop, and
its Ethernet address does not change: it is hardwired into the computer.
Because the addresses are not location-based, the Ethernet switches maintain tables that list
the addresses of devices that can be reached via each of their ports.
The address with 48 ones is the broadcast address. A device is supposed to listen to broadcast
packets. In addition, devices can subscribe to group multicast addresses.
3.4 FRAME
Ethernet packets have the frame structure shown in Figure 3.6.
Figure 3.6: Ethernet frame.
The frame has the following fields:
A 7-byte preamble. This preamble consists of alternating ones and zeroes. It is used for
synchronizing the receiver.
A 1-byte start of frame delimiter. This byte is 10101011, and it indicates that the next bit is the
start of the packet.
A Length/Type field. If this field is larger than or equal to 1536, it indicates the type of
payload. This is the most common case. For instance, the value 0x08 0x00 indicates TCP/IP; that
packet specifies its own length in a field inside the IP header. If the field is less than or equal
to 1500, it corresponds to the length of the payload.
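The interpretation of the Length/Type field can be sketched as follows. The code only applies the threshold 1536 given above and treats every smaller value as a length, so it does not flag values that are neither a valid length nor a type; the function name is ours, and 0x0800 is the type value assigned to IPv4.

```python
ETHERTYPE_MIN = 1536  # 0x0600: values at or above this threshold denote a type

def interpret_length_type(field: bytes):
    """Classify the 2-byte Length/Type field (most significant byte first)."""
    if len(field) != 2:
        raise ValueError("the Length/Type field is exactly 2 bytes")
    value = int.from_bytes(field, "big")
    kind = "type" if value >= ETHERTYPE_MIN else "length"
    return kind, value

print(interpret_length_type(bytes([0x08, 0x00])))  # a type: 0x0800 = 2048 (IPv4)
print(interpret_length_type(bytes([0x00, 0x2E])))  # a length: 46 bytes of payload
```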
3.5 PHYSICAL LAYER
There are many versions of the physical layer of Ethernet. The name of a version has the form
[rate][modulation][media or distance]. Here are the most common examples:
10Base5 (10Mbps, baseband, coax, 500m);
10Base-T (10Mbps, baseband, twisted pair);
100Base-TX (100Mbps, baseband, 2 pair);
100Base-FX (100Mbps, baseband, fiber);
1000Base-CX for two pairs of balanced copper cabling;
1000Base-LX for long wavelength optical transmission;
1000Base-SX for short wavelength optical transmission.
3.6 SWITCHED ETHERNET
3.6.1 EXAMPLE
Figure 3.7 shows a network with three switches and ten devices. Each device has a unique address.
We denote these addresses by A, B, . . . , J.
3.6.2 LEARNING
When a computer with Ethernet address x sends a packet to a computer with Ethernet address y,
it sends a packet [y|x|data] to the switch. When it gets the packet, the switch reads the destination
address y, looks in a table, called the forwarding table, to find the port to which y is attached and sends
the packet on that port.
Figure 3.7: Example of switched Ethernet network. The forwarding tables shown in the figure are (A: 1, B: 2, C: 3, D, ..., J: 4), (A, B, C: 4, D: 1, E: 2, F: 3, G, ..., J: 5), and (A, ..., F: 1, G: 2, H: 3, I: 4, J: 5).
The switch updates its forwarding table as follows. When it gets a packet with a destination
address that is not in the table, the switch sends a copy of the packet on all the ports, except the port
on which the packet came. Whenever it gets a packet, it adds the corresponding source address to
the table entry that corresponds to the port on which it came. Thus, if packet [y|x|dat a] arrives to
the switch via port 3, the switch adds an entry [x = 3] to its table.
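The learning rule can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the class name `LearningSwitch` and its method are hypothetical, the model covers a single switch, and it ignores table aging and VLANs.

```python
class LearningSwitch:
    """Toy model of one Ethernet switch with a learned forwarding table."""

    def __init__(self, ports):
        self.ports = list(ports)   # port numbers of this switch
        self.table = {}            # learned mapping: address -> port

    def receive(self, in_port, dst, src):
        """Process packet [dst|src|data] arriving on in_port.

        Returns the list of ports on which the packet is sent.
        """
        # Learning step: the source address is reachable via the arrival port.
        self.table[src] = in_port
        if dst in self.table:
            return [self.table[dst]]
        # Unknown destination: send a copy on every port except the arrival port.
        return [p for p in self.ports if p != in_port]


switch = LearningSwitch(ports=[1, 2, 3, 4])
first = switch.receive(in_port=3, dst="y", src="x")   # y is unknown: copy to ports 1, 2, 4
reply = switch.receive(in_port=1, dst="x", src="y")   # x was learned on port 3
```

After these two packets, the table holds the entries [x = 3] and [y = 1], and the reply is sent on port 3 only.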
To select a unique path for packets in the network, Ethernet switches use the Spanning Tree
Protocol, as explained below. Moreover, the network administrator can restrict the links on which
some packets can travel by assigning the switch ports to virtual LANs. A virtual LAN, or VLAN,
is an identifier for a subset of the links of an Ethernet network. When using VLANs, the Ethernet
packet also contains a VLAN identifier for the packet. The source of the packet specifies that VLAN.
Each switch port belongs to a set of VLANs that the network administrator configures. A packet is
restricted to the links of its VLAN. The advantage of VLANs is that they separate an Ethernet
network into distinct virtual networks, as if these networks did not share switches and links. This
separation is useful for security reasons.
3.6.3 SPANNING TREE PROTOCOL
For reliability, it is common to arrange for a network to have multiple paths between two devices.
However, with redundant paths, switches can loop traffic and create multiple copies.
To avoid these problems, the switches run the spanning tree protocol that selects a unique path
between devices. Using this protocol, the switches find the tree of shortest paths rooted at the switch
with the smallest identification number (ID). The ID of a switch is the smallest of the Ethernet
addresses of its interfaces. What is important is that these IDs are different. To break ties, a switch
selects, among interfaces with the same distance to the root, the next hop with the smallest ID.
The protocol operates as follows. (See Figure 3.8.) The switches send packets with the information
[myID | Current Root ID | Distance to Current Root]. A switch relays packets whose Current
Root ID is the smallest the switch has seen so far, and the switch adds one to the distance to the current
root. Eventually, the switches only forward packets from the switch with the smallest ID, with their
distance to that switch.
Figure 3.8: Example of messages when the spanning tree protocol runs. The messages shown among
switches B1 to B6 are, in order, 1: [3|3|0], 2: [5|3|1], 3: [1|1|0], 4: [2|1|1], 5: [3|1|2], 6: [6|1|1].
The figure shows that, first, switch B3 sends packet [3|3|0], meaning "I am 3, I think the root
is 3, and my distance to 3 is 0." Initially, the switches do not know the network. Thus, B3 does not
know that there is another switch with a smaller ID. In step 2, B5, which has seen the first packet,
sends [5|3|1] because it thinks that the root is 3 and it is 1 step away from 3. The different packet
transmissions are not synchronized. In the figure, we assume that they happen in that order. Eventually,
B1 sends packet [1|1|0] that the other switches forward.
To see how ties are broken, consider what happens when B3 and B4 eventually relay the
packet from B1. At step 5, the figure shows that B3 sends [3|1|2] on its bottom port. At some later
step, not shown in the figure, B4 sends [4|1|2]. Switch B5 must choose between these two paths
to connect to B1: either via B3 or via B4, which are both two steps away from B1. The tie is broken
in favor of the switch with the smaller ID, thus B3 in this example. The result of the spanning tree
protocol is shown in Figure 3.8 by indicating the active links using solid lines.
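The following Python sketch illustrates the election of the root and the tie-breaking rule. It is a simplified model, not the actual protocol: the switches exchange their [ID | root | distance] information in synchronous rounds, and the list `links` is a hypothetical topology chosen so that switches 3 and 4 are both two steps away from switch 1 and both are neighbors of switch 5. The function name `spanning_tree` is also an assumption made for this illustration.

```python
def spanning_tree(links):
    """Return {switch: (root, distance, next_hop)} once no switch changes its mind."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Initially, every switch believes that it is the root, at distance 0.
    state = {s: (s, 0, s) for s in neighbors}

    changed = True
    while changed:
        changed = False
        # One synchronous round: each switch hears [ID|root|distance] from its neighbors.
        snapshot = dict(state)
        for s in neighbors:
            for nb in neighbors[s]:
                root, dist, _ = snapshot[nb]
                candidate = (root, dist + 1, nb)
                # Tuple order: smaller root first, then shorter distance,
                # then smaller next-hop ID (the tie-breaking rule).
                if candidate < state[s]:
                    state[s] = candidate
                    changed = True
    return state


# Hypothetical topology, loosely inspired by the example in the text.
links = [(1, 2), (1, 6), (2, 3), (2, 4), (6, 4), (3, 5), (4, 5)]
tree = spanning_tree(links)
```

In this sketch, switch 5 ends with the state (1, 3, 3): the root is 1, the distance is 3, and the next hop is switch 3 rather than switch 4 because 3 is the smaller ID.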
Summing up, the spanning tree protocol avoids loops by selecting a unique path in the network.
Some switches that were installed for redundancy get disabled by the protocol. If an active switch
or link fails, the spanning tree protocol automatically selects a new tree.
Note that, although the spanning tree is composed of shortest paths to the root, the resulting
routing may be far from optimal. To see this, imagine a network arranged as a ring with 2N +1
switches. The two neighboring switches that are at distance N from the switch with the smallest
ID communicate via the longest path.
3.7 ALOHA
In this section, we explore the characteristics of the Aloha network. We start with a time-slotted
version of the protocol and then we study a non-slotted version.
3.7.1 TIME-SLOTTED VERSION
Consider the following time-slotted version of the Aloha protocol. Time is divided into slots. The
duration of one time slot is enough to send one packet. Assume there are n stations and that the
stations transmit independently with probability p in each time slot. The probability that exactly
one station transmits in a given time slot is (see Section 3.10)
$$R(p) := np(1 - p)^{n-1}.$$
The value of p that maximizes this probability can be found by setting the derivative of R(p)
with respect to p equal to zero. That is,
$$0 = \frac{d}{dp} R(p) = n(1 - p)^{n-1} - n(n - 1)p(1 - p)^{n-2}.$$
Solving this equation, we find that the best value of p is
$$p = \frac{1}{n}.$$
This result confirms the intuition that p should decrease as n increases. Replacing p by 1/n
in the expression for R(p), we find that
$$R\left(\frac{1}{n}\right) = \left(1 - \frac{1}{n}\right)^{n-1} \approx \frac{1}{e} \approx 0.37. \qquad (3.1)$$
The last approximation comes from the fact that
$$\left(1 + \frac{a}{n}\right)^{n} \approx e^{a} \quad \text{for } n \gg 1.$$
The quality of the approximation (3.1) is shown in Figure 3.9.
Thus, if the stations are able to adjust their probability p of transmission optimally, the protocol
is successful at most 37% of the time. That is, 63% of the time slots are either idle or wasted by
transmissions that collide.
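As a numerical check of this derivation, the short Python sketch below evaluates R(p) on a grid of values of p and compares the maximum with 1/e. The function name and the choice n = 10 are illustrative assumptions.

```python
import math


def success_probability(n, p):
    """Probability that exactly one of n stations transmits in a time slot."""
    return n * p * (1 - p) ** (n - 1)


n = 10
grid = [k / 1000 for k in range(1, 1000)]            # p = 0.001, 0.002, ..., 0.999
best_p = max(grid, key=lambda p: success_probability(n, p))
best_rate = success_probability(n, 1 / n)             # (1 - 1/n)^(n-1)
large_n_rate = success_probability(1000, 1 / 1000)    # approaches 1/e from above
```

On this grid the maximum is attained at p = 1/n = 0.1, and the success probability (1 − 1/n)^(n−1) decreases toward 1/e as n grows.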
Figure 3.9: The approximation for (3.1), plotted for n = 1 to 19 and compared with 1/e ≈ 0.37.
3.8 NON-SLOTTED ALOHA
So far, we assumed that all the Aloha stations were synchronized. What if they are not and can instead
transmit at arbitrary times? This version of Aloha is called non-slotted or pure. The interesting result
is that, in that case, they can only use a fraction 1/(2e) ≈ 18% of the channel. To model this situation,
consider very small time slots of duration ε ≪ 1. One packet transmission time is still equal to one
unit of time. Say that the stations start transmitting independently with probability p in every small
time slot and then keep on transmitting for one unit of time. The situation was shown in Figure 3.2.
The analysis in the appendix shows that by optimizing over p one gets a success rate of at
most 18%.
3.9 HUB ETHERNET
The study of hub-based Ethernet is somewhat involved. The first step is to appreciate the maximum
time T it takes to detect a collision. This results in a particular randomization procedure, which
incurs a wasted waiting time equal to an integer multiple of T. From that understanding, we can
calculate the efficiency.
3.9.1 MAXIMUM COLLISION DETECTION TIME
Imagine that two nodes A and B try to transmit at about the same time. Say that A starts transmitting
at time 0. (See Figure 3.10.) The signal from A travels through the wires to the hub, which repeats
it. The signal then keeps on traveling towards B. Let PROP indicate the maximum propagation
time between two devices in this network. By time PROP, the signal from A reaches B. Now,
imagine that B started transmitting just before time PROP. It thought the system was idle and
could then transmit. The signal from B will reach A after a time equal to PROP, that is, just before
time 2PROP. A little bit later, node A will detect a collision. Node B detected the collision around
time PROP, just after it started transmitting. To give a chance to A to detect the collision, B does
not stop as soon as it detects the collision. Stopping immediately might result in B sending such a
short signal that A might ignore it. Instead, B sends a jam signal, long enough to have the energy
required for A to notice it.
Figure 3.10: The maximum time for A to detect a collision is 2PROP.
The nodes wait a random multiple of T = 2PROP before they start transmitting, and they
transmit if the system is still idle at the end of their waiting time. The point is that if they choose
different multiples of T after the system last became idle, then they will not collide. To analyze the
efficiency of this system, assume that n stations transmit independently with probability p in each
time slot of duration T. We know that the probability that exactly one station transmits during one
time slot is at most 1/e. Taking this probability to be approximately 1/e, the average time until the
first success is, as shown in the appendix, e time slots of duration T. After this success, one station
transmits successfully for some average duration equal to TRANS, defined as the average
transmission time of a packet.
Thus, every transmission with duration TRANS requires a wasted waiting time of (e − 1) × T =
2(e − 1) × PROP. The fraction of time during which stations transmit successfully is then
$$\eta = \frac{\mathrm{TRANS}}{2(e - 1)\,\mathrm{PROP} + \mathrm{TRANS}} \approx \frac{1}{1 + 3.4A}$$
where A = PROP/TRANS.
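To get a feel for this expression, the sketch below computes the fraction of useful time directly from TRANS and PROP. The function name and the numerical values (1000-byte packets at 10 Mbps and a propagation time of 10 microseconds) are hypothetical choices made for illustration only.

```python
import math


def hub_efficiency(trans, prop):
    """Fraction of time spent on successful transmissions.

    trans: average packet transmission time; prop: maximum propagation time.
    Both arguments must use the same time unit.
    """
    wasted = 2 * (math.e - 1) * prop   # (e - 1) time slots of duration 2 * PROP
    return trans / (trans + wasted)


# Hypothetical numbers: 8000 bits at 10 bits per microsecond take 800 microseconds.
trans_us = 1000 * 8 / 10
eta = hub_efficiency(trans_us, 10.0)
```

With these numbers the efficiency is about 96%. Shorter packets or a larger propagation time both lower it.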
3.10 APPENDIX: PROBABILITY
In our discussion of the Aloha protocol, we needed a few results from Probability Theory. This
appendix provides the required background.
3.10.1 PROBABILITY
Imagine an experiment that has N equally likely outcomes. For instance, one rolls a balanced die
and the six faces are equally likely to be selected by the roll. Say that an event A occurs when the
selected outcome is one of M of these N equally likely outcomes. We then say that the probability
of the event A is M/N and we write P(A) = M/N.
For instance, in the roll of the die, if the event A is that one of the faces {2, 3, 4} is selected,
then P(A) = 3/6.
3.10.2 ADDITIVITY FOR EXCLUSIVE EVENTS
For this example, say that the event B is that one of the faces {1, 6} is selected. Note that the
events A and B are exclusive: they cannot occur simultaneously. Then "A or B" is a new event that
corresponds to the outcome being in the set {1, 2, 3, 4, 6}, so that P(A or B) = 5/6. Observe that
P(A or B) = P(A) + P(B).
Figure 3.11: Additivity for exclusive events (left) and product for independent events (right).
In general, it is straightforward to see that if A and B are exclusive events, then P(A or B) =
P(A) + P(B). This idea is illustrated in the left part of Figure 3.11. Moreover, the same property
extends to any finite number of events that are exclusive two by two. Thus, if $A_1, \ldots, A_n$ are
exclusive two by two, then
$$P(A_1 \text{ or } A_2 \cdots \text{ or } A_n) = P(A_1) + P(A_2) + \cdots + P(A_n).$$
Indeed, if event $A_m$ corresponds to $M_m$ outcomes for $m = 1, \ldots, n$ and if the sets of outcomes
of these different events have no common element, then the event $A_1$ or $A_2$ ... or $A_n$ corresponds
to $M_1 + \cdots + M_n$ outcomes and
$$P(A_1 \text{ or } A_2 \cdots \text{ or } A_n) = \frac{M_1 + \cdots + M_n}{N} = \frac{M_1}{N} + \cdots + \frac{M_n}{N} = P(A_1) + \cdots + P(A_n).$$
3.10.3 INDEPENDENT EVENTS
Now consider two experiments. The first one has $N_1$ equally likely outcomes and the second has $N_2$
equally likely outcomes. The event A is that the first experiment has an outcome that is in a set of
$M_1$ of the $N_1$ outcomes. The event B is that the second experiment has an outcome in a set of $M_2$
of the $N_2$ outcomes. Assume also that the two experiments are performed independently so that
the $N_1 \times N_2$ pairs of outcomes of the two experiments are all equally likely. Then we find that the
event "A and B" corresponds to $M_1 \times M_2$ possible outcomes out of $N_1 \times N_2$, so that
$$P(A \text{ and } B) = \frac{M_1 \times M_2}{N_1 \times N_2} = \frac{M_1}{N_1} \times \frac{M_2}{N_2} = P(A)P(B).$$
Thus, we find that if A and B are independent, then the probability that they both occur is
the product of their probabilities. (See right part of Figure 3.11.)
For instance, say that we roll two balanced dice. The probability that the first one yields an
outcome in {2, 3, 4} and that the second yields an outcome in {3, 6} is (3/6) × (2/6).
One can generalize this property to any finite number of such independent experiments. For
instance, say that one rolls the die three times. The probability that the three outcomes are in {1, 3},
{2, 4, 5}, and {1, 3, 5}, respectively, is (2/6) × (3/6) × (3/6).
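These two dice examples can be checked by listing all the equally likely outcomes and counting the favorable ones, exactly as in the definition P(A) = M/N. The helper `probability` below is a hypothetical name used only for this sketch.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)


def probability(event, rolls):
    """P(event) for `rolls` rolls of a balanced die; all tuples are equally likely."""
    outcomes = list(product(faces, repeat=rolls))
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))


two_dice = probability(lambda o: o[0] in {2, 3, 4} and o[1] in {3, 6}, rolls=2)
three_rolls = probability(
    lambda o: o[0] in {1, 3} and o[1] in {2, 4, 5} and o[2] in {1, 3, 5}, rolls=3
)
```

The counts agree with the product rule: 6 favorable pairs out of 36 in the first case, and 18 favorable triples out of 216 in the second.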
3.10.4 SLOTTED ALOHA
Recall the setup of the slotted Aloha. There are n stations that transmit independently with
probability p in each time slot. We claim that the probability of the event A that exactly one station
transmits is
$$P(A) = np(1 - p)^{n-1}. \qquad (3.2)$$
To see this, for $m \in \{1, 2, \ldots, n\}$, define $A_m$ to be the event that station m transmits and the
others do not. Note that
$$A = A_1 \text{ or } A_2 \ldots \text{ or } A_n$$
and the events $A_1, \ldots, A_n$ are exclusive. Hence,
$$P(A) = P(A_1) + \cdots + P(A_n). \qquad (3.3)$$
Now, we claim that
$$P(A_m) = p(1 - p)^{n-1}, \quad m = 1, \ldots, n. \qquad (3.4)$$
Indeed, the stations transmit independently. The probability that station m transmits and the other
n − 1 stations do not transmit is the product of the probabilities of those events, i.e., the product of
p and 1 − p and ... and 1 − p, which is $p(1 - p)^{n-1}$.
Combining the expressions (3.4) and (3.3) we find (3.2).
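The argument can also be verified by brute force for a small number of stations: list every pattern of who transmits, keep the patterns with exactly one transmitter, and add their probabilities. The function name and the values n = 5 and p = 0.2 are arbitrary choices for this sketch.

```python
from itertools import product


def exactly_one_by_enumeration(n, p):
    """Add the probabilities of all patterns in which exactly one station transmits."""
    total = 0.0
    for pattern in product([True, False], repeat=n):
        if sum(pattern) == 1:
            prob = 1.0
            for transmits in pattern:
                # Independence: multiply p for a transmitter, 1 - p for a silent station.
                prob *= p if transmits else (1 - p)
            total += prob   # exclusive events: probabilities add up
    return total


n, p = 5, 0.2
enumerated = exactly_one_by_enumeration(n, p)
formula = n * p * (1 - p) ** (n - 1)
```

Both computations give 0.4096, up to rounding errors.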
3.10.5 NON-SLOTTED ALOHA
Recall that in pure Aloha, we consider that the stations start transmitting independently with
probability p in small time slots of duration ε ≪ 1. A packet transmission lasts one unit of time, i.e.,
1/ε time slots. With ε very small, this model captures the idea that stations can start transmitting
at any time.
Consider one station, say station 1, that transmits a packet P, as shown in Figure 3.12. The
set of starts of transmissions that collide with P consists of 2/ε − 1 time slots: the 1/ε − 1 time
slots that precede the start of packet P and the 1/ε time slots during the transmission of P. Indeed,
any station that would start transmitting during any one of these 2/ε − 1 time slots would collide
with P. Accordingly, with K := 2/ε − 1, the probability S(p) that station 1 starts transmitting in a
given small time slot and is successful is
$$S(p) = p(1 - p)^{(n-1)K},$$
since none of the other n − 1 stations may start transmitting in any of these K time slots. The value
$p^*$ of p that maximizes S(p) is $p^* = 1/((n - 1)K + 1)$. Since there are 1/ε = (K + 1)/2 small time
slots per unit of time and n stations, the success rate per unit of time is
$$\frac{K + 1}{2}\, n S(p^*) = \frac{K + 1}{2}\, n\, \frac{1}{(n - 1)K + 1} \left(1 - \frac{1}{(n - 1)K + 1}\right)^{(n-1)K}.$$
With $M := (n - 1)K \approx n(K + 1) \approx (n - 1)K + 1$, we find that the success rate is approximately
equal to
$$\frac{1}{2}\left(1 - \frac{1}{M}\right)^{M} \approx \frac{1}{2e} \approx 18\%.$$
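The limit can be checked numerically. The sketch below assumes the model of this section in the following form: K = 2/ε − 1 small time slots can collide with a packet, a given station succeeds in a given small slot with probability p(1 − p)^((n−1)K), and p is chosen to maximize that probability. The function name and the values n = 100 and ε = 0.001 are illustrative assumptions.

```python
import math


def pure_aloha_rate(n, eps):
    """Success rate per unit of time with the best start probability."""
    k = 2 / eps - 1                    # number of small slots that collide with a packet
    m = (n - 1) * k
    p_star = 1 / (m + 1)               # maximizes p * (1 - p) ** m
    s = p_star * (1 - p_star) ** m     # success probability per station and small slot
    return n * s / eps                 # n stations, 1/eps small slots per unit of time


rate = pure_aloha_rate(n=100, eps=0.001)
limit = 1 / (2 * math.e)
```

With these values, the rate is within about 1% of 1/(2e) ≈ 0.184. The remaining gap comes mostly from the factor n/(n − 1), which tends to 1 as n grows.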
3.10.6 WAITING FOR SUCCESS
Consider the following experiment. You flip a coin repeatedly. In each flip, the probability of Head
is p. How many times do you have to flip the coin, on average, until you get the first Head?
To answer this question, let us look at a long sequence of coin flips. Say that there are on average
N − 1 Tails before the next Head. Thus, there is one Head out of N coin flips, on average. The
fraction of Heads is then 1/N. But we know that this fraction is p. Hence, p = 1/N, or N = 1/p.
As another example, one has to roll a die six times, on average, until one gets a first outcome
equal to 4. Similarly, if the probability of winning the California lottery is one in one million per
trial, then one has to play one million times, on average, before the first win. Playing once a week,
one can expect to wait about nineteen thousand years.
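The value N = 1/p can be cross-checked by a different route than the argument above: the first Head occurs at flip k with probability (1 − p)^(k−1) p, so the average is the sum of k (1 − p)^(k−1) p over all k. The sketch below truncates that sum; the function name and the number of terms are arbitrary choices.

```python
def expected_flips(p, terms=10_000):
    """Truncated sum of k * P(first Head at flip k) = k * (1 - p)**(k - 1) * p."""
    return sum(k * (1 - p) ** (k - 1) * p for k in range(1, terms + 1))


die = expected_flips(1 / 6)          # waiting for a given face of a die
coin = expected_flips(0.5)           # waiting for Head with a fair coin
lottery_years = 1_000_000 / 52       # one million weekly trials, in years
```

The sums are very close to 6 and 2, and one million weeks amount to roughly 19,000 years.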
3.10.7 HUB ETHERNET
As we explained in the section on Hub Ethernet, the stations wait a random multiple of T = 2PROP
until they attempt to transmit. In a simplified model, one may consider that the probability that
exactly one station transmits in a given time slot with duration 2PROP is approximately 1/e. Thus, the
average time until one station transmits alone is e time slots. Of this average time, all but the last
time slot is wasted. That is, the stations waste (e − 1) time slots of duration 2PROP for every
successful transmission, on average. Thus, for every transmission with duration TRANS, there is a
wasted time (e − 1) × 2PROP ≈ 3.4PROP.
Consequently, the fraction of time when the stations are actually using the network to transmit
packets successfully is TRANS divided by TRANS + 3.4PROP. Thus, the efficiency of the Hub
Ethernet is
$$\frac{\mathrm{TRANS}}{\mathrm{TRANS} + 3.4\,\mathrm{PROP}}.$$
3.11 SUMMARY
Ethernet is a widely used networking technology because it is low cost and fast. The chapter reviews
the main operations of this technology.
Historically, the first multiple access network where nodes share a transmission medium was
the Aloha network. A cable-based multiple access network followed, then a star-topology
version that used a hub, then the switch replaced the hub.
Each Ethernet device attachment has a unique 48-bit address.
Multiple versions of the Ethernet physical layer exist, with rates that range from 10Mbps to
10Gbps.
An Ethernet switch learns the list of devices attached to each port from the packet source
addresses. Switches run a spanning tree protocol to avoid loops.
The efficiency of slotted Aloha is at most 37% and that of non-slotted Aloha is at most 18%.
That of a hub-based Ethernet decreases with the transmission rate and increases with the
average packet length.
The appendix reviews the basic properties of probabilities: the additivity for exclusive events
and the product for independent events. It applies these properties to the analysis of the
efficiency of simple MAC protocols.
3.12 PROBLEMS
P3.1 We have seen several calculations showing that the efficiency of an Ethernet random access
scheme is well below 100%. Suppose we knew that there are exactly N nodes in the Ethernet.
Here's a strategy: we divide the time into N slots and make the 1st node use the 1st slot, the 2nd
node use the 2nd slot, and so on (this is called time division multiplexing). This way, we could
achieve 100% efficiency and there would never be any collision! What's the problem with this
plan?
P3.2 Consider a random access MAC protocol like Slotted ALOHA. There are N nodes sharing a
media, and time is divided into slots. Each packet takes up a single slot. If a node has a packet
to send, it always tries to send it out with a given probability. A transmission succeeds if a
single node is trying to access the media and all other nodes are silent.
(a) Suppose that we want to give differentiated services to these nodes. We want different
nodes to get a different share of the media. The scheme we choose works as follows: If
node i has a packet to send, it will try to send the packet with probability $p_i$. Assume that
every node has a packet to send all the time. In such a situation, will the nodes indeed
get a share of the media in the ratio of their probability of access?
(b) Suppose there are 5 nodes, and the respective probabilities are $p_1 = 0.1$, $p_2 = 0.1$,
$p_3 = 0.2$, $p_4 = 0.2$ and $p_5 = 0.3$. On average, what is the probability that each node is
able to transmit successfully in a given time slot?
(c) Now suppose that nodes do not always have a packet to send. In fact, the percentage
of time slots when a node has a packet to send (call it busy time $b_i$) is the same as its
probability of access, i.e., $b_i = p_i$. For simplicity's sake, do not consider any queuing or
storing of packets; assume only that node i has a packet to send on a fraction $b_i$ of the slots.
In such a situation, is the share of each node in the correct proportion of their access
probability, or busy time?
P3.3 In recent years, several networking companies have advocated the use of Ethernet (and
VLANs) for networks far beyond a "local" area. Their view is that Ethernet as a technology
could be used for much wider areas like a city (Metro), or even across several cities. Suggest two
nice features of Ethernet that would still be applicable in a wider area. Also suggest two other
characteristics which would not scale well, and would cause problems in such architectures.
P3.4 Consider a commercial 10 Mbps Ethernet configuration with one hub (i.e., all end stations
are in a single collision domain).
(a) Find Ethernet efficiency for transporting 512 byte packets (including Ethernet overhead)
assuming that the propagation delay between the communicating end stations is always
25.6 μs, and that there are many pairs of end stations trying to communicate.
(b) Recall that the maximum efficiency of Slotted Aloha is 1/e. Find the threshold for the
frame size (including Ethernet overhead) such that Ethernet is more efficient than Slotted
Aloha if the fixed frame size is larger than this threshold. Explain why Ethernet becomes
less efficient as the frame size becomes smaller.
P3.5 Ethernet standards require a minimum frame size of 512 bits in order to ensure that a node
can detect any possible collision while it is still transmitting. This corresponds to the number
of bits that can be transmitted at 10 Mbps in one roundtrip time. It only takes one propagation
delay, however, for the first bit of an Ethernet frame to traverse the entire length of the network,
and during this time, 256 bits are transmitted. Why, then, is it necessary that the minimum
frame size is 512 bits instead of 256?
P3.6 Consider the corporate Ethernet shown in the figure below.
Figure 3.13: Figure for Ethernet Problem 6.
(a) Determine which links get deactivated after the Spanning Tree Algorithm runs, and
indicate them on the diagram by putting a small X through the deactivated links.
(b) A disgruntled employee wishes to disrupt the network, so she plans on unplugging
central Bridge 8. How does this affect the spanning tree and the paths that ethernet
frames follow?
P3.7 In the network shown below, all of the devices want to transmit at an average rate of R Mbps,
with equal amounts of traffic going to every other node. Assume that all of the links are
half-duplex and operate at 100 Mbps and that the media access control protocol is perfectly
efficient. Thus, each link can only be used in one direction at a time, at 100 Mbps. There is
no delay to switch from one direction to the other.
Figure 3.14: Figure for Ethernet Problem 7.
(a) What is the maximum value of R?
(b) The hub is now replaced with another switch. What is the maximum value of R now?
3.13 REFERENCES
The Aloha network is described and analyzed in (1). Ethernet is introduced in (33). One key innovation
is the protocol where stations stop transmitting if they detect a collision. The actual mechanism
for selecting a random time before transmitting (the so-called exponential backoff scheme) is another
important innovation. For a clear introduction to probability, see (20).
CHAPTER 4
WiFi
WiFi is the name of the industry consortium that decides implementations of the wireless Ethernet
standards IEEE 802.11. In this chapter, we explain the basics of a WiFi network, including the
multiple access scheme used by such a network.
4.1 BASIC OPERATIONS
In this chapter, we only describe the most widely used implementation of WiFi: the infrastructure
mode with the Distributed Coordination Function (DCF). WiFi devices (laptops, printers, some
cellular phones, etc.) are equipped with a radio that communicates with an Access Point (AP) connected
to the Internet. The set of devices that communicate with a given access point is called a Basic Service
Set (BSS). The main components of WiFi are the MAC Sublayer and the Physical Layer.
The MAC Sublayer uses a binary exponential backoff. A station that gets a correct packet
sends an acknowledgement (ACK) after waiting a short time interval. To send a packet, a station
must wait for a longer time interval plus the random backoff delay, so that it will never interfere with
the transmission of an ACK. Finally, a variation of the protocol calls for a station that wants to send a
packet P to first transmit a short Request-to-Send (RTS) to the destination, which replies with a short
Clear-to-Send (CTS). Both the RTS and CTS indicate the duration of the transmission of P and
of the corresponding ACK. When they hear the RTS and the CTS, the other stations refrain from
transmitting until P and its ACK have been sent. Without the RTS/CTS, the stations close to the
destination might not hear the transmission of P and might interfere with it. We discuss these ideas
further later in the chapter.
The Physical Layer operates in the unlicensed band (2.4GHz or 5GHz) and uses a sophisticated
modulation scheme that supports a transmission rate from 1Mbps to 600Mbps, depending on the
version. Each version uses a number of disjoint frequency channels, which enables the network
manager to select different channels for neighboring BSSs to limit interference. The radios adjust
their transmission rate to improve the throughput. When a few packets are not acknowledged, the
sender reduces the transmission rate, and it increases the rate again after multiple successive packets
get acknowledged.
4.2 MEDIUM ACCESS CONTROL (MAC)
We first describe the MAC protocol, then discuss the addresses that WiFi uses.
4.2.1 MAC PROTOCOL
The MAC protocol uses a few constants defined as follows:
Table 4.1: Constants for the MAC Protocol.
Constant     802.11b    802.11a
Slot Time    20 μs      9 μs
SIFS         10 μs      16 μs
DIFS         50 μs      34 μs
EIFS         364 μs     94 μs
CW_min       31         15
CW_max       1023       1023
When a station (a device or the access point) receives a correct WiFi packet, it sends an ACK
after a SIFS (for Short Inter-frame Spacing).
To send a packet, a station must wait for the channel to be idle for DIFS (for DCF Inter-Frame
Spacing) plus the random backoff delay. Since DIFS > SIFS, a station can send an ACK
without colliding with a packet. The backoff delay is X Slot Times, where X is picked uniformly in
$\{0, 1, \ldots, CW_n\}$ where
$$CW_n = \min\{CW_{\max},\ (CW_{\min} + 1) \times 2^{n} - 1\}, \quad n = 0, 1, 2, \ldots \qquad (4.1)$$
and n is the number of previous attempts at transmitting the packet. For instance, in 802.11b,
$CW_0 = 31$, $CW_1 = 63$, $CW_2 = 127$, and so on. Thus, the station waits until the channel is idle,
then waits for a time interval DIFS. After that time, the station decrements X by one every Slot
Time. If another station transmits, it freezes its value of X and resumes decrementing it every Slot
Time after the channel has been idle for DIFS. The station transmits when X reaches 0.
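The contention window rule (4.1) is easy to tabulate. The sketch below uses the 802.11b constants of Table 4.1; the function names are hypothetical, and the random draw is only a model of how X is picked.

```python
import random


def contention_window(n, cw_min, cw_max):
    """CW_n from (4.1), where n is the number of previous attempts for the packet."""
    return min(cw_max, (cw_min + 1) * 2 ** n - 1)


def backoff_delay_us(n, cw_min, cw_max, slot_time_us, rng=random):
    """Backoff in microseconds: X Slot Times, with X uniform in {0, ..., CW_n}."""
    x = rng.randint(0, contention_window(n, cw_min, cw_max))   # both ends included
    return x * slot_time_us


# 802.11b: CW_min = 31, CW_max = 1023, Slot Time = 20 microseconds.
windows_11b = [contention_window(n, 31, 1023) for n in range(7)]
first_attempt_delay = backoff_delay_us(0, 31, 1023, 20)
```

The list `windows_11b` is [31, 63, 127, 255, 511, 1023, 1023]: the window roughly doubles after each failed attempt until it reaches CW_max.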
If a station receives an incorrect WiFi packet, it must wait for EIFS (for Extended Inter-
Frame Spacing) before attempting to transmit, to leave a chance for the stations that might be
involved in the previous transmission to complete their exchange of an ACK. In that case, the
stations that are waiting to transmit decrement their value of X every Slot Time only after the
channel has been idle for EIFS.
Figure 4.1 illustrates some steps in this MAC protocol in the case of 802.11b. There are
two WiFi devices A and B and one access point X. The figure shows that a previous transmission
completes, then the stations A, B, X wait for a DIFS, and they generate their backoff delays, shown as
A, B, X. Assuming that the stations try for the first time transmitting a new packet, these random
backoff delays are picked uniformly in $\{0, 1, \ldots, CW_0\} = \{0, 1, \ldots, 31\}$. Since the backoff delay
of B is the smallest, station B transmits its packet; after SIFS, the destination (the access point X)
sends an ACK. The stations A and X had frozen their backoff counters during the transmission of
the packet by B and of the ACK by X. After the channel is idle again for DIFS, the stations A and X
resume decrementing their counters and B generates a new random backoff B. In this example, the
counters of A and B expire at the same time and these stations start transmitting at the same time, which
results in a collision. The stations do not listen while transmitting, so the channel becomes idle only
at the end of the longest transmission (of B in this example). After the channel is idle for DIFS, the
stations A and B pick a new random backoff in $\{0, 1, \ldots, CW_1\}$ since they already had one attempt
for their current packet. Station X resumes decrementing its counter, which expires before those of A
and B. At that time, station X transmits; after SIFS, A sends an ACK, and the procedure continues.
Figure 4.1: Steps in WiFi MAC. The figure marks DIFS = 50 μs, SIFS = 10 μs, random backoff times
uniform in {0, ..., 31} × 20 μs for a first attempt and in {0, ..., 63} × 20 μs after a collision, and the
residual waiting times of A and X.
4.2.2 ENHANCEMENTS FOR MEDIUM ACCESS
The MAC protocol described above is a Carrier Sense Multiple Access with Collision Avoidance
(CSMA/CA) protocol. In general, wireless networks based on CSMA can suffer from two potential
issues: (i) Hidden terminal problem, and (ii) Exposed terminal problem. The hidden terminal
problem refers to the situation where two sending devices cannot hear each other (i.e., they are
hidden from each other), and hence one of them can begin transmission even if the other device is
already transmitting. This can result in a collision at the receiving device. In infrastructure mode WiFi
networks, this issue can manifest itself in the situation where two devices hidden from each other
end up having overlapping transmissions, and the AP fails to receive either of those transmissions.
The exposed terminal problem refers to the situation where a device refrains from transmitting as
it senses the medium to be busy, but its intended destination is different from that of the ongoing
transmission, and hence it could have in fact transmitted without any collision. In infrastructure mode
WiFi networks, since all communication takes place via the AP, the reader can verify that
the exposed terminal problem is not really an issue.
To overcome the hidden terminal problem, WiFi networks can make use of the RTS and CTS
messages. The idea is that a sender wanting to transmit data first sends out an RTS message obeying
the rules of medium access described above. This RTS message carries the address of the receiver
to whom the sender wishes to send data. Upon receiving the RTS message, the intended receiver
replies with a CTS message. Subsequently, upon receiving the CTS message, the sender begins data
transmission, which is followed by an ACK by the receiver. The entire sequence of RTS, CTS, data
and ACK messages is exchanged as consecutive messages separated by SIFS. Any device receiving
either the RTS or CTS message deduces that the channel is going to be busy for the impending
data transfer. Consider the situation where two devices hidden from each other want to send data
to the AP. If the first device sends out an RTS message and the AP responds with a CTS message,
the other device should hear at least the CTS message. Hence, the RTS/CTS exchange makes the
devices hidden from the sending device aware of the busy channel status.
The MAC Sublayer also provides another indication of how long the channel is going to be
busy. This is done by including the Duration field in the frame header (see Figure 4.2). This field is
used to update the Network Allocation Vector (NAV) variable at each device. The NAV variable indicates
how long the channel is going to remain busy with the current exchange. Use of NAV for keeping
track of channel status is referred to as Virtual Carrier Sensing.
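Virtual carrier sensing can be pictured with the small Python model below. It is a sketch under assumptions that the text does not spell out: time is measured in microseconds, the class and method names are hypothetical, and an overheard Duration value only ever extends the NAV, never shortens it.

```python
class VirtualCarrierSense:
    """Tracks the Network Allocation Vector (NAV) of one station."""

    def __init__(self):
        self.nav_expiry_us = 0   # time until which the channel is considered reserved

    def on_frame_heard(self, now_us, duration_us):
        """Update the NAV from the Duration field of an overheard frame."""
        # Assumption: keep the latest expiry, so a shorter value does not shorten the NAV.
        self.nav_expiry_us = max(self.nav_expiry_us, now_us + duration_us)

    def channel_busy(self, now_us, physical_carrier_detected=False):
        """Busy if energy is sensed or if the NAV has not yet expired."""
        return physical_carrier_detected or now_us < self.nav_expiry_us


station = VirtualCarrierSense()
station.on_frame_heard(now_us=0, duration_us=300)   # e.g., a CTS announcing 300 microseconds
busy_during = station.channel_busy(now_us=100)
busy_after = station.channel_busy(now_us=300)
```

A station that hears only the CTS, and not the data frame itself, still treats the channel as busy until the announced exchange is over.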
4.2.3 MAC ADDRESSES
Figure 4.2, adapted from (56), shows the MAC frame format. We briefly describe below the fields
in a frame. See (56) and (17) for further details.
Field               Bytes
Frame Control       2
Duration or ID      2
Address 1           6
Address 2           6
Address 3           6
Sequence Control    2
Address 4           6
Frame Payload       0-2312
FCS                 4
Figure 4.2: Generic MAC Frame Format.
The Frame Control field is used to indicate the type of frame (e.g., data frame, ACK, RTS,
CTS, etc.) and some other frame parameters (e.g., whether more fragments of the higher layer frame
are to follow, whether the frame is a retransmitted frame, etc.). The Duration field is used for setting
NAV as mentioned above. Observe that the frame header contains up to four MAC addresses.
Address 1 and Address 2 are the MAC addresses of the receiver and the transmitter, respectively.
For extended LAN communication, Address 3 is the MAC address of the source or destination device
depending on whether the frame is being transmitted or received by the AP. Address 4 is used only in
the extended LAN application where a WiFi network is used as a wireless bridge. In a typical home
installation, the AP device also acts as a router. Here, only Address 1 and Address 2 are important
as Address 3 coincides with one of these two addresses. It should be noted that the control frames
(e.g., RTS, CTS and ACK frames) and management frames (e.g., association request and beacon
frames) do not need to use all the four addresses. The Sequence Control field is used to indicate
the higher layer frame number and fragment number to help with defragmentation and detection
of duplicate frames. Finally, the FCS field provides the frame checksum to verify the integrity of a
received frame.
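As an illustration of the layout in Figure 4.2, the sketch below splits the first 24 bytes of a frame into Frame Control, Duration, three addresses and Sequence Control, using Python's standard `struct` module. The little-endian byte order of the 2-byte fields is an assumption of this sketch, as are the function names and the sample values; Address 4, the payload and the FCS are not parsed.

```python
import struct

# 2 + 2 + 6 + 6 + 6 + 2 = 24 bytes; "<" means little-endian (assumed) with no padding.
HEADER = struct.Struct("<HH6s6s6sH")


def format_address(raw):
    """Render a 6-byte MAC address as colon-separated hexadecimal."""
    return ":".join(f"{byte:02x}" for byte in raw)


def parse_header(frame):
    """Split the first 24 bytes of a frame into the leading fields of Figure 4.2."""
    fc, duration, addr1, addr2, addr3, seq = HEADER.unpack_from(frame)
    return {
        "frame_control": fc,
        "duration": duration,
        "address1": format_address(addr1),   # receiver
        "address2": format_address(addr2),   # transmitter
        "address3": format_address(addr3),
        "sequence_control": seq,
    }


# Made-up sample frame, built with the same layout.
sample = HEADER.pack(
    0x0008, 44,
    bytes.fromhex("aabbccddeeff"), bytes.fromhex("112233445566"),
    bytes.fromhex("aabbccddeeff"), 0x0010,
) + b"payload"
fields = parse_header(sample)
```

In this made-up frame, Address 3 coincides with Address 1, as in the home installation described above.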
4.3 PHYSICAL LAYER
WiFi networks operate in the unlicensed spectrum bands. In the USA, these bands are centered
around 2.4 or 5 GHz. Table 4.2 lists different versions of WiFi networks and their key attributes.
Table 4.2: Different Types of WiFi Networks.
Key Standard   Max Rate   Spectrum (U.S.)   Year
802.11         2 Mbps     2.4 GHz           1997
802.11a        54 Mbps    5 GHz             1999
802.11b        11 Mbps    2.4 GHz           1999
802.11g        54 Mbps    2.4 GHz           2003
802.11n        600 Mbps   2.4 and 5 GHz     2009
To get an idea of the frequency planning used, let us consider the IEEE 802.11b based WiFi
networks. Here the available spectrum around 2.4 GHz is divided into 11 channels with 5 MHz
separation between the consecutive center frequencies. In order to minimize co-channel interference,
channels 1, 6 and 11 are commonly used, giving a channel separation of 25 MHz. Adjacent BSSs
typically use different channels with maximum separation to minimize interference. Optimal channel
assignment for different BSSs in a building or a campus is a non-trivial problem. In a relatively large
building or campus (e.g., a department building on a university campus), it is common to have an
Extended Service Set (ESS) created by linking BSSs. In such an environment, it is possible for a
device to move from one BSS to the next in a seamless fashion.
WiFi networks are based on complex Physical Layer technologies. For example, the WiFi networks
based on the IEEE 802.11a standard make use of Orthogonal Frequency Division Multiplexing
(OFDM). See (58; 57; 58; 17) for further details on the Physical Layer.
4.4 EFFICIENCY ANALYSIS OF MAC PROTOCOL
In this section, we examine the efficiency of the WiFi MAC protocol, defined as the data throughput
that the stations can achieve. We first consider the case of a single active station. The analysis
determines the data rate and concludes that only 58% of the bit rate is used for data, the rest being
overhead or idle time between frames. We then examine the case of multiple stations and explain a
simplified model due to Bianchi (8).
4.4.1 SINGLE DEVICE
To illustrate the efficiency of WiFi networks, we first consider the scenario where a single device is
continuously either transmitting to or receiving from the AP 1500 bytes of data payload without
using the RTS/CTS messages. Note that since only one device is involved, there are no channel
collisions. We neglect propagation delay and assume that there are no losses due to channel errors. In
this illustration, we consider an IEEE 802.11b based WiFi network. Recall that for such a network,
the basic system parameters are as follows: Channel Bit Rate = 11 Mbps, SIFS = 10 µs, Slot Time =
20 µs, and DIFS = 50 µs. As described in the medium access rules, after each successful transmission,
the device performs a backoff for a number of Slot Times uniformly chosen in the set {0, 1, ..., 31}.
Figure 4.3 shows how the channel is used in this scenario. The Preamble and Physical Layer Convergence
Procedure (PLCP) Header shown here are the Physical Layer overheads transmitted at 1 Mbps.
Figure 4.3: Channel Usage for Single User Scenario.
Let us calculate the data throughput for this scenario neglecting the propagation time. First,
observe that the average backoff time for each frame corresponds to the average of 15.5 Slot Times,
i.e., 15.5 × 20 = 310 µs. From Figure 4.3, note that the other overheads for each data frame are DIFS,
Physical Layer overhead for data frame, MAC header and CRC, SIFS, Physical Layer overhead for
ACK, and MAC Sublayer ACK. One can now easily calculate that the overheads mentioned above,
including the backoff time, add up to a total of 788.9 µs on average, and that the data payload takes
1500 × 8/11 = 1090.9 µs. This amounts to a data throughput of 1500 × 8/(788.9 + 1090.9) =
6.38 Mbps, or an overall efficiency of 6.38/11 = 58%.
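The arithmetic above can be checked with a short script. This is only a sketch: the text states the total overhead (788.9 µs) but not its breakdown, so the per-item sizes below (192 µs of Preamble plus PLCP Header per frame, a 34-byte MAC header and CRC, and a 14-byte ACK, the last two sent at 11 Mbps) are assumptions chosen because they reproduce that total.

```python
# Single-station 802.11b throughput, following the calculation in the text.
# ASSUMPTION: the per-item overhead sizes below are not stated in the text;
# they are illustrative values that reproduce its 788.9 us total.
BIT_RATE = 11.0                        # Mbps, i.e., bits per microsecond
SIFS, SLOT, DIFS = 10.0, 20.0, 50.0    # microseconds
PHY_OVERHEAD = 192.0                   # us, Preamble + PLCP Header (assumed)
MAC_OVERHEAD_BYTES = 34                # MAC header + CRC (assumed)
ACK_BYTES = 14                         # MAC Sublayer ACK (assumed)
PAYLOAD_BYTES = 1500

def tx_time_us(num_bytes, rate_mbps=BIT_RATE):
    """Time in microseconds to send num_bytes at rate_mbps."""
    return num_bytes * 8 / rate_mbps

# Backoff is uniform on {0, ..., 31} slots, so its mean is 15.5 slots.
avg_backoff = sum(range(32)) / 32 * SLOT
overhead = (DIFS + avg_backoff + PHY_OVERHEAD + tx_time_us(MAC_OVERHEAD_BYTES)
            + SIFS + PHY_OVERHEAD + tx_time_us(ACK_BYTES))
payload_time = tx_time_us(PAYLOAD_BYTES)
throughput = PAYLOAD_BYTES * 8 / (overhead + payload_time)   # Mbps
efficiency = throughput / BIT_RATE

print(f"overhead = {overhead:.1f} us, payload = {payload_time:.1f} us")
print(f"throughput = {throughput:.2f} Mbps, efficiency = {efficiency:.0%}")
```

Changing PAYLOAD_BYTES shows how quickly the efficiency drops for short frames, since the overhead per frame stays fixed.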
4.4.2 MULTIPLE DEVICES
G. Bianchi (8) has provided an insightful analysis of the WiFi MAC protocol. He considers the
scenario with n WiFi devices that always have data to transmit, and analyzes the Markov chain
{s(k), b(k)} for a given device, where s(k) denotes the number of previous attempts at transmitting
the pending packet at this device and b(k) denotes the backoff counter, both at the epoch k. (See
the appendix for a brief tutorial on Markov chains.) This analysis also applies to the scenario where
n − 1 devices and the AP have data to transmit. The Markov chain is embedded at the epochs k
when the device adjusts its backoff counter. The transition probabilities of this Markov chain are
depicted in Figure 4.4, adapted from (8). $W_i$ and $m$ here relate to $CW_{\min}$ and $CW_{\max}$ in (4.1) as
follows: $W_i = 2^i (CW_{\min} + 1)$ and $CW_{\max} + 1 = 2^m (CW_{\min} + 1)$.
[Diagram: rows of states $(0, 0), \ldots, (0, W_0 - 1)$; $(i, 0), \ldots, (i, W_i - 1)$; and $(m, 0), \ldots, (m, W_m - 1)$.
Within a row, the backoff counter decrements with probability 1. Transitions into each state of row 0
have probability $(1 - p)/W_0$; transitions from $(i - 1, 0)$ into each state of row $i$ have probability
$p/W_i$; transitions from $(m, 0)$ into each state of row $m$ have probability $p/W_m$.]
Figure 4.4: Transition probabilities for the backoff window size Markov chain.
To understand this diagram, consider a station that gets a new packet to transmit. The state
of that station is the node at the top of the diagram. After waiting for DIFS, the station computes
a backoff value uniformly in {0, 1, ..., 31}. The state of the station is then one of the states
(0, 0), (0, 1), ..., (0, 31) in the diagram. The first component of the state (0) indicates that the
station has made no previous attempt at transmitting the pending packet. The second component of
the state is the value of the backoff counter from the set {0, 1, ..., 31}. The station then decrements
its backoff counter at suitable epochs (when no other station transmits and the channel is idle for a
Slot Time, or after a DIFS and a Slot Time¹ following the channel busy condition). For instance,
the state can then move from (0, 5) to (0, 4). Eventually, the backoff counter gets to 0, so that
the state is (0, 0) and the station transmits. If the transmission collides, then the station computes
another backoff value X, now uniformly distributed in {0, 1, ..., 63}. The state of the station then
jumps to (1, X) where the first component indicates that the station already made one attempt. If
the transmission succeeds, the state of the station jumps to the top of the diagram. The other states
are explained in the same way.
The key simplifying assumption in this model is that, when the station transmits, it is successful
with probability $1 - p$ and collides with probability $p$, independently of the state (i, 0) of the station
and of its previous sequence of states. This assumption makes it possible to analyze one station in
isolation, and is validated using simulations. By analyzing the Markov chain, one can find the
probability that its state is of the form (i, 0) and that the station then attempts to transmit. Let us
call this probability $\tau$. That is, $\tau$ is the probability that at a given transmission opportunity (i.e.,
when its backoff counter reaches 0), this station attempts to transmit. Assume that all the n stations
have the same average probability $\tau$ of transmitting. Assume also that their transmission attempts
are all independent. The probability $1 - p$ that this station succeeds (and does not collide) is then
the probability that the other $n - 1$ stations do not transmit. That is,
$$1 - p = (1 - \tau)^{n-1}.$$
Indeed, we explained in Section 3.10 that the probabilities of independent events multiply.
Summing up, the analysis proceeds as follows. One assumes that the station has a probability
$1 - p$ of success when it transmits. Using this value of $p$, one solves the Markov chain and derives
the attempt probability $\tau$ for each station. This $\tau$ depends on $p$, so that $\tau = \tau(p)$, some function
of $p$. Finally, one solves $1 - p = (1 - \tau(p))^{n-1}$. This is an equation in one unknown variable $p$.
Having solved for $p$, we know the value of the attempt probability $\tau(p)$.
We outline below how the Markov chain is solved to obtain $\tau(p)$. Let $\pi$ and $P$ denote the
invariant distribution and the probability transition matrix for the Markov chain, respectively. See
the appendix for an introduction to these concepts. The transition probabilities corresponding to $P$
are shown in Figure 4.4. By definition of the invariant distribution, we have $\pi = \pi P$. This gives
$$\pi(x) \sum_{y \neq x} P(x, y) = \sum_{y \neq x} \pi(y) P(y, x) \quad \text{for each } x.$$
This identity can be interpreted to say that the "total flow" in and out of a state are equal in
equilibrium. These are referred to as the balanced equations for the Markov chain. For the Markov
chain under consideration, recall that the state is the two dimensional vector {s, b} defined earlier.
Applying the balanced equations and using the transition probabilities shown in Figure 4.4, it can
be seen that
$$\pi(i - 1, 0)\, p = \pi(i, 0) \;\Rightarrow\; \pi(i, 0) = p^i \pi(0, 0) \quad \text{for } 0 < i < m. \tag{4.2}$$
¹ This additional Slot Time has a negligible impact on the results and is neglected in (8).
Identity (4.2) is obtained by recursive applications of the balanced equations. Observe that,
using the balanced equations $\pi(i - 1, 0) \frac{p}{W_i} = \pi(i, W_i - 1)$ and
$\pi(i, W_i - 2) = \pi(i - 1, 0) \frac{p}{W_i} + \pi(i, W_i - 1)$, we can obtain
$\pi(i, W_i - 2) = \pi(i - 1, 0) \frac{2p}{W_i}$. Continuing this way, we finally get
$\pi(i - 1, 0)\, p = \pi(i, 0)$. Applying the balanced equations recursively in a similar way and after some
algebraic manipulations (see (8)), we get
$$\pi(m - 1, 0)\, p = (1 - p)\, \pi(m, 0) \;\Rightarrow\; \pi(m, 0) = \frac{p^m}{1 - p}\, \pi(0, 0), \tag{4.3}$$
and
$$\pi(i, k) = \frac{W_i - k}{W_i}\, \pi(i, 0), \quad \text{for } i \in \{0, \ldots, m\} \text{ and } k \in \{0, \ldots, W_i - 1\}. \tag{4.4}$$
Using the identities (4.2), (4.3) and (4.4), we find an expression for each $\pi(i, k)$ in terms
of $\pi(0, 0)$, and then using the fact that $\sum_i \sum_k \pi(i, k) = 1$, we solve for $\pi(0, 0)$, and hence each
$\pi(i, k)$. Recalling that
$$\tau = \sum_{i=0}^{m} \pi(i, 0),$$
we finally find
$$\tau(p) = \frac{2(1 - 2p)}{(1 - 2p)(W + 1) + pW(1 - (2p)^m)},$$
where $W = CW_{\min} + 1$.
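The fixed-point step of the analysis (solve for p, then read off the attempt probability) can be sketched numerically. The function names and the example parameters below are illustrative assumptions. The attempt probability is coded in the algebraically equivalent form 2/(1 + W + pW Σ_{k<m} (2p)^k), obtained by dividing numerator and denominator of the expression above by (1 − 2p), which avoids the 0/0 at p = 1/2.

```python
# Fixed point of the model: find p with 1 - p = (1 - tau(p))**(n - 1).
# ASSUMPTION: names and example parameters are illustrative; tau is written
# in the equivalent geometric-sum form to avoid dividing by (1 - 2p).

def tau(p, W, m):
    """Attempt probability for collision probability p, with W = CWmin + 1."""
    geometric_sum = sum((2 * p) ** k for k in range(m))
    return 2.0 / (1 + W + p * W * geometric_sum)

def solve_collision_probability(n, W, m, tol=1e-12):
    """Bisection on g(p) = 1 - p - (1 - tau(p))**(n - 1), for n >= 2."""
    def g(p):
        return 1 - p - (1 - tau(p, W, m)) ** (n - 1)
    lo, hi = 0.0, 1.0            # g is decreasing, g(0) > 0 and g(1) < 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

n, W, m = 10, 32, 5              # example: 10 stations, CWmin = 31, CWmax = 1023
p = solve_collision_probability(n, W, m)
print(f"p = {p:.4f}, tau = {tau(p, W, m):.4f}")
```

Bisection is used here because g(p) is monotone on [0, 1]; any other one-dimensional root finder would serve equally well.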
Using $\tau$ we calculate the throughput of the network as follows. The time duration between
two consecutive epochs of the Markov chain has a successful transmission with probability
$\alpha := n\tau(1 - \tau)^{n-1}$. Let T be the average length of this duration. The rate of bit transmissions is then
given by $\alpha B/T$ where B is the average number of bits transmitted during a successful transmission.
Indeed, during a typical duration, 0 bits are transmitted with probability $1 - \alpha$ and an average of B
bits are transmitted with probability $\alpha$. Thus, $\alpha B$ is the average number of bits transmitted in an
average duration of T.
To calculate T, one observes that T corresponds to either an idle Slot Time or a duration that
contains one or more simultaneous transmissions. If the duration contains exactly one transmission,
T corresponds to the transmission time of a single packet, a SIFS, the transmission time of the
ACK, and a DIFS. In the case of more than one transmission in this duration, T corresponds to
the longest of the transmission times of the colliding packets and a DIFS. See (8) for the details
regarding computation of T.
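A minimal sketch of the throughput calculation follows, assuming the three cases just described (idle slot, exactly one transmission, collision) with fixed durations. The function name and the durations in the example are hypothetical inputs, not values from (8).

```python
# Throughput = (success probability) * B / T for n stations that each
# attempt with probability tau in a given epoch.
# ASSUMPTION: t_success and t_collision (in microseconds) are supplied by
# the caller; the text defers their exact computation to (8).

def throughput_mbps(n, tau, payload_bits, slot, t_success, t_collision):
    p_idle = (1 - tau) ** n                       # nobody transmits
    p_success = n * tau * (1 - tau) ** (n - 1)    # exactly one station transmits
    p_collision = 1 - p_idle - p_success          # two or more transmit
    T = p_idle * slot + p_success * t_success + p_collision * t_collision
    return p_success * payload_bits / T           # bits per microsecond = Mbps

rate = throughput_mbps(n=2, tau=0.5, payload_bits=8000,
                       slot=20.0, t_success=1500.0, t_collision=1300.0)
print(f"{rate:.3f} Mbps")
```

With these example inputs the average duration is 0.25 × 20 + 0.5 × 1500 + 0.25 × 1300 = 1080 µs, so the rate is 4000/1080 Mbps.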
Simulations in (8) confirm the quality of the analytical approximation. That paper also extends
the main analysis approach outlined here and presents extensive analysis of many other interesting
aspects, including an analysis of the theoretical maximum throughput achievable if the Contention
Window were not constrained by the standard's specifications and were picked optimally
depending on n.
4.5 APPENDIX: MARKOV CHAINS
A Markov chain describes the state $X_n$ at time n = 0, 1, 2, ... of a system that evolves randomly in a
finite set $\mathcal{X}$ called the state space. (The case of an infinite state space is more complex.) The rules of
evolution are specified by a matrix $P = [P(i, j), i, j \in \mathcal{X}]$ called the transition probability matrix.
This matrix has nonnegative entries and its rows sum to one. That is,
$$P(i, j) \geq 0, \ \forall i, j \in \mathcal{X} \quad \text{and} \quad \sum_j P(i, j) = 1, \ \forall i \in \mathcal{X}.$$
When the state is i, it jumps to j with probability P(i, j), independently of its previous values. That
is,
$$P[X_{n+1} = j \mid X_n = i, X_m, m < n] = P(i, j), \ \forall i, j \in \mathcal{X}, \ n = 0, 1, 2, \ldots.$$
This expression says that the probability that $X_{n+1} = j$ given that $X_n = i$ and given the previous
values $\{X_m, m < n\}$ is equal to P(i, j). You can imagine that if $X_n = i$, then the Markov chain rolls a
die to decide where to go next. The die selects the value j with probability P(i, j). The probabilities
of the possible different values add up to one.
Figure 4.5: Two Markov chains: Irreducible (left) and reducible (right).
Figure 4.5 illustrates two Markov chains with state space $\mathcal{X} = \{1, 2, 3\}$. The one on the left
corresponds to the transition probability matrix P given below:
$$P = \begin{bmatrix} 0 & 1 & 0 \\ 0.6 & 0 & 0.4 \\ 0.3 & 0 & 0.7 \end{bmatrix}.$$
The first row of this matrix states that P(1, 1) = 0, P(1, 2) = 1, P(1, 3) = 0, which is consistent
with the diagram in Figure 4.5. In that diagram, an arrow from i to j is marked with P(i, j). If
P(i, j) = 0, then there is no arrow from i to j. The diagram is called the state transition diagram of
the Markov chain.
The two Markov chains are similar, but they differ in a crucial property. The Markov chain
on the left can go from any i to any j, possibly in multiple steps. For instance, it can go from 1 to 3
in two steps: from 1 to 2, then from 2 to 3. The Markov chain on the right cannot go from 1 to 2.
We say that the Markov chain on the left is irreducible. The one on the right is reducible.
Assume that the Markov chain on the left starts at time 0 in state 1 with probability $\pi(1)$,
in state 2 with probability $\pi(2)$, and in state 3 with probability $\pi(3)$. For instance, say that
$\pi = [\pi(1), \pi(2), \pi(3)] = [0.3, 0.2, 0.5]$. What is the probability that $X_1 = 1$? We find that probability
as follows:
$$P(X_1 = 1) = P(X_0 = 1)P(1, 1) + P(X_0 = 2)P(2, 1) + P(X_0 = 3)P(3, 1)
= 0.3 \times 0 + 0.2 \times 0.6 + 0.5 \times 0.3 = 0.27. \tag{4.5}$$
Indeed, there are three exclusive ways that $X_1 = 1$: Either $X_0 = 1$ and the Markov chain jumps
from 1 to 1 at time 1, or $X_0 = 2$ and the Markov chain jumps from 2 to 1 at time 1, or $X_0 = 3$ and
the Markov chain jumps from 3 to 1 at time 1. The identity above expresses that the probability
that $X_1 = 1$ is the sum of the probabilities of the three exclusive events that make up the event
that $X_1 = 1$. This is consistent with the fact that the probabilities of exclusive events add up, as we
explained in Section 3.10.2.
More generally, if $\pi$ is the (row) vector of probabilities of the different states at time 0, then
the row vector of probabilities of the states at time 1 is given by $\pi P$, the product of the row vector
$\pi$ by the matrix P. For instance, the expression (4.5) is the first component of $\pi P$.
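The one-step update can be reproduced in a few lines. This sketch uses plain Python lists; the helper name `step` is ours.

```python
# One step of a Markov chain: the row vector pi times the matrix P.
P = [[0.0, 1.0, 0.0],
     [0.6, 0.0, 0.4],
     [0.3, 0.0, 0.7]]

def step(pi, P):
    """Return the row vector pi P (the distribution one time step later)."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

pi0 = [0.3, 0.2, 0.5]
pi1 = step(pi0, P)
print([round(x, 2) for x in pi1])   # first component is 0.27, as in (4.5)
```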
In our example, $\pi P \neq \pi$. For instance, $P(X_1 = 1) = 0.27 \neq P(X_0 = 1) = 0.3$. That is, the
probabilities of the different states change over time. However, if one were to start the Markov chain
at time 0 with probabilities $\pi$ such that $\pi P = \pi$, then the probabilities of the states would not
change. We call such a vector of probabilities an invariant distribution. The following result is all
we need to know about Markov chains.
Theorem 4.1
Let $X_n$ be an irreducible Markov chain on a finite state space $\mathcal{X}$ with transition probability matrix
P. The Markov chain has a unique invariant distribution $\pi$. Moreover, the long term fraction of time that
$X_n = i$ is $\pi(i)$ for $i \in \mathcal{X}$, independently of the distribution of $X_0$.
For our Markov chain on the left of Figure 4.5, we find the invariant distribution $\pi$ by solving
$$\pi P = \pi \quad \text{and} \quad \pi(1) + \pi(2) + \pi(3) = 1.$$
After some algebra, one finds $\pi = [0.3, 0.3, 0.4]$.
The long-term fraction of time that $X_n = 1$ is then equal to 0.3 and this fraction of time does
not depend on how one starts the Markov chain.
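The invariant distribution can also be approximated numerically by applying the update π ← πP repeatedly. This is a sketch rather than an exact solver. It converges for this particular chain because state 3 can stay in place, so the chain is aperiodic as well as irreducible; the iteration count is an arbitrary choice.

```python
# Invariant distribution of the left chain of Figure 4.5 by repeated
# multiplication pi <- pi P (a numerical sketch, not an exact solver).
P = [[0.0, 1.0, 0.0],
     [0.6, 0.0, 0.4],
     [0.3, 0.0, 0.7]]

def invariant_distribution(P, iterations=500):
    n = len(P)
    pi = [1.0 / n] * n                  # the starting distribution does not matter here
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

pi = invariant_distribution(P)
print([round(x, 4) for x in pi])
```

The result agrees with the vector obtained by algebra above.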
The Markov chain on the right of Figure 4.5 is reducible, so the theorem does not apply.
However, it is quite clear that after a few steps, the Markov chain ends up being in the subset {1, 3}
of the state space. One can then consider the Markov chain reduced to that subset, and it is now
irreducible. Applying the theorem, we find that there is a unique invariant distribution, and we can
calculate it to be [3/8, 0, 5/8].
For details on Markov chains, see (21).
4.6 SUMMARY
• A WiFi BSS is a set of devices communicating over a common channel using the WiFi MAC
  protocol. We focus on the infrastructure mode WiFi networks where all communication takes
  place via the AP.
• WiFi networks operate in an unlicensed band around 2.4 or 5 GHz. In the 2.4 GHz band, they
  typically operate over channels 1, 6 and 11.
• IEEE 802.11, 802.11a, 802.11b, 802.11g, and 802.11n are the key specifications with different
  Physical Layers for WiFi networks. These specifications support Physical Layer rates up to
  600 Mbps.
• The MAC Sublayer in WiFi networks is based on the CSMA/CA techniques. DCF is its
  prevalent mode of operation.
• Medium access is regulated using different IFS parameters, i.e., SIFS, DIFS and EIFS.
• The MAC header of a data frame contains up to four MAC addresses. In a BSS with the AP having
  integrated router functions, only two of these addresses are relevant.
• The RTS/CTS mechanism resolves the hidden terminal problem.
• Virtual Carrier Sensing using NAV is an important MAC Sublayer enhancement for avoiding
  collisions.
• Given the high MAC Sublayer and Physical Layer overheads, network efficiency (equivalently,
  system throughput) is an important characterization for WiFi networks.
4.7 PROBLEMS
Figure 4.6: Figure for WiFi Problem 1.
P4.1 Consider a wireless network shaped like a pentagon. The wireless nodes are shown at the
vertices A, B, C, D, and E, and the nodes are placed such that each node can talk only to its
two neighbors as shown. Thus there are 10 unidirectional wireless links in this network. Assume
that the nodes employ RTS/CTS and also require ACKs for successful transmission.
Consider a situation when A is transmitting a packet to B. Obviously, link A→B is active, and
all links that are affected by this transmission must keep quiet. Considering RTS/CTS and
ACKs, indicate which of the other links could also be active at the same time. In other words,
indicate which of the other links could be simultaneously transmitting.
P4.2 Consider a Wireless LAN (WLAN) operating at 11 Mbps that follows the 802.11 MAC
protocol with the parameters described below. All the data frames are of a fixed size. Assume
that the length of the average backoff time is 1.5T Slot Times where T is the transmission
time (in Slot Times), including the overhead, of the frame contending for medium access.
Assume the parameters for this WLAN as follows: Slot Time = 20 µs, DIFS = 2.5 Slot Times,
SIFS = 0.5 Slot Times, RTS = CTS = ACK = 10 Slot Times (including overhead), and data
frame overhead = 9 Slot Times.
Determine the threshold value (in bytes) for the data payload such that the data transfer
efficiency is greater with the RTS/CTS mechanism for frames with data payload larger than
this threshold.
P4.3 Suppose there are only two nodes, a source and a destination, equipped with 802.11b wireless
LAN radios that are configured to use RTS/CTS for packets of all sizes. The source node has
a large amount of data to send to the destination node. Also, the nodes are separated by 750
meters and have powerful enough radios to communicate at this range. No other nodes operate
in the area. What will be the data throughput between source and destination assuming that
packets carry 1100 bytes of data?
Assume the following parameters:
• Propagation speed is $3 \times 10^8$ meters/sec.
• Slot Time = 20 µs, SIFS = 10 µs, DIFS = 50 µs and $CW_{\min}$ = 31 Slot Times.
• The preamble, the Physical Layer header, the MAC header and trailer take a combined
  200 µs per packet to transmit.
• The data is transmitted at 11 Mbps.
• ACK, RTS, and CTS each take 200 µs to transmit.
P4.4 Consider the Markov chain model due to G. Bianchi (8) discussed in the chapter for the
scenario where only the Access Point and a single WiFi device have unlimited data to send to
each other. Assume that $CW_{\min} = CW_{\max} = 1$ and Slot Time = 20 µs.
(a) Draw the transition probability diagram for the Markov chain.
(b) Derive the invariant distribution for the Markov chain.
(c) Calculate the probability that a duration between two consecutive epochs of the Markov
chain would have a successful transmission.
(d) Assume that each WiFi data frame transports 1000 bytes of payload, and that a duration
between two consecutive Markov chain epochs with a successful transmission and one
involving a collision are 60 and 50 Slot Times long, respectively. Calculate the network
throughput in Mbps as seen by the upper layers.
4.8 REFERENCES
The specifications of the IEEE 802.11 standards can be found in (56), (57), and (58). The text by
Gast (17) describes these standards. The efficiency model is due to Bianchi (8).
CHAPTER 5
Routing
This chapter explores how various networks determine the paths that packets should follow.
Internet routers use a two-level scheme: inter-domain and intra-domain routing. The intra-domain
algorithms find the shortest path to the destination. These mechanisms extend easily to
anycast routing when a source wants to reach any one destination in a given set. We also discuss
multicast routing, which delivers packets to all the destinations in a set. The inter-domain routing
uses an algorithm where each domain selects a path to a destination domain based on preference
policies.
We conclude with a discussion of routing in ad hoc wireless networks.
5.1 DOMAINS AND TWO-LEVEL ROUTING
The nodes in the Internet are grouped into about 40,000 (in 2009) Autonomous Systems (domains)
that are under the management of separate entities. For instance, the Berkeley campus network is one
domain, so is the MIT network, and so are the AT&T, Verizon, and Sprint cross-country networks.
The routing in each domain is performed by a shortest-path protocol called an intra-domain routing
protocol. The routing across domains is implemented by an inter-domain routing protocol. Most
domains have only a few routers. A few large domains have hundreds or even thousands of routers.
A typical path in the Internet goes through fewer than half a dozen domains. The Internet has a
small-world topology where two domains are only a few hand-shakes away from each other, like we
all are.
There are two reasons for this decomposition. The first one is scalability. Intra-domain shortest
path routing requires a detailed view of the domain; such a detailed view is not possible for the full
Internet. The second reason is that domain administrators may prefer, for economic, reliability,
security, or other reasons, to send traffic through some domains rather than others. Such preferences
require a different type of protocol than the strict shortest path intra-domain protocols.
5.1.1 SCALABILITY
The decomposition of the routing into two levels greatly simplifies the mechanisms. To appreciate
that simplification, say that finding a shortest path between a pair of nodes in a network with N nodes
requires each node to send about N messages to the other nodes. In that case, if a network consists
of M domains of N nodes each, a one-level shortest path algorithm requires each node to send
about MN messages. On the other hand, in a two-level scheme, each node sends N messages to the
other nodes in its domain and one representative node in each domain may send M messages to the
representatives in the other domains.
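A quick count illustrates the saving. The per-node figures follow the rough model above; the network-wide totals, the function names, and the example sizes are our own extrapolation, given here only as a sketch.

```python
# Message counts for one-level versus two-level routing, using the rough
# model in the text. ASSUMPTION: the network-wide totals are an
# extrapolation of the per-node counts; M and N below are example values.

def one_level_total(M, N):
    nodes = M * N
    return nodes * (M * N)          # every node sends about M*N messages

def two_level_total(M, N):
    intra = M * N * N               # every node sends about N messages
    inter = M * M                   # one representative per domain sends M
    return intra + inter

M, N = 100, 50
print(one_level_total(M, N), two_level_total(M, N))
```

For 100 domains of 50 nodes each, the one-level count is 25,000,000 messages against 260,000 for the two-level scheme.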
5.1.2 TRANSIT AND PEERING
The different Internet Service Providers (ISPs) who own networks make agreements to carry each
other's traffic. There are two types of agreement: peering and transit. In a peering agreement, the
ISPs reciprocally provide free connectivity to each other's local or inherited customers. In a transit
agreement, one ISP provides (usually sells) access to all destinations in its routing table.
Figure 5.1: Peering (left) and transit (right) agreements.
Figure 5.1 illustrates these agreements. The left part of the figure shows three ISPs (ISP
A, ISP B, ISP C). ISP A and ISP B have a peering agreement. Under this agreement, ISP A
announces the routes to its customers ($A_1, \ldots, A_k$). Similarly, ISP B announces the routes to its
customers ($B_1, \ldots, B_m$). The situation is similar for the peering agreement between ISP B and ISP
C. Note, however, that ISP B does not provide connectivity between ISP A and ISP C.
The right part of Figure 5.1 shows transit agreements between ISP A and ISP B and between
ISP B and ISP C. Under this agreement, ISP B announces the routes to all the customers
it is connected to and agrees to provide connectivity to all those customers. Thus, ISP B provides
connectivity between ISP A and ISP C (for a fee).
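The difference between the two agreements can be expressed as a rule for which routes an ISP announces to a neighbor. The sketch below is a deliberate simplification: the function, its arguments, and the customer names are hypothetical, and real export policies are considerably richer.

```python
# Which routes an ISP announces to a neighbor under each agreement type.
# ASSUMPTION: the function, its arguments, and the customer names are
# hypothetical illustrations of the two rules described in the text.

def announced_routes(agreement, own_customers, routes_from_other_neighbors):
    if agreement == "peering":
        # Only the ISP's own customers are reachable through a peering link.
        return set(own_customers)
    if agreement == "transit":
        # The transit provider offers every destination it knows about.
        return set(own_customers) | set(routes_from_other_neighbors)
    raise ValueError(f"unknown agreement type: {agreement}")

b_customers = {"B1", "B2"}
learned_from_c = {"C1", "C2"}
print(sorted(announced_routes("peering", b_customers, learned_from_c)))
print(sorted(announced_routes("transit", b_customers, learned_from_c)))
```

Under peering, ISP B announces only B1 and B2 to ISP A, so A cannot reach C's customers through B; under transit, B also announces C1 and C2.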
Figure 5.2 shows a fairly typical situation. Imagine that ISP A and ISP B are two campuses
of some university. ISP C is the ISP that provides connectivity between the campuses and the rest of
the Internet, as specified by the transit agreement between ISP C and the two campuses. By entering
in a peering agreement, ISP A and ISP B provide direct connectivity between each other without
having to go through ISP C, thus reducing their Internet access cost. Note that the connection
between ISP A and ISP C does not carry traffic from ISP B, so that campus A does not pay transit
fees for campus B.
5.2 INTER-DOMAIN ROUTING
At the inter-domain level, each domain essentially looks like a single node. The inter-domain routing
problem is to choose a path across those nodes to go from one source domain to a destination domain.
For instance, the problem is to choose a sequence of domains to go from the Berkeley domain to the
MIT domain. A natural choice would be the path with the fewest hops. However, Berkeley might
prefer to send its traffic through the Sprint domain because that domain is cheaper or more reliable
than the AT&T domain.¹
For that reason, the inter-domain protocol of the Internet, currently the Border Gateway
Protocol, or BGP, is based on a path vector algorithm.
Figure 5.2: Typical agreements.
5.2.1 PATH VECTOR ALGORITHM
The most widely deployed inter-domain routing protocol in the Internet is the Border Gateway Protocol,
which is based on a path vector algorithm. When using a path vector algorithm, as shown in Figure
5.3, the routers advertise to their neighbors their preferred path to the destination. The figure
considers destination D. Router D advertises to its neighbors B and C that its preferred path to D
is D. Router B then advertises to A that its preferred path to D is BD; similarly, C advertises to A
that its preferred path to D is CD. Router A then selects its preferred path to D among ABD and
ACD. Since the full paths are specified, router A can base its preference not only on the number
of hops but also on other factors, such as existing transit or peering agreements, pricing, security,
reliability, and various other considerations. In the figure, we assume that router A prefers the path
ACD to ABD. In general, the preferred path does not have to be the shortest one.
Figure 5.3: Path vector algorithm.
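The exchange described for Figure 5.3 can be sketched as follows. The synchronous rounds, the ranking function, and the data layout are illustrative assumptions and are not part of BGP itself; the only preference taken from the text is that router A prefers ACD.

```python
# Path vector sketch for the four-router example: paths to destination D.
# ASSUMPTION: the synchronous rounds and the ranking function are
# illustrative; only A's preference for ACD comes from the text.
NEIGHBORS = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}

def rank(router, path):
    """Lower is better. Router A prefers ACD; otherwise fewer hops win."""
    if router == "A" and path == ["A", "C", "D"]:
        return (0, path)
    return (len(path), path)

def path_vector(dest="D", rounds=4):
    best = {dest: [dest]}                   # D's preferred path to D is D
    for _ in range(rounds):
        updated = dict(best)
        for router, neighbors in NEIGHBORS.items():
            if router == dest:
                continue
            # Prepend ourselves to each advertised path; drop paths with loops.
            candidates = [[router] + best[nb] for nb in neighbors
                          if nb in best and router not in best[nb]]
            if candidates:
                updated[router] = min(candidates, key=lambda p: rank(router, p))
        best = updated
    return best

result = path_vector()
print({r: "".join(p) for r, p in result.items()})
```

Because every advertisement carries the full path, a router can discard any path that already contains itself, which is how loops are avoided.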
¹ This situation is hypothetical. All resemblance to actual events is purely accidental and unintentional.
Figure 5.4 provides an illustration.
Figure 5.4: Routing across domains: Inter-domain routing.
The inter-domain protocol of the Internet enables network managers to specify preferences
other than shortest path. Using that protocol, the Berkeley domain is presented with two paths
to the MIT domain: Sprint-Verizon-MIT and AT&T-MIT. The Berkeley manager can specify a
rule that states, for instance: 1) If possible, find a path that does not use AT&T; 2) Among the
remaining paths, choose the one with the fewest hops; 3) In case of a tie, choose the next domain in
alphabetical order. With these rules, the protocol would choose the path Sprint-Verizon-MIT. (In
the actual protocol, the domains are represented by numbers, not by names.)
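The three rules of this example translate directly into a small selection function. This is a sketch: the function name is hypothetical, and paths are written as lists of domain names for readability.

```python
# The Berkeley manager's three rules applied to the advertised paths.
# ASSUMPTION: paths are lists of domain names (the real protocol uses
# numbers), and the function name is hypothetical.

def choose_path(paths, avoid="AT&T"):
    # Rule 1: if possible, keep only the paths that do not use `avoid`.
    allowed = [p for p in paths if avoid not in p] or paths
    # Rule 2: fewest hops. Rule 3: break ties by the next domain's name.
    return min(allowed, key=lambda p: (len(p), p[0]))

advertised = [["Sprint", "Verizon", "MIT"], ["AT&T", "MIT"]]
print("-".join(choose_path(advertised)))   # Sprint-Verizon-MIT
```

If the only advertised path goes through AT&T, rule 1 cannot be satisfied and the function falls back to the full list, which matches the "if possible" wording.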
Summarizing, each domain has one or more representative routers and the representatives
implement BGP. Each representative has a set of policies that determine how it selects the preferred
path among advertised paths. These policies also specify which paths it should advertise, as explained
in the discussion on peering and transit.
5.2.2 POSSIBLE OSCILLATIONS
Whereas it is fairly clear that one can find a shortest path without too much trouble, possibly after
breaking ties, it is not entirely clear that a path-vector protocol converges if domain administrators
select arbitrary preference rules. In fact, simple examples show that some rules may lead to a lack of
convergence. Figure 5.5 illustrates such an example. The figure shows four nodes A, B, C, D that
represent domains. The nodes are fully connected. The preferences of the nodes for paths leading
to D are indicated in the figure. For instance, node A prefers the path ACD to the direct path AD.
Similarly, node C prefers CBD to CD and node B prefers BAD to BD. For instance, say that the
links to D are slower than the other links and that the preference is to go counter-clockwise to a faster
next link. Assume also that nodes prefer to avoid three-hop paths. The left part of the figure shows
that A advertises its preferred path ACD to node B. Node B then sees the path ACD advertised
by A, the direct path BD, and the path CBD that C advertises. Among these paths, B selects the
path BD since BACD would have three hops and going to C would induce a loop. The middle part
of the figure shows that B advertises its preferred path BD to node C. Given its preferences, node
C then selects the path CBD that it advertises to node A, as shown in the right part of the figure.
Thus, the same steps repeat and the choices of the nodes keep on changing at successive steps. For
instance, A started preferring ACD, then it prefers AD, and so on. This algorithm fails to converge.
[Three panels, each listing the preferences A: ACD, AD; B: BAD, BD; C: CBD, CD. Left: B should
select BD. Middle: C should select CBD. Right: A should select AD.]
Figure 5.5: A path-vector protocol may fail to converge.
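The lack of convergence in this example can be checked by brute force. The sketch below encodes the stated preferences under the assumption that a node can use its preferred two-hop path only while the neighbor it relies on currently uses its direct path to D (otherwise the path would have three hops). The update order B, C, A and the starting state are also assumptions.

```python
# The preference cycle of Figure 5.5. Each node uses its preferred two-hop
# path when the neighbor it relies on currently uses its direct path to D;
# otherwise it falls back to its own direct path.
# ASSUMPTION: the encoding, the update order, and the start state are ours.
PREFERRED = {"A": "ACD", "B": "BAD", "C": "CBD"}
DIRECT = {"A": "AD", "B": "BD", "C": "CD"}

def best_choice(node, state):
    via = PREFERRED[node][1]                # the neighbor this node relies on
    return PREFERRED[node] if state[via] == DIRECT[via] else DIRECT[node]

def is_stable(state):
    return all(best_choice(n, state) == state[n] for n in state)

# Exhaustive check over the 8 possible assignments of paths to A, B, C.
all_states = [{"A": a, "B": b, "C": c}
              for a in ("ACD", "AD") for b in ("BAD", "BD") for c in ("CBD", "CD")]
stable_states = [s for s in all_states if is_stable(s)]
print("stable states:", stable_states)

# Replay a sequence like the one in the text: B, C, and A update in turn.
state = {"A": "ACD", "B": "BD", "C": "CD"}
history = []
for step in range(9):
    node = "BCA"[step % 3]
    state[node] = best_choice(node, state)
    history.append((node, state[node]))
print(history)
```

The list of stable states is empty, so no assignment of paths satisfies all three nodes at once, and the replay shows A alternating between AD and ACD.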
The point of this example is that, although path-vector enables domain administrators to
specify preferences, this flexibility may result in poorly-behaving algorithms.
5.2.3 MULTI-EXIT DISCRIMINATORS
Figure 5.6 shows ISP A attached to ISP B by two connections. ISP A informs ISP B that the top
connection is closer to the destination X than the bottom connection. To provide that information,
ISP A attaches a discriminator to its local destinations. The discriminator represents a metric from
the router attached to the connection to the destination. With this information, the router R4 in
ISP B calculates that it should send traffic destined for X via the upper connection. ISP B does not
forward these discriminators.
Figure 5.6: The graph (left) and link state messages (right).
5.3 INTRA-DOMAIN SHORTEST PATH ROUTING
Inside a single domain, the routers use a shortest path algorithm. We explain two algorithms for
finding shortest paths: Dijkstra and Bellman-Ford. We also explain the protocols that use those
algorithms.
5.3.1 DIJKSTRA'S ALGORITHM AND LINK STATE
Figure 5.7 illustrates the operations of a link state algorithm.
[Diagram: routers A, B, C, D with link metrics A-B = 2, A-C = 1, B-D = 1, C-D = 3.
1) Exchange link states: A: [B, 2], [C, 1]; B: [A, 2], [D, 1]; C: [A, 1], [D, 3]; D: [B, 1], [C, 3].
2) Each router computes the shortest paths to the other routers and enters the results in its routing table.]
Figure 5.7: Link state routing algorithm.
When using a link state algorithm, the routers exchange link state messages. The link state
message of one router contains a list of items [neighbor, distance to neighbor]. That list specifies
the metric of the link to each neighbor. The metric is representative of the time the node takes to
send a message to that neighbor. In the simplest version, this metric is equal to one for all the links.
In a more involved implementation, the metric is a smaller number for faster links. The length of a
path is defined as the sum of the metrics of the links of that path.
Table 5.1: Routing table of router A.

Destination   Next Hop
B             B
C             C
D             B
Thus, every router sends a link state message to every other router. After that exchange, each
router has a complete view of the network, and it can calculate the shortest path from each node
to every other node in the network. The next hop along the shortest path to a destination depends
only on the destination and not on the source of the packet. Accordingly, the router can enter the
shortest paths in its routing table. For instance, the routing table of node A contains the information
shown in Table 5.1.
The top-left graph in Figure 5.8 shows nodes attached with links; the numbers on the links
represent their metric. The problem is to find the shortest path between pairs of nodes.
[Figure 5.8: six panels, P(1) through P(6): Final, each showing nodes A to F with the link metrics.
The distances from A written next to the nodes as they are added are 1, 2, 3, 3, and 5.]
Figure 5.8: Dijkstra's routing algorithm.
To find the shortest paths from node A to all the other nodes, Dijkstra's algorithm computes
recursively the set P(k) of the k-closest nodes to node A (shown with black nodes). The algorithm
starts with P(1) = {A}. To find P(2), it adds the closest node to the set P(1), here B, and writes
its distance to A, here 1. To find P(3), the algorithm adds the closest node to P(2), here E, and
writes its distance to A, here 2. The algorithm continues in this way, breaking ties according to a
deterministic rule, here favoring the lexicographic order, thus adding C before F. This is a one-pass
algorithm whose number of steps is the number of nodes.
After running the algorithm, each node remembers the next step along the shortest path to
each node and stores that information in a routing table. For instance, A's routing table specifies that
the shortest path to E goes first along link AB. Accordingly, when A gets a packet destined to E, it
sends it along link AB. Node B then forwards the packet to E.
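As an illustration, the following Python sketch runs Dijkstra's algorithm on the four-router network of Figure 5.7 (not the six-node graph of Figure 5.8, whose links are not listed in the text) and derives the routing table of router A. The function name and the data layout are assumptions made for this example. Ties are broken by node name, in line with the lexicographic rule mentioned above.

```python
import heapq

# Link states from Figure 5.7: router -> {neighbor: metric}.
LINK_STATES = {
    "A": {"B": 2, "C": 1},
    "B": {"A": 2, "D": 1},
    "C": {"A": 1, "D": 3},
    "D": {"B": 1, "C": 3},
}

def dijkstra_routing_table(graph, source):
    """Return ({node: distance}, {destination: next hop}) as seen from source."""
    dist = {source: 0}
    next_hop = {}
    done = set()
    # Heap entries: (distance, node, first hop on the path from source).
    heap = [(0, source, None)]
    while heap:
        d, node, first = heapq.heappop(heap)
        if node in done:
            continue  # stale entry: node was already finalized
        done.add(node)
        if first is not None:
            next_hop[node] = first
        for nbr, metric in graph[node].items():
            nd = d + metric
            if nbr not in done and nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                # The first hop is the neighbor itself when leaving the source.
                heapq.heappush(heap, (nd, nbr, nbr if node == source else first))
    return dist, next_hop

distances, table = dijkstra_routing_table(LINK_STATES, "A")
print(table)  # next hops for B, C, D, matching Table 5.1
```

Each node is finalized once, in order of increasing distance from the source, which mirrors the construction of the sets P(k).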
5.3.2 BELLMAN-FORD AND DISTANCE VECTOR
Figure 5.9 shows the operations of a distance vector algorithm. The routers regularly send to their
neighbors their current estimate of the shortest distance to the other routers. The figure only considers
destination D, for simplicity. Initially, only node D knows that its distance to D is equal to zero, and
it sends that estimate to its neighbors B and C. Router B adds the length 1 of the link BD to the
estimate 0 it got from D and concludes that its estimated distance to D is 1. Similarly, C estimates
the distance to D to be equal to 3 + 0 = 3. Routers B and C send those estimates to A. Node A
then compares the distances 2 + 1 of going to D via node B and 1 + 3 that corresponds to going to
C first. Node A concludes that the estimated shortest distance to D is 3. At each step, the routers
Figure 5.9: Distance vector routing algorithm.
remember the next hop that achieves the shortest distance and indicate that information in their
routing table. The figure shows these shortest paths as thick lines.
For instance, the routing table at router A is the same as with the link state algorithm and is
shown in Table 5.1.
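The exchange of estimates described for Figure 5.9 can be sketched as follows. This is a simplified, synchronous version for the single destination D on the network of Figure 5.7. The names `distance_vector` and `neighbors_of`, and the round-based scheduling, are assumptions made for the illustration. Actual protocols exchange messages asynchronously.

```python
INF = float("inf")

# Undirected links of Figure 5.7 with their metrics.
LINKS = {("A", "B"): 2, ("A", "C"): 1, ("B", "D"): 1, ("C", "D"): 3}

def neighbors_of(links):
    """Build {node: {neighbor: metric}} from the undirected link list."""
    nbrs = {}
    for (u, v), metric in links.items():
        nbrs.setdefault(u, {})[v] = metric
        nbrs.setdefault(v, {})[u] = metric
    return nbrs

def distance_vector(links, dest):
    """Return ({node: distance to dest}, {node: next hop toward dest})."""
    nbrs = neighbors_of(links)
    dist = {node: INF for node in nbrs}
    dist[dest] = 0
    next_hop = {}
    # With N nodes, N - 1 synchronous rounds are enough on a static graph.
    for _ in range(len(nbrs) - 1):
        advertised = dict(dist)  # estimates sent to neighbors in this round
        for node in nbrs:
            if node == dest:
                continue
            for nbr, metric in nbrs[node].items():
                candidate = metric + advertised[nbr]
                if candidate < dist[node]:
                    dist[node] = candidate
                    next_hop[node] = nbr
    return dist, next_hop

dist, hops = distance_vector(LINKS, "D")
print(dist["A"], hops["A"])  # A's estimate of its distance to D and its next hop
```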
Bad News Travels Slowly
One difficulty with this algorithm is that it may take a long time to converge to new shortest paths
when a link fails, as the example below shows. The top of Figure 5.10 shows three nodes A, B, C
(on the right) attached with links with metric 1. The Bellman-Ford algorithm has converged to the
estimates of the lengths of the shortest paths to destination C. At that point, the link between nodes
B and C breaks down.
The middle part of the figure shows the next iteration of the Bellman-Ford algorithm. Node
A advertises its estimate 2 of the shortest distance to node C and node B realizes that the length of
the link BC is now infinite. Node B calculates its new estimate of the shortest distance to C as 1 +
2 = 3, which is the length of the link from B to A plus the distance 2 that A advertises from A to
C. Upon receiving that estimate 3 from B, A updates its estimate to 1 + 3 = 4. The bottom part of
the figure shows the next iteration of the algorithm. In the subsequent iterations, the estimates keep
increasing.
We explain how to handle such slow convergence in the next section.
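The following short sketch reproduces the arithmetic of this example. It assumes, as in the text, that after the failure B recomputes its estimate from A's advertisement and that A then recomputes its own from B's. The function name is hypothetical.

```python
def count_to_infinity(iterations):
    """Estimates of the distance to C after the link B-C fails.

    Returns a list of (estimate at A, estimate at B), one pair per iteration.
    """
    est_a, est_b = 2, 1  # converged estimates before the failure
    history = []
    for _ in range(iterations):
        est_b = 1 + est_a  # B's only remaining neighbor is A
        est_a = 1 + est_b  # A still believes the path through B exists
        history.append((est_a, est_b))
    return history

for est_a, est_b in count_to_infinity(3):
    print(f"A: {est_a}, B: {est_b}")
```

Each iteration raises both estimates by 2, so the nodes never learn on their own that C is unreachable.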
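[Figure 5.10: nodes A, B, C in a line. Estimates of the distance to C at (A, B, C): (2, 1, 0) at the
top, (4, 3, 0) in the middle, and (6, 5, 0) at the bottom.]
Figure 5.10: Bad news travels slowly when nodes use the Bellman-Ford algorithm.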
Convergence
Assume that the graph does not change and that the nodes use the Bellman-Ford algorithm. More
precisely, say that at step $n \geq 0$ of the algorithm each node $i$ has an estimate $d_n(i)$ of its
shortest distance to some fixed destination $i_0$. Initially, $d_0(i) = \infty$ for $i \neq i_0$ and
$d_0(i_0) = 0$. At step $n + 1$, some node $i$ receives the estimate $d_n(j)$ from one of its neighbors $j$.
At that time, node $i$ updates its estimate as

$$d_{n+1}(i) = \min\{d_n(i),\ d(i, j) + d_n(j)\}.$$

The order in which the nodes send messages is arbitrary. However, one assumes that each node i
keeps sending messages to every one of its neighbors.
The claim is that if the graph does not change, then $d_n(i) \to d(i)$, where $d(i)$ is the minimum
distance from node $i$ to node $i_0$. To see why this is true, observe that $d_n(i)$ can only decrease
after $i$ has received one message. Also, as soon as messages have been sent along the shortest path
from $i_0$ to $i$, $d_n(i) = d(i)$.
Note that this algorithm does not converge to the true values if it starts with different initial
values. For instance, if it starts with $d_0(i) = 0$ for all $i$, then the estimates do not change.
Consequently, for the estimates to converge again after some link breaks or increases its length, the
algorithm has to be modified. One approach is as follows. Assume that node $i$ receives an estimate
from neighbor $j$ that is larger than the previous estimate it got from that node. Then node $i$
resets its estimate to $\infty$ and sends that estimate to its neighbors. In this way, all the nodes reset
their estimates and the algorithm restarts with $d_0(i) = \infty$ for $i \neq i_0$ and $d_0(i_0) = 0$. The
algorithm then converges, as we saw earlier.
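The role of the initial condition can be checked numerically. The sketch below applies the update rule repeatedly on the network of Figure 5.7 with destination D, once from the proper initial values and once from all zeros. The fixed message order and the function name are arbitrary choices made for the illustration. Any order in which every node keeps sending would do.

```python
INF = float("inf")

# Undirected links of Figure 5.7 with their metrics; the destination is D.
LINKS = {("A", "B"): 2, ("A", "C"): 1, ("B", "D"): 1, ("C", "D"): 3}

def run_updates(links, initial, passes=4):
    """Apply d(i) <- min(d(i), d(i, j) + d(j)) over every directed link, repeatedly."""
    directed = [(i, j, m) for (i, j), m in links.items()]
    directed += [(j, i, m) for (i, j), m in links.items()]
    est = dict(initial)
    for _ in range(passes):
        for i, j, metric in directed:  # node i receives d(j) from neighbor j
            est[i] = min(est[i], metric + est[j])
    return est

proper = run_updates(LINKS, {"A": INF, "B": INF, "C": INF, "D": 0})
zeros = run_updates(LINKS, {"A": 0, "B": 0, "C": 0, "D": 0})
print(proper)  # converges to the true distances to D
print(zeros)   # stuck at zero: the estimates can only decrease
```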
5.4 ANYCAST, MULTICAST
When a source sends a message to a single destination, we say that the message is unicast. In some
applications, one may want to send a message to any node in a set. For instance, one may need to
find one of many servers that have the same information or to contact any member of a group. In
such a case, one says that the message must be anycast. In other applications, all members of a group
must get the message. This is called multicast. In particular, if the message must be sent to all other
nodes, we say that it is broadcast. We explain anycast and multicast routing.
5.4.1 ANYCAST
Figure 5.11 shows a graph with two shaded nodes. The anycast routing problem is to find the
shortest path from every other node to any one of the shaded nodes. One algorithm is identical to
the Bellman-Ford algorithm for unicast routing. For a node i, let d(i) be the minimum distance
from that node to the anycast set. That is, d(i) is the minimum distance to any node in the anycast
set. If we denote by d(i, j) the metric of link (i, j), then it is clear that

$$d(i) = \min_j \{d(i, j) + d(j)\}$$

Figure 5.11: Shortest paths (thick lines) to set of shaded nodes.
where the minimum is over the neighbors j of node i. This is precisely the same relation as for unicast.
The only difference is the initial condition: for anycast, d(k) = 0 for all nodes k in the anycast set.
Another algorithm is Dijkstra's shortest path algorithm that stops when it reaches one of the
shaded nodes. Recall that this algorithm calculates the shortest paths from any given node to all
the other nodes, in order of increasing path length. When it first finds one of the target nodes, the
algorithm has calculated the shortest path from the given node to the set.
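The early-stopping variant of Dijkstra's algorithm can be sketched as follows. Because the links of Figure 5.11 are not listed in the text, the example reuses the four-router network of Figure 5.7 with a hypothetical anycast set {C, D}. The function name is also an assumption.

```python
import heapq

# Network of Figure 5.7, reused here as a stand-in for Figure 5.11.
GRAPH = {
    "A": {"B": 2, "C": 1},
    "B": {"A": 2, "D": 1},
    "C": {"A": 1, "D": 3},
    "D": {"B": 1, "C": 3},
}

def nearest_anycast_member(graph, source, anycast_set):
    """Return (member, distance) for the member of anycast_set closest to source."""
    best = {source: 0}
    done = set()
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node in anycast_set:
            # Nodes are finalized in order of increasing distance,
            # so the first member reached is the closest one.
            return node, d
        for nbr, metric in graph[node].items():
            nd = d + metric
            if nd < best.get(nbr, float("inf")):
                best[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return None, float("inf")  # no member of the set is reachable

print(nearest_anycast_member(GRAPH, "A", {"C", "D"}))
print(nearest_anycast_member(GRAPH, "B", {"C", "D"}))
```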
5.4.2 MULTICAST
Figure 5.12 shows a graph. Messages from A must be multicast to the two shaded nodes. The left
part of the figure shows a tree rooted at A whose leaves are the shaded nodes and whose sum of
link lengths is minimized over all such trees. This tree is called the Steiner tree (named after Jacob
Steiner). The right part of the figure shows the tree of shortest paths from A to the shaded nodes.
Note that the sum of the lengths of the links of the tree on the right (6) is larger than the sum of
the lengths of the links of the Steiner tree (5).
Finding a Steiner tree is NP-hard. In practice, one uses approximations. One multicast protocol
for the Internet uses the tree of shortest paths.
Figure 5.12: Minimum weight (Steiner) tree (left) and tree of shortest paths from A (right).
The most efficient approach is to use one Steiner multicast tree for each multicast source. In
practice, one may use a shared tree for different sources. For instance, one could build a Steiner tree
from San Francisco to the major metropolitan areas in the US. If the source of a multicast is in Oakland,
it can connect to that tree. This is less efficient than a Steiner tree from Oakland, but the algorithm
is much less complex.
5.4.3 FORWARD ERROR CORRECTION
Using retransmissions to make multicast reliable is not practical. Imagine multicasting a file or a
video stream to 1000 users. If one packet gets lost on a link of the multicast tree, hundreds of users
may miss that packet. It is not practical for them all to send a negative acknowledgment. It is not
feasible either for the source to keep track of positive acknowledgments from all the users.
A simpler method is to send additional packets to make the transmission reliable. This scheme
is called a packet erasure code because it is designed to be able to recover from "erasures" of packets,
that is, from packets being lost in the network. For instance, say that you want to send packets P1
and P2 of 1 KByte to a user, but that it is likely that one of the two packets could get lost. To improve
reliability, one can send {P1, P2, C} where C is the addition bit by bit, modulo 2, of the packets P1
and P2. If the user gets any two of the packets {P1, P2, C}, it can recover the packets P1 and P2.
For instance, if the user gets P1 and C, it can reconstruct packet P2 by adding P1 and C bit by bit,
modulo 2.
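Bit-by-bit addition modulo 2 is the exclusive-or (XOR) operation. A minimal sketch of the {P1, P2, C} example, with made-up 16-byte packet contents and a hypothetical helper `xor_bytes`, could look like this:

```python
def xor_bytes(x, y):
    """Add two equal-length packets bit by bit, modulo 2."""
    if len(x) != len(y):
        raise ValueError("packets must have the same length")
    return bytes(a ^ b for a, b in zip(x, y))

# Made-up packet contents for the illustration.
p1 = bytes(range(16))
p2 = bytes(range(100, 116))
c = xor_bytes(p1, p2)  # the extra packet C = P1 + P2 (modulo 2)

# Any two of {P1, P2, C} are enough to rebuild the third.
print(xor_bytes(p1, c) == p2)  # P2 recovered from P1 and C
print(xor_bytes(p2, c) == p1)  # P1 recovered from P2 and C
```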
This idea extends to n packets {P1, P2, . . . , Pn} as follows: one calculates each of the packets
{C1, C2, . . . , Cm} as the sum bit by bit modulo 2 of a randomly selected set of packets in
{P1, P2, . . . , Pn}. The header of packet Ck specifies the subset of {1, 2, . . . , n} that was used to
calculate Ck. If m is sufficiently large, one can recover the original packets from any n of the packets
{C1, C2, . . . , Cm}.
One decoding algorithm is very simple. It proceeds as follows:
If one of the Ck, say Cj, is equal to one of the packets {P1, . . . , Pn}, say Pi, then Pi has been
recovered. One then adds Pi to all the packets Cr that used that packet in their calculation.
One removes the packet Cj from the collection and one repeats the procedure.
If at one step one does not find a packet Cj that involves only one Pi, the procedure fails.
[Figure 5.13: graph linking the packets P1, . . . , P4 to the packets C1, . . . , C7 calculated from them.]
Figure 5.13: Calculation of FEC packets.
Figure 5.13 illustrates the procedure. The packets {P1, . . . , P4} are to be sent. One calculates
the packets {C1, . . . , C7} as indicated in the figure. The meaning of the graph is that a packet Cj
is the sum of the packets Pi it is attached to, bit by bit modulo 2. For instance,
C1 = P1, C2 = P1 + P2, C3 = P1 + P2 + P3,
and so on.
Now assume that the packets C1, C3, C5, C6 (shown shaded in Figure 5.13) are received
by the destination. Following the algorithm described above, one first looks for a packet Cj that is
equal to one of the Pi. Here, one sees that C1 = P1. One then adds P1 to all the packets Cj that
used that packet. Thus, one replaces C3 by C3 + C1. One removes P1 from the graph and one is
left with the graph shown in the left part of Figure 5.14.
Figure 5.14: Updates in decoding algorithm.
The algorithm then continues as shown in the right part of the figure. Summing up, one finds
successively that P1 = C1, P4 = C5, P3 = C5 + C6, and P2 = C1 + C3 + C5 + C6.
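The decoding steps of this example can be sketched in Python. The compositions C1 = P1 and C3 = P1 + P2 + P3 are given in the text. C5 = P4 and C6 = P3 + P4 are inferred from the results P4 = C5 and P3 = C5 + C6. The function names and the packet contents are assumptions made for the illustration.

```python
def xor_bytes(x, y):
    """Add two equal-length packets bit by bit, modulo 2."""
    return bytes(a ^ b for a, b in zip(x, y))

def peel_decode(received):
    """Decode a list of (set of source indices, payload) pairs.

    Returns {index: payload} in order of recovery.
    Raises ValueError if no remaining packet involves exactly one source packet.
    """
    pending = [(set(idx), data) for idx, data in received]
    recovered = {}
    while pending:
        # Look for a coded packet that is equal to a single source packet Pi.
        pos = next((k for k, (idx, _) in enumerate(pending) if len(idx) == 1), None)
        if pos is None:
            raise ValueError("decoding failed: no packet involves only one Pi")
        idx, data = pending.pop(pos)
        (i,) = idx
        recovered[i] = data
        # Add Pi to every remaining packet that used it in its calculation.
        updated = []
        for other_idx, other_data in pending:
            if i in other_idx:
                other_idx = other_idx - {i}
                other_data = xor_bytes(other_data, data)
            if other_idx:  # drop packets that no longer carry information
                updated.append((other_idx, other_data))
        pending = updated
    return recovered

# Made-up contents for P1..P4, and the four received packets of the example.
P = {i: bytes([i]) * 8 for i in (1, 2, 3, 4)}
C1 = P[1]
C3 = xor_bytes(xor_bytes(P[1], P[2]), P[3])
C5 = P[4]
C6 = xor_bytes(P[3], P[4])
decoded = peel_decode([({1}, C1), ({1, 2, 3}, C3), ({4}, C5), ({3, 4}, C6)])
print(list(decoded))  # order in which the packets are recovered
```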
In practice, one chooses a distribution on the number D of packets Pi that are used to calculate
each Cj. For instance, say that D is equally likely to be 1, 2, or 3. To calculate C1, one generates
the random variable D with the selected distribution. One then picks D packets randomly from
{P1, . . . , Pn} and one adds them up bit by bit modulo 2. One repeats the procedure to calculate
C2, . . . , Cm. Simulations show that with m ≈ 1.05 × n, the algorithm has a good likelihood to
recover the original packets from any n of the packets {C1, . . . , Cm}. For instance, if n = 1000, one
needs to send 1050 packets so that any 1000 of these packets suffice to recover the original 1000
packets, with a high probability.
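The encoding procedure can be sketched as below, using the example distribution where D is equally likely to be 1, 2, or 3. The function name `encode`, the `seed` parameter, and the packet contents are assumptions. The sketch only builds the coded packets and does not try to reproduce the simulation results quoted above.

```python
import random

def xor_bytes(x, y):
    """Add two equal-length packets bit by bit, modulo 2."""
    return bytes(a ^ b for a, b in zip(x, y))

def encode(packets, m, degrees=(1, 2, 3), seed=None):
    """Build m coded packets as (subset of source indices, payload) pairs."""
    rng = random.Random(seed)
    n = len(packets)
    coded = []
    for _ in range(m):
        d = min(rng.choice(degrees), n)   # draw D from the chosen distribution
        subset = rng.sample(range(n), d)  # pick D distinct source packets
        payload = bytes(len(packets[0]))  # all-zero packet to start the sum
        for i in subset:
            payload = xor_bytes(payload, packets[i])
        # The subset plays the role of the header of the coded packet.
        coded.append((frozenset(subset), payload))
    return coded

source = [bytes([i]) * 8 for i in range(1, 11)]  # n = 10 made-up packets
coded = encode(source, m=12, seed=1)
print(len(coded), sorted(len(s) for s, _ in coded))
```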
5.4.4 NETWORK CODING
Network coding is an in-network processing method that can increase the throughput of multicast
transmissions in a network. We explain that possibility in Figure 5.15.
In the figure, source S sends packets to both Y and Z. b1 and b2 represent the bits to be
transmitted. Say that the rate of every link in the network is R packets per second. The figure on
the left has two parallel links from W to X. By using the coding in the intermediate node, as shown
Figure 5.15: Multicast from S to Y and Z without (left) and with (right) network coding.
in the right part of the figure, the network can deliver packets at the rate of 2R to both Y and Z.
Without network coding, as shown on the left, the network can only deliver one bit to Y and two bits
to Z (or, two bits to Y and one bit to Z). One can show that the network can achieve a throughput
of 1.5R to both Y and Z without network coding.
The general result about network coding and multicasting is as follows. Consider a general
network and assume that the feasible rate from S to any one node in N is at least R. Then, using
network coding, it is possible to multicast the packets from S to N at rate R. For instance, in the
network of Figure 5.15, the rate from S to Y is equal to 2R as we can see by considering the two
disjoint paths STY and SUWXY. Similarly, the rate from S to Z is also 2R. Thus, one can multicast
packets from S to Y and Z at rate 2R.
Network coding can also be useful in wireless networks, as the following simple example
shows. Consider the network in Figure 5.16. There are two WiFi clients X and Y that communicate
via access point Z. Assume that X needs to send packet A to Y and that Y needs to send packet B
to X.
[Figure 5.16: clients X and Y and access point Z. (1) X sends A to Z, (2) Y sends B to Z,
(3) Z broadcasts A ⊕ B to X and Y.]
Figure 5.16: Network coding for a WiFi network.
The normal procedure for X and Y to exchange A and B requires transmitting four packets
over the wireless channel: packet A from X to Z, packet A from Z to Y, packet B from Y to Z,
and packet B from Z to X. Using network coding, the devices transmit only three packets: packet
A from X to Z, packet B from Y to Z, and packet A ⊕ B (the bit by bit addition modulo two of A
and B) broadcast from Z to X and Y. Indeed, X knows A so that when it gets C = A ⊕ B, it can
recover B by calculating C ⊕ A. Similarly, Y gets A as A = C ⊕ B.
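A small sketch of this exchange, with made-up packet contents and hypothetical function names, counts the transmissions and checks that both clients recover the packet they need:

```python
def xor_bytes(x, y):
    """Add two equal-length packets bit by bit, modulo 2."""
    return bytes(a ^ b for a, b in zip(x, y))

def exchange_with_coding(packet_a, packet_b):
    """Return (packet decoded at X, packet decoded at Y, number of transmissions)."""
    transmissions = 2                      # A from X to Z, and B from Y to Z
    coded = xor_bytes(packet_a, packet_b)  # Z forms C = A xor B
    transmissions += 1                     # one broadcast of C to X and Y
    at_x = xor_bytes(coded, packet_a)      # X knows A, so it recovers B
    at_y = xor_bytes(coded, packet_b)      # Y knows B, so it recovers A
    return at_x, at_y, transmissions

a = b"packet A"
b = b"packet B"
print(exchange_with_coding(a, b))
```

Without coding, Z would forward A and B separately, for a total of four transmissions.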
This saving of one packet out of four requires delaying transmissions and keeping copies.
Also, the savings are smaller if the traffic between X and Y is not symmetric. Thus, although this
observation is cute, its benefits may not be worth the trouble in a real network.
5.5 AD HOC NETWORKS
An ad hoc network consists of a set of nodes that can talk directly to one another. For instance, WiFi
devices can be configured to operate in that mode instead of the prevalent infrastructure mode where
all communication takes place via an access point.
For instance, one can imagine cell phones communicating directly instead of going through
a cellular base station, or wireless sensor nodes relaying each other's information until it can reach
an Internet gateway. Currently (in 2009), there are very few ad hoc networks other than military
networks of tanks on a battlefield and mesh networks that interconnect a set of WiFi access points
with wireless links, a technology that looks promising but has met only moderate commercial success.
Routing through an ad hoc network is challenging, especially when the nodes are moving.
One can imagine two types of routing algorithm: proactive and reactive. A proactive algorithm
calculates paths in the background to have them available when required. A reactive algorithm
calculates paths only on demand, when needed. The tradeoff is between the excess traffic due to
path computation messages, the delay to find a good path, and the quality of the path. Hundreds of
papers compare many variations of protocols. We limit our discussion to a few examples.
5.5.1 AODV
The AODV (Ad Hoc On Demand Distance Vector, see (73)) routing protocol is on-demand. Essentially,
if node S wants to find a path to node D, it broadcasts a route request to its neighbors: "Hello,
I am looking for a path to D." If one neighbor knows a path to D, it replies to node S. Otherwise,
the neighbors forward the request. Eventually, replies come back with an indication of the number
of hops to the destination.
The messages contain sequence numbers so that the nodes can use only the most recent
information.
5.5.2 OLSR
The Optimized Link State Routing Protocol (OLSR, see (74)) is an adaptation of a link state protocol.
The idea is that link state messages are forwarded by a subset of the nodes instead of being flooded.
Thus, OLSR is a proactive algorithm quite similar to a standard link state algorithm.
5.5.3 ANT ROUTING
Ant routing algorithms are inspired by the way ants find their way to a food source. The ants deposit
some pheromone as they travel along the trail from the colony to the food and back. The pheromone
evaporates progressively. Ants tend to favor trails along which the scent of pheromone is stronger.
Consequently, a trail with a shorter round-trip time tends to have a stronger scent and to be selected
with a higher probability by subsequent ants. Some routing algorithms use similar mechanisms, and
they are called ant routing algorithms.
5.5.4 GEOGRAPHIC ROUTING
Geographic routing, as the name indicates, is based on the location of the nodes. There are many
variations of such schemes. The basic idea is as follows. Say that each node knows its location and
that of its neighbors. When it has to send a message to a given destination, a node selects the
neighbor closest (in terms of physical distance) to that destination and forwards the message to it.
There are many situations where this routing may end up in a dead end. In such a case, the node at
the end of the dead end can initiate a backtracking phase to attempt to discover an alternate path.
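The basic forwarding rule can be sketched as follows. This is one possible variant, not a specific published protocol: it treats a node as a dead end when none of its neighbors is closer to the destination than the node itself, and it does not implement backtracking. The node names, coordinates, and function name are made up for the example.

```python
import math

def greedy_route(positions, neighbors, source, dest):
    """Forward greedily toward dest; return (path, reached)."""
    path = [source]
    current = source
    while current != dest:
        here = math.dist(positions[current], positions[dest])
        # Select the neighbor that is physically closest to the destination.
        best = min(
            neighbors[current],
            key=lambda node: math.dist(positions[node], positions[dest]),
            default=None,
        )
        if best is None or math.dist(positions[best], positions[dest]) >= here:
            return path, False  # dead end: no neighbor makes progress
        path.append(best)
        current = best
    return path, True

# Hypothetical topology: S, U, V, T placed on a line.
positions = {"S": (0, 0), "U": (1, 0), "V": (2, 0), "T": (3, 0)}
neighbors = {"S": ["U"], "U": ["S", "V"], "V": ["U", "T"], "T": ["V"]}
print(greedy_route(positions, neighbors, "S", "T"))
```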
5.5.5 BACKPRESSURE ROUTING
Backpressure routing is a form of dynamic routing that automatically adapts to changing link
characteristics. We explore that mechanism in the chapter on models.
5.6 SUMMARY
Routing is the selection of the path that packets follow in the network. We explained the following
ideas:
For scalability and flexibility, the Internet uses a two-level routing scheme: nodes are grouped
into Autonomous Systems (domains), and the routing is decomposed into inter-domain and
intra-domain routing.
Inter-domain routing uses a path-vector protocol (BGP). This protocol can adapt to
inter-domain agreements of peering or transit. BGP may fail to converge if the policies of the
different domains are not consistent.
Intra-domain routing uses a distance vector protocol (based on the Bellman-Ford algorithm)
or a link state protocol (based on Dijkstra's algorithm).
The shortest path algorithms extend directly to anycast routing.
For multicast, the shortest (Steiner) tree is not the tree of shortest paths and its computation is
hard. One approach to make multicast reliable is to use a packet erasure code. Network coding
can in principle increase the throughput of a multicast tree.
Routing in ad hoc networks is challenging as the topology of the network and its link
characteristics fluctuate. We explained the main ideas behind AODV, OLSR, ant routing, and
geographic routing.
5.7 PROBLEMS
Figure 5.17: Figure for Routing Problem 1.
P5.1 (a) Run Dijkstra's algorithm on the following network to determine the routing table for
node 3.
(b) Repeat (a) using the Bellman-Ford algorithm.
Figure 5.18: Figure for Routing Problem 2.
P5.2 Consider the network configuration shown in Figure 5.18. Assume that each link has the same
cost.
(a) Run the Bellman-Ford algorithm on this network to compute the routing table for node
A. Show A's distances to all other nodes at each step.
(b) Suppose the link A-B goes down. As a result, A advertises a distance of infinity to B.
Describe in detail a scenario where C takes a long time to learn that B is no longer
reachable.
P5.3 Consider the network shown below and assume the following:
The network addresses of nodes are given by <AS>.<Network>.0.<node>, e.g., node
A has the address AS1.E1.0.A,
The bridge IDs satisfy B1 < B2 < B3,
H is not connected to AS2.E5 for part (a),
The BGP Speakers use the least-next-hop-cost policy for routing (i.e., among alternative
paths to the destination AS, choose the one that has the least cost on the first hop), and
The network topology shown has been stable for a long enough time to allow all the
routing algorithms to converge and all the bridges to learn where to forward each packet.
(a) What route, specified as a series of bridges and routers, would be used if G wanted to
send a packet to A?
(b) Now, if node H was added to AS2.E5, and D tried to send a packet to it as soon as H was
added, what would happen? Specifically, will the packet reach its destination, and on
which links and/or networks would the packet be sent?
(c) Starting from the network in (b), suppose AS2.R2 goes down. Outline in brief the routing
changes that would occur as a consequence of this failure. [Hint: Think about how the
change affects packets sent from AS1 to AS2, and packets sent from AS2 to AS1.]
P5.4 Consider a wireless network with nodes X and Y exchanging packets via an access point Z.
For simplicity, we assume that there are no link-layer acknowledgments. Suppose that X sends
packets to Y at rate 2R packets/sec and Y sends packets to X at rate R packets/sec; all the
packets are of the maximum size allowed. The access point uses network coding. That is,
whenever it can, it sends the "exclusive or" of a packet from X and a packet from Y instead of
sending the two packets separately.
(a) What is the total rate of packet transmissions by the three nodes without network coding?
(b) What is the total rate of packet transmissions by the three nodes with network coding?
Figure 5.19: Figure for Routing Problem 3.
5.8 REFERENCES
Peering and transit agreements are explained in (38). Dijkstra's algorithm was published in (14).
The Bellman-Ford algorithm is analyzed in (6). The QoS routing problems are studied in (41). The
oscillations of BGP are discussed in (19). BGP is described in (68). Network coding was introduced
in (2) that proved the basic multicasting result. The wireless example is from (27). Packet erasure
codes are studied in (45). AODV is explained in (73) and OLSR in (74). For ant-routing, see (13).
For a survey of geographic routing, see (49).
CHAPTER 6
Internetworking
The Internet is a collection of networks. The key idea is to interconnect networks that possibly use
different technologies, such as wireless, wired, and optical links.
In this chapter, we explain how the Internet interconnects Ethernet networks.
6.1 OBJECTIVE
The Internet protocols provide a systematic way for interconnecting networks. The general goal is
shown in Figure 6.1. The top part of the figure shows a collection of networks, each with its own
addressing and switching scheme, interconnected via routers.
Figure 6.1: Interconnected Networks.
The bottom of Figure 6.1 shows the networks as seen by the routers. The key point is that
the routers ignore the details of the actual networks. Two features are essential:
Connectivity: All the devices on any given network can send packets to one another. To do this,
a device encapsulates a packet with the appropriate format for the network and the suitable
addresses. The network takes care of delivering the packet.
Broadcast-Capability: Each network is broadcast-capable. For instance, router R1 in Figure
6.1 can send a broadcast packet to the left-most network. That packet will reach the devices
A, B, C in that network.
Examples of networks that can be interconnected with IP include Ethernet, WiFi, Cable
networks, and many more. We focus on Ethernet in this chapter, but it should be clear how to adapt
the discussion to other networks.
Figure 6.2 shows one Ethernet network attached to a router. The Ethernet network has the
two features needed for interconnection: connectivity and broadcast-capability. The bottom of the
figure shows that the router sees the Ethernet network as a link that connects it directly to all the
hosts on the Ethernet.
Figure 6.2: Ethernet network as seen by router.
6.2 BASIC COMPONENTS: MASK, GATEWAY, ARP
We illustrate the interconnection of Ethernet networks with the setup of Figure 6.3. Figure 6.3
shows two Ethernet networks: a network with devices that have MAC addresses e1, e2, and e4 and
another network with devices that have MAC addresses e3 and e5. The two networks are attached
via routers R1 and R2. A typical situation is when the two Ethernet networks belong to the same
organization. For instance, these two networks could correspond to different floors in Cory Hall, on
the U.C. Berkeley campus. The networks are also attached to the rest of the Internet via a port of
R2. The rest of the Internet includes some other Ethernet networks of the Berkeley campus in our
example.
6.2.1 ADDRESSES AND SUBNETS
In addition to a MAC address, each device is given an IP address, as shown in the figure. This address
is based on the location. Moreover, the addresses are organized into subnets. For instance, in Figure
6.3 the Ethernet network on the left is one subnet and the one on the right is another subnet. The
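[Figure 6.3: hosts H1 (e1, IP1) and H2 (e2, IP2) and router R1 (e4, IP4) on one Ethernet; host
H3 (e3, IP3) and router R2 (e5, IP5) on the other; R2 also connects to the Internet, where device D
has address IP6.]
Figure 6.3: Interconnected Ethernet Networks.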
IP addresses of the devices {e1, e2, e4} on the first subnet have a different prefix (initial bit string)
than the devices {e3, e5} on the second subnet.
The subnet mask specifies the number of leading bits in the IP addresses that are common to
all the device interfaces attached to that Ethernet network. For instance, say that the subnet masks
of the Ethernet networks are the first 24 bits. This means that two IP addresses IPi and IPj of
interfaces attached to the same Ethernet network have the same first 24 bits. If IPi and IPj are not
on the same Ethernet network, then their first 24 bits are not the same. In the figure, the first 24
bits of IP1, which we designate by IP1/24, are the same as IP2/24 but not the same as IP3/24.
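The prefix comparison can be tried with Python's standard `ipaddress` module. The text does not give numerical values for IP1, IP2, and IP3, so the addresses below are made up for the illustration, and the helper name `same_subnet` is an assumption.

```python
import ipaddress

def same_subnet(ip_a, ip_b, prefix_len=24):
    """Return True if the first prefix_len bits of the two addresses agree."""
    net_a = ipaddress.ip_interface(f"{ip_a}/{prefix_len}").network
    net_b = ipaddress.ip_interface(f"{ip_b}/{prefix_len}").network
    return net_a == net_b

# Hypothetical values standing in for IP1, IP2, and IP3.
IP1, IP2, IP3 = "10.0.1.1", "10.0.1.2", "10.0.2.3"
print(same_subnet(IP1, IP2))  # IP1/24 and IP2/24 agree
print(same_subnet(IP1, IP3))  # IP1/24 and IP3/24 differ
```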
6.2.2 GATEWAY
The gateway router of an Ethernet network is the router that connects that Ethernet network to
the rest of the Internet. For instance, the gateway router of the Ethernet network on the left (with
devices H1, H2) is R1. Host H1 knows the IP address of the interface of R1 on its Ethernet network.
The gateway router for the network on the right is R2.
6.2.3 DNS SERVER
The devices know the address of a local DNS server. For instance, if the Ethernet networks are in
Cory Hall, the local DNS server is that of the domain eecs.berkeley.edu. We discussed the basic
concepts behind DNS in the earlier chapters on the Internet architecture and principles.
6.2.4 ARP
The Address Resolution Protocol, or ARP, enables devices to find the MAC address of interfaces on
the same Ethernet that corresponds to a given IP address. Here is how it works. Say that host H1
wants to find the MAC address that corresponds to the IP address IP2. Host H1 sends a special
broadcast ARP packet on the Ethernet. That is, the destination address is the Broadcast MAC
Address represented by 48 ones. The packet has a field that specifies that it is an ARP request, and
it also specifies the IP address IP2. Thus, the ARP request has the form [all | e1 | who is IP2?], where
the addresses mean "to all devices, from e1." An Ethernet switch that receives a broadcast packet
repeats the packet on all its output ports. When a device gets the ARP request packet, it compares
its own IP address with the address IP2. If the addresses are not identical, the device ignores the
request. If IP2 is the IP address of the device, then it answers the request with an Ethernet packet
[e1 | e2 | I am IP2]. The source address of that reply tells H1 that e2 is the MAC address that
corresponds to IP2. Note that a router does not forward a packet with a broadcast Ethernet address.
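The request and reply can be mimicked with a toy model in which an Ethernet is a table from MAC addresses to IP addresses. This is only a sketch of the message exchange, not of the actual ARP packet format. The function name and the tuple layout [destination | source | content] are assumptions that follow the notation of the text.

```python
BROADCAST = "all"  # stands for the broadcast MAC address (48 ones)

def arp_resolve(ethernet, sender_mac, target_ip):
    """Return (request, reply) for an ARP lookup on one Ethernet.

    ethernet maps each attached MAC address to its IP address.
    reply is None when no device on this Ethernet owns target_ip.
    """
    request = (BROADCAST, sender_mac, f"who is {target_ip}?")
    # The switch repeats the broadcast, so every device sees the request.
    for mac, ip in ethernet.items():
        if ip == target_ip:
            # Only the owner of the address answers, directly to the sender.
            return request, (sender_mac, mac, f"I am {target_ip}")
    return request, None

left_ethernet = {"e1": "IP1", "e2": "IP2", "e4": "IP4"}
request, reply = arp_resolve(left_ethernet, "e1", "IP2")
print(request)
print(reply)  # the source address of the reply is the MAC address sought
```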
6.2.5 CONFIGURATION
Summarizing, to attach an Ethernet network to the Internet, each device must have an IP address,
know the subnet mask, know the IP address of the gateway router and the IP address of the DNS
server. In addition, each device must implement ARP.
6.3 EXAMPLES
We explore how a device sends a packet to another device on the subnets we discussed for the
interconnected networks shown in Figure 6.3. We examine two separate cases: First, when the
devices are on the same subnet; second, when they are on different subnets.
6.3.1 SAME SUBNET
We first examine how device H1 sends a packet to H2 assuming that H1 knows IP2 (see Figure
6.4). Since IP1/24 = IP2/24, H1 knows that IP2 is on the same Ethernet as H1. Accordingly, H1
needs to send the packet as an Ethernet packet directly to H2. To do this, H1 needs the MAC
address of IP2. It finds that address using ARP, as we explained above. Once it finds out e2, H1
forms the packet [e2 | e1 | IP1 | IP2 | X] where e2 is the destination MAC address, e1 is the source
MAC address, IP1 is the source IP address, IP2 is the destination IP address, and X is the rest of
the packet. Note that the IP packet [IP1 | IP2 | X] from H1 to H2 is encapsulated in the Ethernet
packet with the MAC addresses [e2 | e1].
6.3.2 DIFFERENT SUBNETS
We explain how H1 sends a packet to H3, assuming that H1 knows IP3. (See Figure 6.5.) Since
IP1/24 ≠ IP3/24, H1 knows that H3 is not on the same Ethernet as H1. H1 must then send the
packet first to the gateway. Using the IP address of the gateway R1, say IP4, H1 uses ARP to find
the MAC address e4. H1 then sends the packet [e4 | e1 | IP1 | IP3 | X] to R1. Note that this is the
IP packet [IP1 | IP3 | X] from H1 to H3 that is encapsulated in an Ethernet packet from H1 to R1.
When it gets that packet, R1 decapsulates it to recover [IP1 | IP3 | X], and it consults its routing
table to find the output port that corresponds to the destination IP3. R1 then sends that packet out
on that port to R2. On that link, the packet is encapsulated in the appropriate format. (The figure
shows a specific link header on that link.) When it gets the packet [IP1 | IP3 | X], router R2 consults
its routing table and finds that IP3 is on the same Ethernet as its interface e5. Using ARP, R2 finds
the MAC address e3 that corresponds to IP3, and it sends the Ethernet packet [e3 | e5 | IP1 | IP3
| X]. Note that the IP packet [IP1 | IP3 | X] is not modified across the different links but that its
encapsulation changes.
Figure 6.4: Sending on the same subnet. (1) [all | e1 | Who is IP2?]; (2) [e1 | e2 | I am IP2];
(3) [e2 | e1 | IP1 | IP2 | X].
Figure 6.5: Sending on a different subnet. (1) [e4 | e1 | IP1 | IP3 | X]; (2) [SH | IP1 | IP3 | X],
where SH is the specific link header between R1 and R2; (3) [all | e5 | Who is IP3?];
(4) [e5 | e3 | I am IP3]; (5) [e3 | e5 | IP1 | IP3 | X].
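The decision that H1 makes in the two cases above (deliver directly, or hand the packet to the gateway) can be sketched with Python's standard `ipaddress` module. The function name `next_hop` and the numeric addresses are assumptions chosen for illustration; the text only names the addresses IP1, IP2, IP3, and IP4.

```python
import ipaddress


def next_hop(src_ip: str, dst_ip: str, prefix_len: int, gateway_ip: str) -> str:
    """Return the IP address whose MAC address the sender must resolve with ARP."""
    # The sender's subnet is the set of addresses sharing its first prefix_len bits.
    subnet = ipaddress.ip_network(f"{src_ip}/{prefix_len}", strict=False)
    if ipaddress.ip_address(dst_ip) in subnet:
        return dst_ip        # same subnet: send the Ethernet packet directly
    return gateway_ip        # different subnet: send the packet to the gateway


# Hypothetical addresses: H1 = 18.1.1.2, gateway R1 = 18.1.1.1, subnet mask /24.
print(next_hop("18.1.1.2", "18.1.1.7", 24, "18.1.1.1"))   # destination on the same subnet
print(next_hop("18.1.1.2", "18.2.3.4", 24, "18.1.1.1"))   # destination on another subnet
```

In both cases the IP packet keeps the final destination address; only the MAC address used for the Ethernet encapsulation depends on the result.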
6.3.3 FINDING IP ADDRESSES
So far, we have assumed that H1 knows IP2 and IP3. What if H1 only knows the name of H2, but
not its IP address IP2? In that case, H1 uses DNS to find IP2. Recall that H1 knows the IP address
of the DNS server. Using the same procedure as the one we just explained, H1 can send a DNS
request to discover IP2, and similarly for IP3.
6.3.4 FRAGMENTATION
In general, an IP packet might need to be fragmented. Say that [IP1 | IP3 | X] has 1500 bytes, which
is allowed by Ethernet networks. Assume that the link between routers R1 and R2 can only transmit
500-byte packets (as could be the case for some wireless links). In that case, the network protocol IP
fragments the packet and uses the header of the packet to specify the packet identification and where
the fragments fit in the packet. It is up to the destination H3 to put the fragments back together.
The details of how the header specifies the packet and the fragment positions are not essential
for us. Let us simply note that each IP packet contains, in addition to the addresses, some control
information such as that needed for fragmentation and reassembly by the destination. All this
information is in the IP header which is protected by an error detection code.
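As a rough illustration of the 1500-byte example above, the sketch below computes where each fragment fits in the original packet. It assumes a 20-byte IP header without options and the IPv4 rule that every fragment except the last carries a multiple of 8 payload bytes; the function name `fragment` and the tuple layout are choices made for this example only.

```python
from typing import List, Tuple


def fragment(total_len: int, mtu: int, header_len: int = 20) -> List[Tuple[int, int, bool]]:
    """Split an IP packet into fragments that fit in `mtu` bytes.

    Returns one (offset_in_bytes, payload_bytes, more_fragments) tuple per fragment.
    """
    payload = total_len - header_len
    # Every fragment but the last must carry a multiple of 8 payload bytes.
    max_chunk = (mtu - header_len) // 8 * 8
    fragments = []
    offset = 0
    while offset < payload:
        size = min(max_chunk, payload - offset)
        more = offset + size < payload      # flag telling the destination more will follow
        fragments.append((offset, size, more))
        offset += size
    return fragments


for frag in fragment(1500, 500):
    print(frag)
```

Under these assumptions the 1500-byte packet becomes four fragments, and the offsets let the destination H3 put the pieces back in order.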
6.4 DHCP
When you need to attach your laptop to some Ethernet or a WiFi network, that laptop needs
an IP address consistent with that network. When you attach the laptop to another network, it
needs another IP address. Instead of reserving a permanent IP address for your laptop, the network
maintains a pool of addresses that it assigns when needed. The protocol is called Dynamic Host
Configuration Protocol (DHCP), and it works as follows.
When one attaches the laptop to the network, the laptop sends a DHCP request asking for an
IP address. The network has a DHCP server (commonly implemented by a router on the network)
that maintains the list of available IP addresses. The DHCP server then allocates an address to the
laptop. Periodically, the laptop sends a request to renew the lease on the IP address. If it fails to do
so, as happens when you turn off your laptop or you move it somewhere else, the server puts the
address back in the pool of available addresses.
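The pool and lease mechanism can be sketched as follows. This is a toy model of the bookkeeping only, not of the actual DHCP message exchange; the class `LeasePool`, its method names, and the addresses are assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple


class LeasePool:
    """Toy address pool with expiring leases."""

    def __init__(self, addresses: List[str], lease_time: float) -> None:
        self.free = list(addresses)
        self.lease_time = lease_time
        self.leases: Dict[str, Tuple[str, float]] = {}   # client id -> (address, expiry)

    def _expire(self, now: float) -> None:
        """Return to the pool every address whose lease was not renewed in time."""
        for client, (addr, expiry) in list(self.leases.items()):
            if expiry <= now:
                del self.leases[client]
                self.free.append(addr)

    def request(self, client: str, now: float) -> Optional[str]:
        """Allocate an address, or renew the lease that the client already holds."""
        self._expire(now)
        if client in self.leases:
            addr = self.leases[client][0]     # renewal keeps the same address
        elif self.free:
            addr = self.free.pop(0)
        else:
            return None                       # pool exhausted
        self.leases[client] = (addr, now + self.lease_time)
        return addr


pool = LeasePool(["10.0.0.2", "10.0.0.3"], lease_time=60.0)
print(pool.request("laptop", now=0.0))    # first free address
print(pool.request("phone", now=10.0))    # second free address
print(pool.request("tablet", now=20.0))   # None: no address left
print(pool.request("laptop", now=50.0))   # renewal before expiry
print(pool.request("tablet", now=80.0))   # the phone's lease has expired by now
```

The phone never renews, so its address returns to the pool and can be assigned to the tablet.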
DHCP is also commonly used by ISPs to assign addresses to the devices of their customers.
This mechanism reduces the number of addresses that the ISP must reserve.
Note that a device that gets an IP address with the DHCP protocol has a temporary address
that other devices do not know. Consequently, that device cannot be a server.
6.5 NAT
In the late 1990s, Internet engineers realized that one would run out of IP addresses in a few years.
They then developed a new addressing scheme that uses 128 bits instead of 32. However, these
new addresses require a new version of the Internet Protocol, which necessitates a considerable
amount of configuration work in thousands of routers. As explained above, DHCP is one mechanism
that reduces the number of IP addresses needed by allocating them temporarily instead of permanently.
Network Address Translation (NAT) is another scheme that enables the reuse of IP addresses.
Most home routers implement a NAT. With that device, the devices in the home network
use a set of IP addresses, called Private Addresses, that are also used by many other home networks.
Figure 6.6 explains how NAT works. The trick is to use the port numbers of the transport protocol.
Figure 6.6: Network Address Translation. Device IPb sends [IPb | IPx | TCPm | TCPn | ...]; the NAT
(address IPa) forwards [IPa | IPx | TCPb | TCPn | ...] and records [TCPb -> (IPb, TCPm)]. The reply
[IPx | IPa | TCPn | TCPb | ...] is translated back into [IPx | IPb | TCPn | TCPm | ...].
To be able to associate byte streams or datagrams with specific applications inside a computer, the
transport protocol uses a port number that it specifies in the transport header of the packet. Thus,
at the transport layer, a packet has the following structure:
[ source IP | destination IP | source port | destination port | ... | data ].
For instance, the HTTP protocol uses port 80 whereas email uses port 25.
The NAT uses the port number as follows. The private addresses of devices inside the home
are IPb, IPc. The NAT has an address IPa. Assume that IPb sends a packet [IPb | IPx | TCPm |
TCPn | ...] to a device with IP address IPx. In this packet, TCPm is the source port number and
TCPn the destination port number. The NAT converts this packet into [IPa | IPx | TCPb | TCPn |
...] where TCPb is chosen by the NAT device, which notes that TCPb corresponds to (IPb, TCPm).
When the destination with address IPx and port TCPn replies to this packet, it sends a packet [IPx
| IPa | TCPn | TCPb | ...]. When the NAT gets this packet, it maps back the port number TCPb
into the pair (IPb, TCPm), and it sends the packet [IPx | IPb | TCPn | TCPm | ...] to device IPb.
Note that this scheme works only when the packet is initiated inside the home. It is not
possible to send a request directly to a device such as IPb, only to reply to requests from that device.
Some web servers exist that maintain a connection initiated by a home device so that it is reachable
from outside.
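The translation table that the NAT maintains can be sketched as follows. The class `Nat`, the packet tuple layout, and the starting port 50000 are assumptions for illustration; a real NAT also handles timeouts, several protocols, and port exhaustion.

```python
from typing import Dict, Optional, Tuple

# (source IP, destination IP, source port, destination port)
Packet = Tuple[str, str, int, int]


class Nat:
    """Toy NAT that rewrites source addresses and ports of outgoing packets."""

    def __init__(self, public_ip: str, first_port: int = 50000) -> None:
        self.public_ip = public_ip
        self.next_port = first_port
        self.out_map: Dict[Tuple[str, int], int] = {}   # (private IP, port) -> public port
        self.in_map: Dict[int, Tuple[str, int]] = {}    # public port -> (private IP, port)

    def outbound(self, pkt: Packet) -> Packet:
        """Rewrite a packet leaving the home and remember the mapping."""
        src_ip, dst_ip, src_port, dst_port = pkt
        key = (src_ip, src_port)
        if key not in self.out_map:
            self.out_map[key] = self.next_port
            self.in_map[self.next_port] = key
            self.next_port += 1
        return (self.public_ip, dst_ip, self.out_map[key], dst_port)

    def inbound(self, pkt: Packet) -> Optional[Packet]:
        """Rewrite a reply, or drop it when no inside device initiated the exchange."""
        src_ip, dst_ip, src_port, dst_port = pkt
        if dst_ip != self.public_ip or dst_port not in self.in_map:
            return None
        private_ip, private_port = self.in_map[dst_port]
        return (src_ip, private_ip, src_port, private_port)


nat = Nat("IPa")
print(nat.outbound(("IPb", "IPx", 1234, 80)))   # source becomes (IPa, 50000)
print(nat.inbound(("IPx", "IPa", 80, 50000)))   # reply is sent to (IPb, 1234)
print(nat.inbound(("IPx", "IPa", 80, 50001)))   # None: no mapping exists
```

The last call illustrates why an outside device cannot open an exchange with IPb: the NAT has no table entry for that port.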
6.6 SUMMARY
The main function of the Internet is to interconnect networks such as Ethernet, WiFi, and cable
networks. We explained the main mechanisms.
The Internet interconnects local area networks, each with its specific addressing technology and
broadcast capability.
When using subnetting, the IP addresses of the hosts on one subnet share the same prefix
whose length is determined by a subnet mask. Each host knows the address of the router to exit the
subnet and the address of a local DNS server. Subnetting compares the prefixes of the
addresses and uses ARP to translate the IP address into a local network address.
When a computer attaches to the network, it may get a temporary IP address using DHCP.
The NAT enables the use of replicated IP addresses in the Internet.
6.7 PROBLEMS
P6.1 (a) How many IP addresses need to be leased from an ISP to support a DHCP server that
uses NAT to service N clients, if every client uses at most P ports?
(b) If M unique clients request an IP address every day from the above mentioned DHCP
server, what is the maximum lease time allowable to prevent new clients from being
denied access assuming that requests are uniformly spaced throughout the day, and that
the addressing scheme used supports a max of N clients?
P6.2 Below is the DNS record for a fictitious corporation, OK Computer:
Table 6.1: Table for Internetworking Problem 2.
Name                     Type   Value                    TTL (Seconds)
okcomputer.com           A      164.32.15.98             86400 (1 day)
okcomputer.com           NS     thom.yorke.net           86400
okcomputer.com           NS     karma.okcomputer.com     86400
okcomputer.com           MX     android.okcomputer.com   60
lucky.okcomputer.com     A      164.32.12.8              86400
www.okcomputer.com       CNAME  lucky.okcomputer.com     86400
android.okcomputer.com   A      164.32.15.99             86400
(a) If you type http://www.okcomputer.com into your web browser, to which IP address will
your web browser connect?
(b) If you send an e-mail to thom@okcomputer.com, to which IP address will the message
get delivered?
(c) The TTL field refers to the maximum amount of time a DNS server can cache the
record. Give a rationale for why most of the TTLs were chosen to be 86400 seconds (1
day) instead of a shorter or a longer time, and why the MX record was chosen to have a
60-second TTL?
6.8 REFERENCES
The Address Resolution Protocol is described in (62), subnetting in (64), DHCP in (70), and NAT
in (72).
CHAPTER 7
Transport
The transport layer supervises the end-to-end delivery across the Internet between a process in a
source device and a process in a destination device. The transport protocol of the Internet implements two
transport services: a connection-oriented reliable byte stream delivery service and a connectionless
datagram delivery service. This chapter explains the main operations of this layer.
7.1 TRANSPORT SERVICES
The network layer of the Internet (the Internet Protocol, IP) provides a basic service of packet
delivery from one host to another host. This delivery is not reliable. The transport layer adds a few
important capabilities such as multiplexing, error control, congestion control, and flow control, as
we explain in this chapter.
Figure 7.1 shows the protocol layers in three different Internet hosts. The transport layer
defines ports that distinguish information flows. These are logical ports, not to be confused with
the physical ports of switches and routers. The host on the left has two ports p1 and p2 to which
application processes are attached. Port p2 in that host is communicating with port p1 in the middle
host. The protocol HTTP is attached to port p1 in the middle host.
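Figure 7.1: The transport layer supervises the delivery of information between ports in different hosts.
(Hosts A, B, and C each have Application, Transport, and IP layers; the ports are p1, p2 on one host,
p1, p2, p3 on the middle host, and p1, p2 on the third; applications shown: HTTP, DNS, RA;
example packet: [A | B | p1 | p2 | ...].)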
The ports are identified by a number from 1 to 65,535 and are of three types: well-known
ports (1-1,023) that correspond to fixed processes, registered ports (1,024-49,151) that have been
registered by companies for specific applications, and the dynamic and/or private ports
(49,152-65,535) that can be assigned dynamically. For instance, the basic email protocol SMTP is
attached to port 25, HTTP to port 80, the real time streaming protocol RTSP to port 554, the game
Quake Wars to port 7133, Pokemon Netbattle to port 30,000, traceroute to 33,434, and so on.
Thus, at the transport layer, information is delivered from a source port in a host with a source
IP address to a destination port in a host with some other IP address.
The transport layer implements two protocols between two hosts: UDP (the User Datagram
Protocol) and TCP (the Transmission Control Protocol) with the following characteristics:
UDP delivers individual packets. The delivery is not reliable.
TCP delivers a byte stream. The byte stream is reliable: the two hosts arrange for retransmission
of packets that fail to arrive correctly. Moreover, the source regulates the delivery to control
congestion in routers (congestion control) and in the destination device (flow control).
Summarizing, an information flow is identified by the following set of parameters: (source IP
address, source transport port, destination IP address, destination transport port, protocol), where the
protocol is UDP or TCP. The port numbers make it possible to multiplex the packets of different
applications that run on a given host.
7.2 TRANSPORT HEADER
Figure 7.2 shows the header of every UDP packet. The header specifies the port numbers and the
length of the UDP packet (including UDP header and payload, but not the IP header). The header
has an optional UDP checksum calculated over the UDP packet.
Figure 7.2: The header of UDP packets contains the control information. Each header row is 32 bits
wide (bits 0 to 31): Source port (bits 0-15) | Destination port (bits 16-31); UDP length | UDP checksum;
followed by the Payload (variable).
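The UDP header layout described above (four 16-bit fields followed by the payload) can be sketched with Python's standard `struct` module. The function names and port values are assumptions for this example, and the checksum is left at 0, which over IPv4 indicates that no checksum was computed.

```python
import struct
from typing import Tuple


def build_udp_datagram(src_port: int, dst_port: int, payload: bytes) -> bytes:
    """Pack the four 16-bit header fields in network byte order, then the payload."""
    length = 8 + len(payload)      # UDP header (8 bytes) plus payload, not the IP header
    checksum = 0                   # optional checksum left unset in this sketch
    return struct.pack("!HHHH", src_port, dst_port, length, checksum) + payload


def parse_udp_header(datagram: bytes) -> Tuple[int, int, int, int]:
    """Return (source port, destination port, UDP length, checksum)."""
    return struct.unpack("!HHHH", datagram[:8])


datagram = build_udp_datagram(5353, 53, b"hello")
print(len(datagram), parse_udp_header(datagram))
```

The length field counts the 8 header bytes plus the payload, which matches the description of the UDP length given above.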
Each TCP packet has a header shown in Figure 7.3. In this header, the sequence number,
acknowledgement, and advertised window are used by the Go Back N protocol explained later in
this chapter. The Flags are as follows:
SYN, FIN: establishing/terminating a TCP connection;
ACK: set when Acknowledgement field is valid;
URG: urgent data; Urgent Pointer says where non-urgent data starts;
PUSH: don't wait to fill segment;
RESET: abort connection.
Figure 7.3: The header of TCP packets contains the control information. Fields, in order: Source Port,
Destination Port; Sequence Number; Acknowledgment; HdrLen, Reserved, Flags, Advertised Window;
Checksum, Urgent Pointer; Options (variable); Payload (variable).
The precise functions of the main fields are described below. The UDP header is a stripped-down
version of a TCP header since its main function is multiplexing through port numbers.
7.3 TCP STATES
A TCP connection goes through the following phases: setup, exchange data, close. (See Figure 7.4.)
The setup goes as follows. The client sets up a connection with a server in three steps:
The client sends a SYN packet (a TCP packet with the SYN flag set identifying it as a
SYNchronization packet). The SYN packet specifies a random number X.
The server responds to the SYN with a SYN.ACK (i.e., by setting both SYN and ACK flags)
that specifies a random number Y and an ACK with sequence number X + 1 (see footnote 1).
The client sends the first data packet with sequence number X + 1 and an ACK with sequence
number Y + 1.
The hosts then exchange data and acknowledge each correct data packet, either in a data
packet or in a packet that contains only an ACK. When host A has sent all its data, it closes its
connection to the other host, host B, by sending a FIN; A then waits for a FIN.ack. Eventually,
host B also closes its connection by sending a FIN and waiting for a FIN.ack. After sending its
FIN.ack, host A waits to make sure that host B does not send a FIN again that host A would then
acknowledge. This last step is useful in case the FIN.ack from A gets lost.
Footnote 1: For simplicity, we talk about the packet sequence numbers. However, TCP actually
implements the sequence numbers as the ones associated with the bytes transported.
Figure 7.4: The phases and states of a TCP connection. (States of hosts A and B: Closed, Listen,
SYN sent, SYN received, Established, FIN Wait-1, Close Wait, FIN Wait-2, Last ACK, Timed Wait,
Closed. Messages: SYN, SYN ACK, DATA ACK, ACK, FIN, FIN.ACK. Note (1): A waits in case B
retransmits FIN and A must ack again.)
7.4 ERROR CONTROL
When using TCP, the hosts control errors by retransmitting packets that are not acknowledged
before a timeout. We first describe a simplistic scheme called Stop-and-Wait and then explain the
Go Back N mechanism of TCP.
7.4.1 STOP-AND-WAIT
The simplest scheme of retransmission after timeout is stop-and-wait. When using that scheme, the
source sends a packet, waits for up to T seconds for an acknowledgment, retransmits the packet if the
acknowledgment fails to arrive, and moves on to the next packet when it gets the acknowledgment.
Even if there are no errors, the source can send only one packet every T seconds, and T has to be as
large as a round-trip time across the network. A typical round-trip time in the Internet is about 40 ms.
If the packets have 1,500 bytes, this corresponds to a rate equal to 1,500 × 8/0.04 = 300 Kbps. If
the link rates are larger, this throughput can be increased by allowing the source to send more than
one packet before it gets the first acknowledgment. That is precisely what Go Back N does.
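The arithmetic above can be checked with a short sketch. The function name `window_throughput_bps` and its parameter `n_packets` are assumptions for this example; with `n_packets = 1` the function gives the stop-and-wait rate, and larger values show the gain from sending several packets per round-trip time when there are no losses.

```python
def window_throughput_bps(packet_bytes: int, rtt_s: float, n_packets: int = 1) -> float:
    """Throughput in bits per second when n_packets are delivered per round-trip time."""
    return n_packets * packet_bytes * 8 / rtt_s


# Stop-and-wait with 1,500-byte packets and a 40 ms round-trip time: about 300 Kbps.
print(window_throughput_bps(1500, 0.04))
# Four packets per round-trip time: four times that rate.
print(window_throughput_bps(1500, 0.04, n_packets=4))
```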
7.4.2 GO BACK N
The basic version of TCP uses the following scheme called Go Back N and illustrated in Figure 7.5.
The top part of the figure shows a string of packets with sequence numbers 1, 2, ... that the source
(on the left) sends to the destination (on the right). In that part of the figure, the destination has
received packets 1, 2, 3, 4, 5, 7, 8, 9. It is missing packet 6.
The scheme specifies that, at any time, the source can have sent up to N packets that have not
yet been acknowledged (N = 4 in the figure). When it gets a packet, the destination sends an ACK
with the sequence number of the next packet it expects to receive, in order. Thus, if the receiver gets
the packets 1, 2, 3, 4, 5, 7, 8, 9, it sends the ACKs 2, 3, 4, 5, 6, 6, 6, 6 as it receives those packets. If
an ACK for a packet fails to arrive after some time, the source retransmits that packet and possibly
the subsequent N - 1 packets.
The source slides its window of size N so that it starts with the first packet that has not been
acknowledged in sequence. This window, referred to as Transmit Window in the figure, specifies
the packets that the source can transmit. Assume that the source retransmits packet 6 and that this
packet arrives successfully at the receiver. The destination then sends back an acknowledgment with
sequence number 10. When it gets this ACK, the source moves its transmission window to [10, 11,
12, 13] and sends packet 10. Note that the source may retransmit packets 7, 8, 9 right after packet
6 before the ACK 10 arrives.
Figure 7.5: Go Back N Protocol. (Top: the Transmit Window at the source covers packets 6 to 9 while
the receiver holds packets 1 to 5 and 7 to 9 and expects packet 6 next. Bottom: after packet 6 is
retransmitted, the receiver expects packet 10 and the window moves to packets 10 to 13.)
If there are no errors, Go Back N sends N packets every round trip time (assuming that the
N transmission times take less than one round trip time). Thus, the throughput of this protocol is
N times larger than that of Stop-and-Wait.
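The cumulative ACKs that the receiver sends in the example above can be reproduced with a short sketch. It assumes, as in the example, that the receiver keeps the packets that arrive out of order; the function name `cumulative_acks` is an assumption for illustration.

```python
from typing import Iterable, List


def cumulative_acks(arrivals: Iterable[int], first_expected: int = 1) -> List[int]:
    """ACK sent for each arriving packet: the next in-order packet still expected."""
    received = set()
    expected = first_expected
    acks = []
    for seq in arrivals:
        received.add(seq)
        while expected in received:     # advance past every packet now in order
            expected += 1
        acks.append(expected)
    return acks


# Packet 6 is missing: the receiver keeps asking for it.
print(cumulative_acks([1, 2, 3, 4, 5, 7, 8, 9]))
# Once the retransmitted packet 6 arrives, the ACK jumps to 10.
print(cumulative_acks([1, 2, 3, 4, 5, 7, 8, 9, 6]))
```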
7.4.3 SELECTIVE ACKNOWLEDGMENTS
Go Back N may retransmit packets unnecessarily, as shown in Figure 7.6. The figure shows the
packets and acknowledgments transmitted by the source and receiver, respectively, along with the
corresponding packet or acknowledgment numbers. The top of the figure shows Go Back N with a
window of size 4 when the second packet gets lost. After some time, the sender retransmits packets
2, 3, 4 and transmits packet 5.
The destination had already received packets 3 and 4, so that these retransmissions are unnecessary;
they add delay and congestion in the network. To prevent such unnecessary retransmissions,
a version of TCP uses selective acknowledgments, as illustrated in the bottom part of the figure. When
using this scheme, the receiver sends a positive ACK for all the packets it has received correctly.
Thus, when it gets packet 3, the receiver sends an ACK that indicates it received packets 1 and 3.
When the sender gets this ACK, it retransmits packet 2, and so on. The figure shows that using
selective acknowledgments increases the throughput of the connection.
Figure 7.6: Comparing Go Back N (top) and selective ACKs (bottom). (Top: the source sends packets
1, 2, 3, 4, then 2, 3, 4, 5, and receives the ACKs 2, 2, 2. Bottom: the source sends packets 1, 2, 3, 4,
then 2, 5, 6, 7, 8, and the ACKs list the packets received: 1; then 1 and 3; then 1, 3, and 4.)
A special field in the SYN packet indicates if the sender accepts selective ACKs. The receiver
then indicates in the SYN.ACK if it will use selective ACKs. In that case, fields in the TCP header
of ACK packets indicate up to four blocks of contiguous bytes that the receiver got correctly. For
instance, the fields could contain the information [3; 1001, 3000; 5001, 8000; 9001, 12000] which
would indicate that the receiver got all the bytes from 1001 to 3000, also from 5001 to 8000, and
from 9001 to 12000. The length field that contains the value 3 indicates that the remaining fields
specify three blocks of contiguous bytes. See (69) for details.
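To see what the sender can deduce from such blocks, the sketch below lists the byte ranges that are missing between them. It follows the inclusive notation of the example above rather than the exact encoding of the TCP option; the function name `missing_ranges` and the parameter `first_byte` (the first byte the sender is still waiting to have acknowledged) are assumptions.

```python
from typing import List, Tuple

Block = Tuple[int, int]   # inclusive byte range (first byte, last byte)


def missing_ranges(blocks: List[Block], first_byte: int) -> List[Block]:
    """Gaps before and between the received blocks: candidates for retransmission."""
    gaps = []
    next_needed = first_byte
    for start, end in sorted(blocks):
        if start > next_needed:
            gaps.append((next_needed, start - 1))
        next_needed = max(next_needed, end + 1)
    return gaps


blocks = [(1001, 3000), (5001, 8000), (9001, 12000)]
print(missing_ranges(blocks, first_byte=1001))
```

With the three blocks of the example, the gaps are the bytes 3001 to 5000 and 8001 to 9000, which are the only bytes the sender needs to retransmit.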
7.4.4 TIMERS
How long should the source wait for an acknowledgment? Since the round-trip time varies greatly
from one connection to another, the timeout value must be adapted to the typical round-trip time of the
connection. To do this, the source measures the round-trip times $T_n$, $n = 1, 2, \ldots$ for the successive
packets. That is, $T_n$ is the time between the transmission of the n-th packet and the reception of the
corresponding acknowledgment. (The source ignores the packets that are not acknowledged in this
estimation.) The source then calculates the average value $A_n$ of the round-trip times $\{T_1, \ldots, T_n\}$
and the average deviation $D_n$ around that mean value. It then sets the timeout value to $A_n + 4D_n$.
The justification is that it is unlikely that a round-trip time exceeds $A_n + 4D_n$ if the acknowledgment
arrives at all.
The source calculates $A_n$ and $D_n$ recursively as exponential averages defined as follows ($b < 1$
is a parameter of the algorithm):
$$A_{n+1} = (1 - b)A_n + bT_{n+1} \quad \text{and} \quad D_{n+1} = (1 - b)D_n + b\,|T_{n+1} - A_{n+1}|, \quad n \geq 1,$$
with $A_1 = T_1$ and $D_1 = T_1$.
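The recursion above can be sketched in a few lines. The function name `timeout_values` and the sample round-trip times are assumptions for illustration; the update rule and the initial values follow the formulas of this section, including the use of the updated average when computing the deviation.

```python
from typing import Iterable, List


def timeout_values(rtts: Iterable[float], b: float) -> List[float]:
    """Timeout A_n + 4 D_n after each measured round-trip time T_n."""
    result = []
    avg = dev = None
    for t in rtts:
        if avg is None:
            avg, dev = t, t                           # A_1 = T_1 and D_1 = T_1
        else:
            avg = (1 - b) * avg + b * t               # A_{n+1}
            dev = (1 - b) * dev + b * abs(t - avg)    # D_{n+1}, using the updated A_{n+1}
        result.append(avg + 4 * dev)
    return result


# Hypothetical round-trip times in milliseconds, with b = 0.5.
print(timeout_values([100.0, 100.0, 200.0], b=0.5))
```

A smaller value of b makes both averages react more slowly to a sudden change in the measured times.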
Figure 7.7: Exponential averaging. (Plot of the round-trip times $T_n$ versus $n$, on a vertical scale
from 0 to 200, together with $A_n$ for b = 0.5 and $A_n$ for b = 0.1.)
Figure 7.7 shows the exponential averaging of times $T_n$ indicated by squares on the graph for
different values of the parameter b. As the graph shows, when b is small, $A_n$ calculates the average
of the $T_n$ over a longer period of time and is less sensitive to the recent values of those times.
7.5 CONGESTION CONTROL
In the Internet, multiple flows share links. The devices do not know the network topology, the
bandwidth of the links, nor the number of flows that share links. The challenge is to design a
congestion control scheme that enables the different flows to share the links in a reasonably fair way.
7.5.1 AIMD
The TCP congestion algorithm is AIMD (additive increase multiplicative decrease). This scheme
attempts to share the links fairly among the connections that use them.
Consider two devices A and B shown in Figure 7.8 that send flows with rates x and y
that share a link with bandwidth C. The sources increase x and y additively as long as they receive
acknowledgments. When they miss acknowledgments, they suspect that a router had to drop packets,
which happens here soon after x + y exceeds C. The sources then divide their rate by 2 and resume
increasing them. Following this procedure, say that the initial pair of rates (x, y) corresponds to the
point 1 in the right part of the figure. Since the sum of the rates is less than C, the link buffer does
not accumulate a backlog, the sources get the acknowledgments, and they increase their rate linearly
over time, at the same rate (ideally). Accordingly, the pair of rates follows the line segment from
point 1 to point 2. When the sum of the rates exceeds C, the buffer arrival rate exceeds the departure
rate and the buffer starts accumulating packets. Eventually, the buffer has to drop arriving packets
when it runs out of space to store them. A while later, the sources notice missing acknowledgments
and they divide their rate by a factor 2, so that the pair of rates jumps to point 3, at least if we assume
that the sources divide their rate at the same time. The process continues in that way, and one sees
that the pair of rates eventually approaches the set S where the rates are equal.
The scheme works for an arbitrary number of flows. The sources do not need to know the
number of flows nor the rate of the link they share. You can verify that a scheme that would increase
the rates multiplicatively or decrease them additively would not converge to equal rates. You can also
check that if the rates do not increase at the same pace, the limiting rates are not equal.
Figure 7.8: Additive Increase, Multiplicative Decrease. (Left: sources A and B send at rates x and y
over a link of bandwidth C. Right: the pair of rates (x, y) moves through points 1, 2, 3, 4 toward
the set S.)
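The convergence toward equal rates can be observed with a small simulation. This is an idealized sketch under the same assumptions as the discussion above: both sources increase at the same pace and halve their rates at the same time. The function name `aimd`, the step size, and the numeric rates are assumptions for illustration.

```python
from typing import Tuple


def aimd(x: float, y: float, capacity: float, step: float = 1.0,
         rounds: int = 200) -> Tuple[float, float]:
    """Two flows add `step` per round and halve together when x + y exceeds capacity."""
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2            # multiplicative decrease
        else:
            x, y = x + step, y + step      # additive increase
    return x, y


x, y = aimd(5.0, 60.0, capacity=100.0)
print(x, y, abs(x - y))
```

The additive steps leave the difference between the rates unchanged while each halving divides it by 2, which is why the pair of rates approaches the set S.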
7.5.2 REFINEMENTS: FAST RETRANSMIT AND FAST RECOVERY
Assume that the destination gets the packets 11, 12, 13, 15, 16, 17, 18. It then sends the ACKs 12,
13, 14, 14, 14, 14, 14. As it gets duplicate ACKs, the source realizes that packet 14 has failed to
arrive. The fast retransmit scheme of TCP starts retransmitting a packet after three duplicate ACKs,
thus avoiding having to wait for a timeout.
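The detection of three duplicate ACKs can be sketched as follows. The function name `fast_retransmits` and the `threshold` parameter are assumptions for illustration; the ACK sequence is the one from the example above.

```python
from typing import Iterable, List


def fast_retransmits(acks: Iterable[int], threshold: int = 3) -> List[int]:
    """Sequence numbers retransmitted after `threshold` duplicate ACKs."""
    retransmitted = []
    last_ack, duplicates = None, 0
    for ack in acks:
        if ack == last_ack:
            duplicates += 1
            if duplicates == threshold:
                retransmitted.append(ack)   # the ACK names the missing packet
        else:
            last_ack, duplicates = ack, 0   # a new ACK resets the duplicate count
    return retransmitted


print(fast_retransmits([12, 13, 14, 14, 14, 14, 14]))
```

The first ACK 14 is the original one; the retransmission of packet 14 is triggered by the third copy that follows it.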
The fast recovery scheme of TCP is designed to avoid having to wait a round-trip time after
having divided the window size by 2. To see how this scheme works, first note how the normal
(i.e., without fast recovery) scheme works (see the top part of Figure 7.9). Assume that the receiver
gets the packets 99, 100, 102, 103, ..., 132 and sends the ACKs 100, 101, 101, 101, ..., 101.
Assume also that the window size is 32 when the sender sends packet 132. When the source gets
the third duplicate ACK with sequence number 101, it reduces its window to 32/2 = 16 and starts
retransmitting 101, 102, 103, ..., 117. Note that the source has to wait one round-trip time after
it sends the copy of 101 to get the ACK 133 and to be able to transmit 133, 134, and so on.
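Figure 7.9: TCP without (top) and with Fast Recovery. (In both parts, the third duplicate ACK (3rd DA)
for 101 arrives after packet 132 was sent, the copy of 101 is sent, and W becomes 32/2 = 16. Without
fast recovery, the source must wait before sending 133. With fast recovery, it sends 133 through 147
while waiting and then sends 148.)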
A better scheme, called fast recovery, is shown in the bottom part of Figure 7.9. With this
scheme, the source sends packets 101, 133, 134, ..., 147 by the time it gets the ACK of the copy
of 101 (with sequence number 133). In that way, there are exactly 15 unacknowledged packets
(133, 134, ..., 147). With the new window of 32/2 = 16, the source is allowed to transmit packet
148 immediately, and the normal window operation can resume with the window size of 16. Note
that the idle waiting time is eliminated with this scheme.
To keep track of what to do, the source modifies its window as follows. When it gets the third
duplicate ACK 101, the sender sends a copy of 101 and changes its window size from W = 32 to
W/2 + 3 = 32/2 + 3 = 19. It then increases its window by one whenever it gets another duplicate
ACK of 101. Since the source already got the original and three duplicate ACKs from the W = 32
packets 100, 102, 103, ..., 132, it will receive another W - 4 = 28 duplicate ACKs and will increase
its window by W - 4 to W/2 + 3 + (W - 4) = W + W/2 - 1 = 47, and it can send all the packets
up to 147 since the last ACK it received is 100 and the set of unacknowledged packets is then
{101, 102, ..., 147}. Once the lost packet is acknowledged, the window size is set to W/2, and since
now there are W/2 - 1 outstanding packets, the source can send the next packet immediately.
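The window arithmetic of this example can be checked with a short sketch. The function name `fast_recovery_trace` and its return layout are assumptions; the computation follows the steps described above, with the window counted in packets.

```python
from typing import Tuple


def fast_recovery_trace(w: int, lost: int) -> Tuple[int, int, int]:
    """Return (inflated window, highest packet that can be sent, window after recovery)."""
    window = w // 2 + 3                     # set on the third duplicate ACK
    remaining_dups = w - 4                  # duplicate ACKs still to arrive
    window += remaining_dups                # +1 for each additional duplicate ACK
    highest_sendable = lost + window - 1    # the window starts at the lost packet
    return window, highest_sendable, w // 2


print(fast_recovery_trace(32, lost=101))
```

For W = 32 and a lost packet 101, the inflated window is 47 packets, which allows the source to send up to packet 147 before the window is set back to 16.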
7.5.3 ADJUSTING THE RATE
As long as no packet gets lost, the sliding window protocol sends N packets every round trip time
(the time to get an ACK after sending a packet). TCP adjusts the rate by adjusting the window size
N. The basic idea is to approximate AIMD. Whenever it gets an ACK, the sender replaces N by
N +1/N. In a round-trip time, since it gets N ACKs, the sender ends up adding approximately 1/N
about N times during that round-trip time, thus increasing the window size by 1 packet. Thus, the
source increases its rate by about 1 packet per round trip time every round trip time. This increase in
the window size enables the sources to take full advantage of the available bandwidth as the number
of connections changes in the network. When the source misses an ACK, it divides its window size
by 2. Note that the connections with a shorter round trip time increase their rate faster than others,
which results in an unfair advantage.
This scheme might take a long time to increase the window size to the acceptable rate on a fast
connection. To speed up the initial phase, the connection starts by doubling its window size every
round trip time, until it either misses an ACK or the window size reaches a threshold (as discussed
in the next section). This scheme is called Slow Start. To double the window size in a round-trip
time, the source increases the window size by 1 every time it gets an ACK. Thus, if the window size
was N, the source should get N ACKs in the next round-trip time and increase the window by N,
to 2N. When the source misses an ACK, after a timeout, it restarts the slow start phase with the
window size of 1.
When using selective acknowledgments, the window can grow/shift without worrying about
the lost packets. That is, if the window size is 150 and if the acknowledgments indicate that the
receiver got bytes [0, 1199], [1250, 1999], [2050, 2999], then the source can send bytes [1200, 1249],
[2000, 2049], [3000, 3049]. Note that the gap between the first byte that was not acknowledged (1200)
and the last byte that can be sent (3049) is larger than the window size, which is not possible when
using the Go Back N with cumulative acknowledgments. The rules for adjusting the window are
the same as with cumulative acknowledgments.
7.5.4 TCP WINDOW SIZE
Figure 7.10 shows how the TCP window size changes over time. During the slow start phase (labeled
SS in the figure), the window increases exponentially fast, doubling every round-trip time. When a
timeout occurs when the window size is $W_0$, the source remembers the value of $W_0/2$ and restarts
with a window size equal to 1 and doubles the window size every round-trip time until the window
size reaches $W_0/2$. After that time, the protocol enters a congestion avoidance phase (labeled CA).
During the CA phase, the source increases the window size by one packet every round-trip time.
If the source sees three duplicate ACKs, it performs a fast retransmit and fast recovery. When a
timeout occurs, the source restarts a slow start phase.
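The alternation between slow start and congestion avoidance can be sketched as a trace of the window size, one value per round-trip time. This simplified model handles timeouts only (fast retransmit and fast recovery are left out); the function name `window_trace`, the set of rounds in which a timeout occurs, and the initial threshold are assumptions for illustration.

```python
from typing import List, Set


def window_trace(rounds: int, timeouts: Set[int], initial_threshold: float = 64.0) -> List[float]:
    """Window size (in packets) at the start of each round-trip time.

    `timeouts` holds the indices of the rounds in which a timeout occurs.
    """
    window, threshold = 1.0, initial_threshold
    trace = []
    for r in range(rounds):
        trace.append(window)
        if r in timeouts:
            threshold = max(window / 2, 1.0)       # remember half of the current window
            window = 1.0                           # restart slow start
        elif window < threshold:
            window = min(2 * window, threshold)    # slow start: double per round trip
        else:
            window += 1.0                          # congestion avoidance: +1 per round trip
    return trace


print(window_trace(12, timeouts={5}, initial_threshold=16.0))
```

After the timeout in round 5, the window restarts at 1, doubles until it reaches half of its value at the timeout, and then grows by one packet per round-trip time.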
7.5.5 TERMINOLOGY
The various modifications of TCP received new names. The original version, with Go Back N,
is called TCP Tahoe. With fast retransmit, the protocol is called TCP Reno. When fast recovery
is included, the protocol is called TCP New Reno; it is the most popular implementation. When
selective ACKs are used, the protocol is TCP-SACK. There are also multiple other variations that
have limited implementations.
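Figure 7.10: Evolution in time of the TCP window. (The window W starts at 1 and is at most 64KB;
the phases are labeled SS and CA; the events are labeled TO for a timeout and 3DA for three
duplicate ACKs; the levels marked X0.5 are half of the window size at the time of the event.)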
7.6 FLOW CONTROL
Congestion control is the mechanism to prevent saturating routers. Another mechanism, called flow
control, prevents saturating the destination of an information flow and operates as follows in TCP.
The end host of a TCP connection sets aside some buffer space to store the packets it receives until
the application reads them. When it sends a TCP packet, that host indicates the amount of free
buffer space it currently has for that TCP connection. This quantity is called the Receiver Advertised
Window (RAW). The sender of that connection then calculates RAW - OUT where OUT is the
number of bytes that it has sent to the destination but for which it has not received an ACK. That
is, OUT is the number of outstanding bytes in transit from the sender to the receiver. The quantity
RAW - OUT is the number of bytes that the sender can safely send to the receiver without risking
overflowing the receiver buffer.
The sender then calculates the minimum of RAW - OUT and its current congestion window
and uses that minimum to determine the packets it can transmit. This adjustment of the TCP window
combines congestion control and flow control.
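The rule above can be written as a one-line computation. The sketch follows the description in this section literally (the minimum of RAW - OUT and the congestion window); the function name `sendable_bytes`, the clamp at zero, and the byte counts are assumptions for illustration.

```python
def sendable_bytes(raw: int, out: int, cwnd: int) -> int:
    """Bytes the sender may transmit now: min(RAW - OUT, congestion window), never negative."""
    return max(0, min(raw - out, cwnd))


print(sendable_bytes(raw=8000, out=3000, cwnd=4000))   # limited by the congestion window
print(sendable_bytes(raw=8000, out=7000, cwnd=4000))   # limited by the receiver buffer
```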
7.7 SUMMARY
The Transport Layer is a set of end-to-end protocols that supervise the delivery of packets.
The transport layer offers two services: reliable byte stream (with TCP) and unreliable packet
delivery (with UDP) between ports.
A TCP connection goes through the following phases: Open (with a three-way handshake:
SYN, SYN.ACK, then ACK and start); Data exchange (with data and acks); half-close (with
FIN and FIN.ACK); then a second half-close (with FIN, timed wait, and FIN.ACK).
The error control in TCP uses a sliding window protocol based on Go Back N with either
cumulative or selective ACKs.
Flow control uses the receiver-advertised window.
Congestion control attempts to share the links equally among the active connections. The
mechanism is based on additive increase, multiplicative decrease. The timers are adjusted
based on the average of the round trip times and the average fluctuation around the average
value.
Refinements include fast retransmit after three duplicate ACKs and fast recovery by increasing
the window size while waiting for the ACK of a retransmitted packet.
7.8 PROBLEMS
P7.1 (a) Suppose you and two friends named Alice and Bob share a 200 Kbps DSL connection
to the Internet. You need to download a 100 MB file using FTP. Bob is also starting a
100 MB file transfer, while Alice is watching a 150 Kbps streaming video using UDP.
You have the opportunity to unplug either Alice's or Bob's computer at the router, but you
can't unplug both. To minimize the transfer time of your file, whose computer should
you unplug and why? Assume that the DSL connection is the only bottleneck link, and
that your connection and Bob's connection have a similar round trip time.
(b) What if the rate of your DSL connection were 500 Kbps? Again, assuming that the DSL
connection were the only bottleneck link, which computer should you unplug?
P7.2 Suppose Station A has an unlimited amount of data to transfer to Station E. Station A uses
a sliding window transport protocol with a fixed window size. Thus, Station A begins a new
packet transmission whenever the number of unacknowledged packets is less than W and any
previous packet being sent from A has finished transmitting.
The size of the packets is 10000 bits (neglect headers). So, for example, if W > 2, station A
would start sending packet 1 at time t = 0, and then would send packet 2 as soon as packet 1
finished transmission, at time t = 0.33 ms. Assume that the speed of light is $3 \times 10^8$ meters/sec.
(a) Suppose station B is silent, and that there is no congestion along the acknowledgement
path from C to A. (The only delay acknowledgements face is the propagation delay to
and from the satellite.) Plot the average throughput as a function of window size W.
What is the minimum window size that A should choose to achieve a throughput of 30
Mbps? Call this value $W^*$.
Figure 7.12: Figure for Transport Problem 3.
P7.3 As shown in the figure, flows 1 and 2 share a link with capacity C = 120 Kbps. There is no
other bottleneck. The round trip time of flow 1 is 0.1 s and that of flow 2 is 0.2 s. Let $x_1$ and
$x_2$ denote the rates obtained by the two flows, respectively. The hosts use AIMD to regulate
their flows. That is, as long as $x_1 + x_2 < C$, the rates increase linearly over time: the window
of a flow increases by one packet every round trip time. Rates are estimated as the window size
divided by the round-trip time. Assume that as soon as $x_1 + x_2 > C$, the hosts divide their
rates $x_1$ and $x_2$ by a factor of 1.1.
(a) Draw the evolution of the vector $(x_1, x_2)$ over time.
(b) What is the approximate limiting value for the vector?
P7.4 Consider a TCP connection between a client C and a server S.
(a) Sketch a diagram of the window size of the server S as a function of time.
(b) Using the diagram, argue that the time to transmit N packets from S to C is approximately
equal to a + bN for large N.
(c) Explain the key factors that determine the value of b in that expression for an Internet
connection between two computers that are directly attached by an Ethernet link.
(d) Repeat question (c) when one of the computers is at UCB and the other one at MIT.
7.9 REFERENCES
The basic specification of TCP is in (63). The analysis of AIMD is due to Chiu and Jain (see (11)
and (12)). The congestion control of TCP was developed by Van Jacobson in 1988 (22). See (71)
for a discussion of the refinements such as fast retransmit and fast recovery.
CHAPTER 8
Models
The goal of this more mathematically demanding chapter is to explain some recent insight into the
operations of network protocols. Two points are noteworthy. First, TCP is an approximation of a
distributed algorithm that maximizes the total utility of the active connections. Second, a new class
of backpressure protocols optimizes the scheduling, routing, and congestion control in a unified way.
8.1 THE ROLE OF LAYERS
The protocols regulate the delivery of information in a network by controlling errors, congestion,
routing, and the sharing of transmission channels. As we learned in Chapter 2, the original view was
that these functions are performed by different layers of the network: layer 4 for error and congestion
control, 3 for routing, and 2 for the medium access control. This layer decomposition was justified
by the structure that it imposes on the design and, presumably, because it simplifies the solution of
the problem.
However, some suspicion exists that the forced decomposition might result in a loss of
efficiency. More worrisome is the possibility that the protocols in the different layers might interact
in a less than constructive way. To address this interaction of layers, some researchers even proposed
some cross-layer designs that sometimes resembled strange Rube Goldberg contraptions.
Recently, a new understanding of control mechanisms in networks emerged through a series
of remarkable papers. Before that work, schemes like TCP or CSMA/CA seemed to be clever but
certainly ad hoc rules for controlling packet transmissions: slow down if the network seems congested.
For many years, nobody really thought that such distributed control rules were approaching optimal
designs. The control community by and large did not spend much time exploring TCP.
The new understanding starts by formulating a global objective: maximize the total utility
of the flows in the network. The next step is to analyze the problem and show that, remarkably, it
decomposes into simpler problems that can be thought of as different protocols such as congestion
control and routing. One might argue that there is little point in such after-the-fact analysis. However,
the results of the analysis yield some surprises: improved protocols that increase the utility of
the network. Of course, it may be a bit late to propose a new TCP or a new BGP. However, the
new protocols might finally produce multi-hop wireless networks that work. Over the last decade,
researchers have been developing multiple heuristics for routing, scheduling, and congestion control
in multi-hop wireless networks. They have been frustrated by the poor performance of those protocols.
A common statement in the research community is that "three hops = zero throughput." The
protocols derived from the theoretical analysis hold the promise of breaking this logjam. Moreover,
the insight that the results provide is valuable because it shows that protocols do not have to look
like fairly arbitrary rules but can be derived systematically.
8.2 CONGESTION CONTROL
Chapter 7 explained the basic mechanism that the Internet uses to control congestion. Essentially,
the source slows down when the network gets congested. We also saw that additive increase,
multiplicative decrease has a chance to converge to a fair sharing of one link by multiple connections. A
major breakthrough in the last ten years has been a theoretical understanding of how it is possible
for a distributed congestion control protocol to achieve such a fair sharing in a general network. We
explain that understanding in this section.
8.2.1 FAIRNESS VS. THROUGHPUT
Figure 8.1 shows a network with three flows and two nodes A and B. Link a from node A to node
B has capacity 1 and so does link b out of node B. We denote the rates of the three flows by $x_1$, $x_2$,
and $x_3$. The feasible values of the rates $(x_1, x_2, x_3)$ are nonnegative and such that
$$g_a(x) := x_1 + x_2 - 1 \le 0 \quad \text{and} \quad g_b(x) := x_1 + x_3 - 1 \le 0.$$
Figure 8.1: Three flows in a simple network.
For instance, (0, 1, 1) is feasible and one can check that these rates achieve the maximum value 2 of
the total throughput $x_1 + x_2 + x_3$ of the network. Of course, this set of rates is not fair to the user
of flow 1 who does not get any throughput. As another example, the rates (1/2, 1/2, 1/2) are as
fair as possible but achieve only a total throughput equal to 1.5. This example illustrates that quite
typically there is a tradeoff between maximizing the total throughput and fairness.
To balance these objectives, let us assign a utility $u(x_i)$ to each flow i (i = 1, 2, 3), where u(x)
is a concave increasing function of x > 0. The utility u(x) reflects the value to the user of the rate x
of the connection. The value increases with the rate. The concavity assumption means that there is
a diminishing return on larger rates. That is, increasing the rate by 1 Kbps is less valuable when the
rate is already large.
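We choose the rates $x = (x_1, x_2, x_3)$ that maximize the sum $f(x) = u(x_1) + u(x_2) + u(x_3)$
of the utilities of the flows. For instance, say that
$$u(x) = \begin{cases} \dfrac{1}{1-\alpha}\, x^{1-\alpha}, & \text{for } \alpha \neq 1 \\ \log(x), & \text{for } \alpha = 1. \end{cases} \qquad (8.1)$$
For $\alpha \neq 1$, the derivative of this function is $x^{-\alpha}$, which is positive and decreasing in x, so that
u is increasing and concave. The rates then solve the problem
$$\text{Maximize } f(x) \text{ subject to } g_a(x) \le 0 \text{ and } g_b(x) \le 0. \qquad (8.2)$$
Since f increases in each $x_i$, both constraints hold with equality at the maximum. Writing
$x := x_1$, one has $x_2 = x_3 = 1 - x$ and $f = u(x) + 2u(1 - x)$. Setting the derivative of this
expression with respect to x equal to zero gives
$$x^{-\alpha} - 2(1 - x)^{-\alpha} = 0.$$
Hence,
$$x = 2^{-1/\alpha}(1 - x), \quad \text{so that} \quad x = \frac{1}{1 + 2^{1/\alpha}}.$$
Note that $(x, 1 - x, 1 - x)$ goes from the maximum throughput rates (0, 1, 1) to the perfectly fair
rates (1/2, 1/2, 1/2) as $\alpha$ goes from 0 to $\infty$.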
Figure 8.2 shows the values of $x_1$, $x_2$, $x_3$ that maximize the sum of utilities as a function of
$\alpha$. As the graphs show, as $\alpha$ goes from 0 to $\infty$, the corresponding rates go from the values that
maximize the sum of the rates to those that maximize the minimum rate.
With the specific form of the function $u(\cdot)$ in (8.1), the maximizing rates x are said to be
$\alpha$-fair. For $\alpha = 0$, the objective is to maximize the total throughput, since in that case u(x) = x and
$f(x) = x_1 + x_2 + x_3$.
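For $\alpha \to \infty$, the rates that maximize f(x) maximize the minimum of the rates $\{x_1, x_2, x_3\}$. To
see this, note that $u'(x) = x^{-\alpha} \gg u'(y) = y^{-\alpha}$ if $x < y$ and $\alpha \gg 1$. Consequently, if
$x_1 < x_2, x_3$, one can increase f(x) by replacing $x_1$ by $x_1 + \epsilon$, $x_2$ by $x_2 - \epsilon$, and $x_3$ by
$x_3 - \epsilon$ for a small $\epsilon > 0$. This change does not violate the capacity constraints and increases
f(x) by approximately $\epsilon[u'(x_1) - u'(x_2) - u'(x_3)]$, which is positive if $\alpha$ is large enough.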
For a general network, this argument shows that if $\alpha \gg 1$, the rates x that maximize f(x)
must be such that it is not possible to increase some $x_i$ without decreasing some $x_j$ that is smaller
than $x_i$. Rates with that property are called the max-min fair rates. To see why only max-min rates
can maximize f(x) when $\alpha \gg 1$, assume that one can replace $x_i$ by $x_i + \epsilon$ by only decreasing some
$x_j$'s that are larger than $x_i$. In that case, the previous argument shows that the net change of f(x)
is positive after such a change, which would contradict the assumption that the rates maximize the
function.
Figure 8.2: Rates that maximize the sum of utilities as a function of $\alpha$. (Curves: $x_1$ and $x_2 = x_3$, going from the max-sum rates to the max-min rates.)
This example shows that one can model the fair allocation of rates in a network as those
rates that maximize the total utility of the users subject to the capacity constraints imposed by the
links. Moreover, by choosing $\alpha$, one adjusts the tradeoff between efficiency (maximizing the sum of
throughputs) and strict fairness (maximizing the utility of the worst-off user).
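To make the tradeoff concrete, the following Python sketch evaluates the closed-form rates $x_1 = 1/(1 + 2^{1/\alpha})$ and $x_2 = x_3 = 1 - x_1$ of the two-link example for a few values of $\alpha$. The function name `alpha_fair_rates` is hypothetical and chosen only for this illustration.

```python
def alpha_fair_rates(alpha: float) -> tuple[float, float, float]:
    """Rates (x1, x2, x3) that maximize the sum of alpha-fair utilities
    in the two-link example, using x1 = 1 / (1 + 2**(1/alpha))."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    try:
        x1 = 1.0 / (1.0 + 2.0 ** (1.0 / alpha))
    except OverflowError:
        # For very small alpha, 2**(1/alpha) exceeds the float range;
        # x1 is then indistinguishable from its limit 0.
        x1 = 0.0
    return x1, 1.0 - x1, 1.0 - x1


if __name__ == "__main__":
    for alpha in (0.1, 1.0, 2.0, 10.0):
        x1, x2, x3 = alpha_fair_rates(alpha)
        total = x1 + x2 + x3
        print(f"alpha={alpha:5.1f}  x1={x1:.3f}  x2=x3={x2:.3f}  total={total:.3f}")
```

Small values of $\alpha$ give rates close to the max-sum allocation (0, 1, 1), and large values give rates close to the max-min allocation (1/2, 1/2, 1/2).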
8.2.2 DISTRIBUTED CONGESTION CONTROL
In the previous section, we calculated the rates x that maximize the total utility f (x) assuming a
complete knowledge of the network. In practice, this knowledge is not available anywhere. Instead,
the different sources control their rates based only on their local information. We consider that
distributed problem next.
We use the following result that we explain in the appendix.
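Theorem 8.1 Let f(x) be a concave function and $g_j(x)$ be a convex function of x for j = a, b. Then
$x^*$ solves the primal problem
$$\text{Maximize } f(x) \text{ subject to } g_j(x) \le 0, \; j = a, b, \qquad (8.3)$$
if and only if $x^*$ satisfies the constraints in (8.3) and
$$x^* \text{ maximizes } L(x, \lambda^*) := f(x) - \lambda_a^* g_a(x) - \lambda_b^* g_b(x), \qquad (8.4)$$
for some $\lambda_a^* \ge 0$ and $\lambda_b^* \ge 0$ such that
$$\lambda_j^* g_j(x^*) = 0, \text{ for } j = a, b. \qquad (8.5)$$
Moreover, if
$$L(x(\lambda), \lambda) = \max_x L(x, \lambda),$$
then the variables $\lambda_a^*$ and $\lambda_b^*$ minimize
$$L(x(\lambda), \lambda). \qquad (8.6)$$
The function L is called the Lagrangian and the variables $(\lambda_a, \lambda_b)$ the Lagrange multipliers or
shadow prices. The relations (8.5) are called the complementary slackness conditions. The problem (8.6)
is called the dual of (8.3).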
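To apply this result to problem (8.2), we first compute the Lagrangian:
$$L(x, \lambda) = f(x) - \lambda_a g_a(x) - \lambda_b g_b(x)$$
$$= \{u(x_1) - (\lambda_a + \lambda_b)x_1\} + \{u(x_2) - \lambda_a x_2\} + \{u(x_3) - \lambda_b x_3\} + \lambda_a + \lambda_b.$$
Second, we find the value $x(\lambda)$ of x that maximizes $L(x, \lambda)$ for a fixed value of $\lambda = (\lambda_a, \lambda_b)$.
Since $L(x, \lambda)$ is the sum of three functions, each involving only one of the variables $x_i$, plus the
term $\lambda_a + \lambda_b$ that does not depend on x, to maximize this sum, the value of $x_1$ maximizes
$$u(x_1) - (\lambda_a + \lambda_b)x_1.$$
Similarly, $x_2$ maximizes
$$u(x_2) - \lambda_a x_2$$
and $x_3$ maximizes
$$u(x_3) - \lambda_b x_3.$$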
The interpretation is that each link j = a, b charges the user of each flow a price $\lambda_j$ per unit rate.
Thus, the user of flow 1 pays $x_1(\lambda_a + \lambda_b)$ because that flow goes through the two links. That user
chooses $x_1$ to maximize the net utility $u(x_1) - (\lambda_a + \lambda_b)x_1$. Similar considerations apply to flows 2
and 3.
The key observation is that the maximization of L is decomposed into a separate maximization
for each user. The coupling of the variables x in the original problem (8.2) is due to the constraints.
The maximization of L is unconstrained and decomposes into separate problems for each of the
variables. That decomposition happens because the constraints are linear in the flows, so that each
variable $x_i$ appears in a different term in the sum L. Note also that the price that each link charges
is the same for all its flows because the constraints involve the sum of the rates of the flows. That is,
the flows are indistinguishable in the constraints.
To find the prices $\lambda^*$, one can use a gradient algorithm to minimize $L(x(\lambda), \lambda)$:
$$\lambda_j(n+1) = \lambda_j(n) - \beta \frac{d}{d\lambda_j} L(x(\lambda), \lambda) = \lambda_j(n) + \beta g_j(x)$$
where x is the vector of rates at step n of the algorithm and $\beta$ is a parameter that controls the step size
of the algorithm. This expression says that, at each step n, the algorithm adjusts $\lambda$ in the direction
opposite to the gradient of the function $L(x(\lambda), \lambda)$, which is the direction of steepest descent of the
function. For the new value of the prices, the users adjust their rates so that they approach $x(\lambda)$ for
the new prices. That is, at step n user 1 calculates the rate $x_1(n)$ as follows:
$$x_1(n) \text{ maximizes } u(x_1) - (\lambda_a(n) + \lambda_b(n))x_1,$$
and similarly for $x_2(n)$ and $x_3(n)$.
Figure 8.3 illustrates the gradient algorithm. The figure shows that the algorithm adjusts x
in the direction of the gradient of $L(x, \lambda)$ with respect to x, then $\lambda$ in the direction opposite to the
gradient of $L(x, \lambda)$ with respect to $\lambda$. This algorithm searches for the saddle point indicated by a star
in the figure. The saddle point is the minimum over $\lambda$ of the maximum over x of $L(x, \lambda)$.
Figure 8.3: The gradient algorithm for the dual problem. (Axes: x and $\lambda$; surface: $L(x, \lambda)$; saddle point: $(x^*, \lambda^*)$.)
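The price iteration can be tried numerically on the three-flow example. The Python sketch below is an illustration under stated assumptions: it uses the $\alpha$-fair utility (8.1), for which the rate that maximizes $u(x) - px$ is $x = p^{-1/\alpha}$; the names `best_rate` and `dual_gradient`, the step size, the initial prices, and the small positive floor on the prices are all choices made here, not part of the text.

```python
def best_rate(price: float, alpha: float) -> float:
    """Rate maximizing u(x) - price * x for the alpha-fair utility,
    obtained from u'(x) = x**(-alpha) = price."""
    return price ** (-1.0 / alpha)


def dual_gradient(alpha: float = 1.0, beta: float = 0.1, steps: int = 5000):
    """Iterate link prices for the network of Figure 8.1.

    Flow 1 uses links a and b, flow 2 uses link a, flow 3 uses link b.
    Returns the last rates (x1, x2, x3) and the final prices (lam_a, lam_b).
    """
    lam_a, lam_b = 1.0, 1.0  # arbitrary positive initial prices (assumption)
    for _ in range(steps):
        # Each user reacts only to the total price along its own path.
        x1 = best_rate(lam_a + lam_b, alpha)
        x2 = best_rate(lam_a, alpha)
        x3 = best_rate(lam_b, alpha)
        # Constraint values: positive when a link is overloaded.
        g_a = x1 + x2 - 1.0
        g_b = x1 + x3 - 1.0
        # Price update; the floor keeps prices positive (assumption).
        lam_a = max(lam_a + beta * g_a, 1e-9)
        lam_b = max(lam_b + beta * g_b, 1e-9)
    return (x1, x2, x3), (lam_a, lam_b)


if __name__ == "__main__":
    rates, prices = dual_gradient(alpha=1.0)
    print("rates :", [round(r, 4) for r in rates])
    print("prices:", [round(p, 4) for p in prices])
    print("closed form x1:", round(1.0 / (1.0 + 2.0 ** (1.0 / 1.0)), 4))
```

No user or link needs global knowledge in this loop: each rate depends only on the prices along its path, and each price depends only on the load of its own link.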
To see how the nodes can calculate the prices, consider the queue length $q_j(t)$ at node j at
time t. This queue length increases at a rate equal to the total arrival rate of
the flows minus the service rate. Thus, over a small interval $\tau$,
$$q_j((n+1)\tau) = q_j(n\tau) + \tau g_j(x)$$
where x is the vector of rates at time $n\tau$. Comparing the expressions for $\lambda_j(n+1)$ and $q_j((n+1)\tau)$,
we see that if the steps of the gradient algorithm for $\lambda_j$ are executed every $\tau$ seconds, then
$$\lambda_j \approx (\beta/\tau)\, q_j,$$
where $\beta$ is the step size of the gradient algorithm.
That is, the price of the link should be chosen proportional to the queue length at that link. The
intuition should be clear: if queue j builds up, then the link should become more expensive to force
the users to reduce their rate. Conversely, if queue j decreases, then the price of link j should decrease
to encourage the users to increase their usage of that link.
The TCP algorithm does something qualitatively similar to the algorithm discussed above.
To see this, think of the loss rate experienced by a flow as the total price for the flow. This price is
the sum of the prices (loss rates) of the different links. The loss rate increases with the queue
length, roughly linearly if one uses Random Early Detection (RED), which probabilistically drops
packets when the queue length becomes large. A source slows down when the price increases, which
corresponds to readjusting the rate x to maximize u(x) − px when the price p increases.
8.3 DYNAMIC ROUTING AND CONGESTION CONTROL
This section illustrates how the utility maximization framework can naturally lead to distributed
protocols in different layers.
The network in Figure 8.4 shows four nodes and their backlogs $X_1, \ldots, X_4$. Two flows arrive
at the network with rates $R_a$ and $R_b$ that their sources control. Node 1 sends bits to node 2 with rate
$S_{12}$ and to node 3 with rate $S_{13}$. The sum of these rates cannot exceed the transmission rate $C_1$
of node 1. Similarly, the rate $S_{24}$ from node 2 to node 4 cannot exceed the rate $C_2$ of the transmitter
at node 2 and $S_{34}$ cannot exceed $C_3$.
Figure 8.4: A network with two flows. (Flow a enters node 1 and flow b enters node 2; node 1 feeds nodes 2 and 3, which both feed node 4; node 4 is served at rate $C_4$.)
The objective is to maximize the sum of the utilities
$$u_a(R_a) + u_b(R_b)$$
of the flows while preventing the backlogs from increasing excessively, where the functions $u_a$ and $u_b$
are increasing and concave, reflecting the fact that users derive more utility from higher transmission
rates, but with diminishing returns in the sense that an additional unit of rate is less valuable when
the rate is large than when it is small. For instance, one might have
$$u_j(x) = k_j \frac{1}{1-\alpha}\, x^{1-\alpha}, \quad j = a, b,$$
for some $\alpha > 0$ with $\alpha \neq 1$ and $k_j > 0$.
To balance the maximization of the utilities while keeping the backlogs small, we choose the
rates $(R_a, R_b, S_{12}, S_{13}, S_{24}, S_{34})$ to maximize, for some $\beta > 0$,
$$\phi := \beta[u_a(R_a) + u_b(R_b)] - \frac{d}{dt}\left[\frac{1}{2}\sum_{i=1}^{4} X_i^2(t)\right].$$
To maximize this expression, one chooses rates that provide a large utility $u_a(R_a) + u_b(R_b)$ and also
a large decrease of the sum of the squares of the backlogs. The parameter $\beta$ determines the tradeoff
between large utility and small backlogs. If $\beta$ is large, then one weighs the utility more than the
backlogs.
Now,
$$\frac{d}{dt}\left[\frac{1}{2}X_2^2(t)\right] = X_2(t)\frac{d}{dt}X_2(t) = X_2(t)[R_b + S_{12} - S_{24}]$$
because the rate of change of $X_2(t)$ is the arrival rate $R_b + S_{12}$ into node 2 minus the departure rate
$S_{24}$ from that node. The terms for the other backlogs are similar. Putting all this together, we find
$$\phi = \beta[u_a(R_a) + u_b(R_b)] - X_1[R_a - S_{12} - S_{13}] - X_2[R_b + S_{12} - S_{24}] - X_3[S_{13} - S_{34}] - X_4[S_{24} + S_{34} - C_4].$$
We rearrange the terms of this expression as follows:
$$\phi = [\beta u_a(R_a) - X_1 R_a] + [\beta u_b(R_b) - X_2 R_b] + S_{12}[X_1 - X_2] + S_{13}[X_1 - X_3] + S_{24}[X_2 - X_4] + S_{34}[X_3 - X_4] + C_4 X_4.$$
The maximization is easy if one observes that the different terms involve distinct variables.
Observe that the last term does not involve any decision variable. One finds the following:
$$R_a \text{ maximizes } \beta u_a(R_a) - X_1 R_a$$
$$R_b \text{ maximizes } \beta u_b(R_b) - X_2 R_b$$
$$S_{12} = C_1 1\{X_1 > X_2 \text{ and } X_2 < X_3\}$$
$$S_{13} = C_1 1\{X_1 > X_3 \text{ and } X_3 < X_2\}$$
$$S_{24} = C_2 1\{X_2 > X_4\}$$
$$S_{34} = C_3 1\{X_3 > X_4\}.$$
Thus, the source of flow a adjusts its rate $R_a$ based on the backlog $X_1$ in node 1, and similarly for
flow b. The concavity of the functions $u_j$ implies that $R_a$ decreases with $X_1$ and $R_b$ with $X_2$. For the
specific forms of the functions indicated above, one finds $R_a = [\beta k_a/X_1]^{1/\alpha}$ and $R_b = [\beta k_b/X_2]^{1/\alpha}$,
where $\beta$ is the weight of the utility in the objective.
Node 1 sends the packets to the next node (2 or 3) with the smallest backlog, provided that that
backlog is less than $X_1$; otherwise, node 1 stops sending packets. Node 2 sends packets as long as
its backlog exceeds that of the next node and similarly for node 3.
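The decision rules above are simple enough to write down directly. The Python sketch below computes one set of decisions from the current backlogs. It is an illustration only: the function name `backpressure_decisions`, the rate cap `max_rate` used when a backlog is empty, and the tie-breaking in favor of node 2 are assumptions made here.

```python
def backpressure_decisions(X, C, k, alpha, beta, max_rate=float("inf")):
    """One decision step for the four-node example of Figure 8.4.

    X = (X1, X2, X3, X4): backlogs; C = (C1, C2, C3): transmitter rates;
    k = (k_a, k_b): utility weights; alpha, beta: utility and tradeoff
    parameters. Returns a dict with the chosen rates.
    """
    X1, X2, X3, X4 = X
    C1, C2, C3 = C
    k_a, k_b = k

    def source_rate(k_j, backlog):
        # Maximizes beta * u_j(R) - backlog * R for
        # u_j(R) = k_j * R**(1 - alpha) / (1 - alpha).
        if backlog <= 0:
            return max_rate  # assumption: cap the rate when the queue is empty
        return min((beta * k_j / backlog) ** (1.0 / alpha), max_rate)

    # Node 1 serves the neighbor with the smaller backlog, provided that
    # backlog is below its own. Ties go to node 2 (arbitrary choice here).
    S12 = S13 = 0.0
    if min(X2, X3) < X1:
        if X2 <= X3:
            S12 = C1
        else:
            S13 = C1

    return {
        "Ra": source_rate(k_a, X1),
        "Rb": source_rate(k_b, X2),
        "S12": S12,
        "S13": S13,
        # Nodes 2 and 3 transmit only toward a smaller downstream backlog.
        "S24": C2 if X2 > X4 else 0.0,
        "S34": C3 if X3 > X4 else 0.0,
    }


if __name__ == "__main__":
    print(backpressure_decisions(X=(10, 4, 6, 1), C=(1, 1, 1), k=(1, 1),
                                 alpha=2.0, beta=4.0))
```

Each decision uses only the local backlog and the backlogs of the immediate next nodes, which is the property that makes the scheme implementable hop by hop.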
Since the rates $R_a$ and $R_b$ go to zero as $X_1$ and $X_2$ increase, one can expect this scheme to
make the network stable, and this can be proved. In a real network, there are delays in the feedback
about downstream backlogs and the nodes send packets instead of bits. Nevertheless, it is not hard
to show that these effects do not affect the stability of the network.
This mechanism is called a backpressure algorithm. Note that the congestion control is based
only on the next node. This contrasts with TCP, which uses loss signals from all the nodes along the
path. Also, the routing is based on the backlogs in the nodes. Interestingly, the control at each node
is based only on local information about the possible next nodes downstream. This information can
be piggy-backed on the reverse traffic. Thus, this algorithm can be easily implemented in each node
without global knowledge of the whole network.
8.4 APPENDIX: JUSTIFICATION FOR PRIMAL-DUAL THEOREM
In this section, we justify Theorem 8.1.
Consider the problem (8.3) and assume that the solution is $x^*$. Assume that $g_a(x^*) = 0$ and
$g_b(x^*) = 0$. [...] in the direction of
the gradient $\nabla f(x^*)$ [...] or $\nabla g_b(x^*)$.
This condition is satisfied in the left part of Figure 8.5, but not in the right part, where x
increases f beyond $f(x^*)$ without increasing $g_a$ nor $g_b$. The figure shows that the condition implies
that $\nabla f(x^*)$ is a nonnegative linear combination of $\nabla g_a(x^*)$ and $\nabla g_b(x^*)$.
Thus, a necessary condition for optimality of $x^*$ is that
$$\nabla f(x^*) = \lambda_a \nabla g_a(x^*) + \lambda_b \nabla g_b(x^*)$$
for some $\lambda_a \ge 0$ and $\lambda_b \ge 0$ and, moreover, $\lambda_a g_a(x^*) = 0$ and $\lambda_b g_b(x^*) = 0$.
[Note that this argument is not valid if the gradients $\nabla g_a(x^*)$ and $\nabla g_b(x^*)$ ...]
Figure 8.6: The optimal Lagrange multiplier minimizes $D(\lambda)$.
[...] convex. For simplicity of the figure, we have assumed that there is only one constraint $g(x) \le 0$ and
$\lambda$ is the corresponding Lagrange multiplier. Note that if $x(\lambda)$ maximizes $L(x, \lambda)$ where
$$L(x, \lambda) = f(x) - \lambda g(x),$$
then
$$\nabla_x L(x, \lambda) = 0 \text{ for } x = x(\lambda).$$
Consequently,
$$\frac{d}{d\lambda} L(x(\lambda), \lambda) = \nabla_x L(x(\lambda), \lambda) \cdot \frac{dx(\lambda)}{d\lambda} + \frac{\partial}{\partial \lambda} L(x(\lambda), \lambda) = -g(x(\lambda)).$$
Using this observation, the figure shows that the largest value f(x( [...]
Figure 10.2: Buffer with shaped input traffic.
[...] has entered the buffer at time t, then the system serves that packet before the packets that have not
completely entered the buffer at that time. We also assume that each packet has at most P bytes and
enters the buffer contiguously. That is, bytes of different packets are not interleaved on any input
link. Let $X_t$ indicate the number of bytes in the buffer at time t. One has the following result.
Theorem 10.1
Assume that $a_1 + \cdots + a_N \le C$. Then
(a) One has
$$X_t \le B_1 + \cdots + B_N + NP, \quad t \ge 0.$$
(b) The delay of every packet is bounded by $(B_1 + \cdots + B_N + NP + P)/C$.
Proof: (a) Assume that $X_t > B_1 + \cdots + B_N + NP$ and let u be the last time before t that $X_u = NP$.
During [u, t], the system always contains at least one full packet and serves bytes constantly at rate
C. Indeed, each packet has at most P bytes, so it is not possible to have only fragments of packets
that add up to NP bytes. Because of the leaky buckets, at most $B_n + a_n(t - u)$ bytes enter the
buffer in [u, t] from input n. Thus,
$$X_t \le X_u + \sum_{n=1}^{N} [B_n + a_n(t - u)] - C(t - u) \le B_1 + \cdots + B_N + NP,$$
a contradiction.
(b) Consider a packet that completes its arrival into the buffer at time t. The maximum delay
for that packet would arise if all the $X_t$ bytes already in the buffer at time t were served before the
packet. In that case, the delay of the packet before transmission would be $X_t/C$. The packet would
then be completely transmitted at the latest after $(X_t + P)/C$. Combining this fact with the bound
on $X_t$ gives the result.
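As a quick numerical illustration of Theorem 10.1, the Python sketch below evaluates both bounds for a given set of leaky-bucket parameters. The function name `buffer_bounds` and the sample values are hypothetical.

```python
def buffer_bounds(a, B, C, P):
    """Bounds of Theorem 10.1 for N shaped inputs sharing one buffer.

    a -- list of leaky-bucket rates (bytes/s), one per input
    B -- list of leaky-bucket burst sizes (bytes), one per input
    C -- service rate of the buffer (bytes/s)
    P -- maximum packet size (bytes)
    Returns (backlog bound in bytes, delay bound in seconds).
    """
    if len(a) != len(B):
        raise ValueError("a and B must have one entry per input")
    if sum(a) > C:
        raise ValueError("the theorem requires a_1 + ... + a_N <= C")
    N = len(a)
    backlog_bound = sum(B) + N * P           # part (a)
    delay_bound = (backlog_bound + P) / C    # part (b)
    return backlog_bound, delay_bound


if __name__ == "__main__":
    backlog, delay = buffer_bounds(a=[1000, 2000], B=[3000, 5000], C=4000, P=1500)
    print(f"backlog <= {backlog} bytes, delay <= {delay} s")
```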
10.3 SCHEDULING
Weighted Fair Queuing (WFQ) is a scheduling mechanism that controls the sharing of one link
among packets of different classes. We explain that this mechanism provides delay guarantees to
regulated flows. Both the definition of WFQ and its analysis are based on an idealized version of
the scheme called Generalized Processor Sharing (GPS). We start by explaining GPS.
10.3.1 GPS
Figure 10.3: Generalized Processor Sharing. (Left: K classes with weights $w_1, \ldots, w_K$, arrivals $A_k(t)$, and departures $D_k(t)$. Right: timing diagram of $A_1(t)$, $D_1(t)$, $A_2(t)$, $D_2(t)$ for packets $P_1, \ldots, P_4$.)
Figure 10.3 illustrates a GPS system. The packets are classified into K classes and wait in
corresponding first-in-first-out queues until the router can transmit them. Each class k has a weight
$w_k$. The scheduler serves the head-of-line packets at rates proportional to the weights of their classes. That
is, the instantaneous service rate of class k is $w_k C/W$ where C is the line rate out of the router and
W is the sum of the weights of the queues that are backlogged at time t. Note that this model is a
mathematical fiction that is not implementable since the scheduler mixes bits from different packets
and does not respect packet boundaries.
We draw the timing diagram on the right of the figure assuming that only two classes (1 and
2) have packets and with $w_1 = w_2$. A packet $P_1$ of class 2 arrives at time t = 1 and is served at rate
one until time t = 2 when packet $P_2$ of class 1 enters the queue. During the interval of time [2, 4],
the scheduler serves the residual bits of $P_1$ and the bits of $P_2$ with rate 1/2 each.
From this definition of GPS, one sees that the scheduler serves class k at a rate that is always
at least equal to $w_k C/(\sum_j w_j)$. This minimum rate occurs when all the classes are backlogged.
Theorem 10.2 Assume that the traffic of class k is regulated with parameters $(a_k, B_k)$ such that
$\rho_k := w_k C/(\sum_j w_j) \ge a_k$. Then the backlog of class k never exceeds $B_k$ and its queuing delay never exceeds
$B_k/\rho_k$.
Proof: The proof is similar to that of Theorem 10.1. Assume that the backlog $X_t$ (in bytes) of class
k exceeds $B_k$ and let u be the last time before t that $X_u = 0$. During [u, t], the scheduler serves at
least $\rho_k(t - u)$ bytes of class k and at most $a_k(t - u) + B_k$ arrive. Consequently,
$$X_t \le X_u + a_k(t - u) + B_k - \rho_k(t - u) \le B_k$$
since $a_k \le \rho_k$. This is a contradiction that shows that $X_t$ can never exceed $B_k$.
For the delay, consider a bit of class k that enters the buffer at time t when the backlog of
class k is $X_t$. Since the class k buffer is served at least at rate $\rho_k$, that bit leaves no later than
time $t + X_t/\rho_k$. Consequently, the delay cannot exceed $B_k/\rho_k$.
10.3.2 WFQ
As we mentioned, GPS is not implementable. Weighted Fair Queuing approximates GPS. WFQ
is defined as follows. The packets are classified and queued as in GPS. The scheduler transmits one
packet at a time, at the line rate. Whenever it completes a packet transmission, the scheduler starts
transmitting the packet that GPS would complete transmitting first among the remaining packets.
For instance, in the case of the figure, the WFQ scheduler transmits packet $P_1$ during [1, 3.5], then
starts transmitting packet $P_2$, the only other packet in the system. The transmission of $P_2$ completes
at time 4.5. At that time, WFQ starts transmitting $P_3$, and so on.
The figure shows that the completion times of the packets $P_1, \ldots, P_4$ under GPS are
$G_1 = 4.5$, $G_2 = 4$, $G_3 = 8$, $G_4 = 9$, respectively. You can check that the completion times of these
four packets under WFQ are $F_1 = 3.5$, $F_2 = 4.5$, $F_3 = 7$, and $F_4 = 9$. Thus, in this example, the
completion of packet $P_2$ is delayed by 0.5 under WFQ. It seems quite complicated to predict by
how much a completion time is delayed. However, we have the following simple result.
Theorem 10.3 Let $F_k$ and $G_k$ designate the completion times of packet $P_k$ under WFQ and GPS,
respectively, for $k \ge 1$. Assume that the transmission times of all the packets are at most equal to T. Then
$$F_k \le G_k + T, \quad k \ge 1. \qquad (10.1)$$
Proof: Note that GPS and WFQ are work-conserving: they serve bits at the same rate whenever
they have bits to serve. Consequently, the GPS and WFQ systems always contain the same total
number of bits. It follows that they have the same busy periods (intervals of time when they are not
empty). Consequently, it suffices to show the result for one busy period.
Let $S_i$ be the arrival time of packet $P_i$.
Assume $F_1 < F_2 < \cdots < F_K$ correspond to packets within one given busy period that starts
at time 0, say.
Pick any $k \in \{1, 2, \ldots, K\}$. If $G_n \le G_k$ for $n = 1, 2, \ldots, k - 1$, then during the interval
$[0, G_k]$, the GPS scheduler could serve the packets $P_1, \ldots, P_k$, so that $G_k$ is at least the sum of
the transmission times of these packets under WFQ, which is $F_k$. Hence $G_k \ge F_k$, so that (10.1)
holds.
Now assume that $G_n > G_k$ for some $1 \le n \le k - 1$ and let m be the largest such value of n,
so that
$$G_n \le G_k < G_m, \text{ for } m < n < k.$$
This implies that the packets $\mathcal{P} := \{P_{m+1}, P_{m+2}, \ldots, P_{k-1}\}$ must have arrived after the start of
service $S_m = F_m - T_m$ of packet m, where $T_m$ designates the transmission time of that packet. To
see this, assume that one such packet, say $P_n$, arrives before $S_m$. Let $G'_m$ and $G'_n$ be the service times
under GPS assuming no arrivals after time $S_m$. Since $P_m$ and $P_n$ get served in the same proportions
until one of them leaves, it must be that $G'_n < G'_m$, so that $P_m$ could not be scheduled before $P_n$ at
time $S_m$.
Hence, all the packets $\mathcal{P}$ arrive after time $S_m$ and are served before $P_k$ under GPS. Consequently,
during the interval $[S_m, G_k]$, GPS serves the packets $\{P_{m+1}, P_{m+2}, \ldots, P_k\}$. This implies
that the duration of that interval exceeds the sum of the transmission times of these packets, so that
$$G_k - (F_m - T_m) \ge T_{m+1} + T_{m+2} + \cdots + T_k,$$
and consequently,
$$G_k \ge F_m + T_{m+1} + \cdots + T_k - T_m = F_k - T_m,$$
which implies (10.1).
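The example of Figure 10.3 and the bound (10.1) can be checked with a small simulation. The Python sketch below is an illustration under stated assumptions: the packet sizes, the class of $P_3$ and $P_4$, and the arrival time of $P_3$ are not given in the text and were inferred here from the completion times it quotes; the function names are hypothetical. The WFQ routine orders packets by their GPS completion times, as in the definition above.

```python
def gps_finish_times(packets, weights, C=1.0, eps=1e-12):
    """Fluid GPS. packets: list of (arrival, cls, size). Returns completion times."""
    order = sorted(range(len(packets)), key=lambda i: packets[i][0])
    queues = {k: [] for k in weights}      # per-class FIFO of [index, remaining]
    finish = [None] * len(packets)
    t, nxt = 0.0, 0
    while nxt < len(order) or any(queues.values()):
        busy = [k for k in queues if queues[k]]
        W = sum(weights[k] for k in busy)  # total weight of backlogged classes
        # Time until the first head-of-line packet finishes at current rates.
        dt_done = min((queues[k][0][1] * W / (weights[k] * C) for k in busy),
                      default=float("inf"))
        dt_arr = packets[order[nxt]][0] - t if nxt < len(order) else float("inf")
        dt = min(dt_done, dt_arr)
        for k in busy:                     # class k is served at rate w_k C / W
            queues[k][0][1] -= dt * weights[k] * C / W
        t += dt
        for k in busy:
            if queues[k][0][1] <= eps:
                finish[queues[k][0][0]] = t
                queues[k].pop(0)
        while nxt < len(order) and packets[order[nxt]][0] <= t + eps:
            i = order[nxt]
            queues[packets[i][1]].append([i, packets[i][2]])
            nxt += 1
    return finish


def wfq_finish_times(packets, gps_finish, C=1.0):
    """WFQ: when the link is free, send the waiting packet GPS would finish first."""
    finish = [None] * len(packets)
    pending = set(range(len(packets)))
    t = 0.0
    while pending:
        ready = [i for i in pending if packets[i][0] <= t]
        if not ready:                      # link idle: jump to the next arrival
            t = min(packets[i][0] for i in pending)
            continue
        i = min(ready, key=lambda j: gps_finish[j])
        t += packets[i][2] / C             # transmit the whole packet at line rate
        finish[i] = t
        pending.remove(i)
    return finish


# (arrival time, class, size) for P1..P4; sizes and P3's arrival are inferred.
packets = [(1.0, 2, 2.5), (2.0, 1, 1.0), (3.0, 2, 2.5), (6.0, 1, 2.0)]
weights = {1: 1.0, 2: 1.0}
G = gps_finish_times(packets, weights)
F = wfq_finish_times(packets, G)
T = max(size for _, _, size in packets)   # longest transmission time when C = 1

if __name__ == "__main__":
    print("GPS completion times:", G)
    print("WFQ completion times:", F)
    print("F_k <= G_k + T for all k:", all(f <= g + T for f, g in zip(F, G)))
```

With these inferred inputs, the simulation is intended to reproduce the completion times quoted in the text for both schedulers.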
10.4 REGULATED FLOWS AND WFQ
Consider a stream of packets regulated by a token bucket and that arrives at a WFQ scheduler, as
shown in Figure 10.4.
Figure 10.4: Regulated traffic and WFQ scheduler. (The token counter gains a tokens/s, holds up to B tokens, and spends 1 token per bit; the packet buffer feeds the class with weight w of a WFQ scheduler with line rate C.)
We have the following result.
Theorem 10.4 Assume that $a < \rho := wC/W$ where W is the sum of the scheduler weights. Then the
maximum queueing delay per packet is
$$\frac{B}{\rho} + \frac{L}{C}$$
where L is the maximum number of bits in a packet.
Proof: This result is a direct consequence of Theorems 10.2-10.3.
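For a feel of the orders of magnitude, the Python sketch below evaluates the bound of Theorem 10.4. The function name `wfq_delay_bound` and the sample parameters are hypothetical.

```python
def wfq_delay_bound(a, B, w, W, C, L):
    """Delay bound B/rho + L/C of Theorem 10.4, in seconds.

    a -- token rate of the regulated flow (bits/s)
    B -- token bucket depth (bits)
    w -- WFQ weight of the flow's class
    W -- sum of all scheduler weights
    C -- line rate (bits/s)
    L -- maximum packet length (bits)
    """
    rho = w * C / W                # guaranteed service rate of the class
    if not a < rho:
        raise ValueError("the theorem requires a < rho = w*C/W")
    return B / rho + L / C


if __name__ == "__main__":
    bound = wfq_delay_bound(a=1e6, B=100_000, w=1, W=4, C=10e6, L=12_000)
    print(f"maximum queueing delay: {bound * 1000:.2f} ms")
```

The first term is the time to drain a full burst at the guaranteed rate, and the second term is the extra delay of at most one packet time that WFQ adds relative to GPS.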
10.5 END-TO-END QOS
Is it possible to implement differentiated services without modifying the routers? Some researchers
have suggested that this might be feasible.
Consider two applications, voice (telephone conversation) and data (downloading a web page
with many pictures).
Say that the two applications use TCP. In the regular implementation of TCP, it may happen
that voice gets a bit rate that is not sufficient. Consider the following setup.
First, we ask the routers to mark packets when their buffer gets half-full, instead of dropping
packets when they get full. Moreover, the destination marks the ACKs that correspond to marked
packets. (This scheme is called explicit congestion notification, ECN.) Normally, a TCP source
should divide its window size by 2 when it gets a marked ACK. Assume instead that an application
can maintain its normal window size but that the user has to pay whenever it gets a marked ACK.
In this way, if the user does not care about the speed of a connection, it asks its TCP to slow down
whenever it receives a marked ACK. However, if the user is willing to pay for the marks, it does not
slow down.
The practical questions concern setting up an infrastructure where your ISP could send you
the bill for your marks, how you could verify that the charges are legitimate, whether users would
accept variable monthly bills, how to set up preset limits for these charges, the role of competition,
and so on.
10.6 END-TO-END ADMISSION CONTROL
Imagine a number of users that try to use the Internet to place telephone calls (voice over IP). Say
that these calls are not acceptable if the connection rate is less than 40 Kbps. Thus, when TCP
throttles down the rate of the connections to avoid buffer overflows, the result might be
unacceptable telephone calls.
A simple end-to-end admission control scheme solves that problem. It works as follows. When
a user places a telephone call, for the first second, the end devices monitor the number of packets
that routers mark using ECN. If that number exceeds some threshold, the end devices abort the call.
The scheme is such that calls that are not dropped are not disturbed by attempted calls since ECN
marks packets early enough. Current VoIP applications do not use such a mechanism.
One slight objection to the scheme is that once calls are accepted, they might hog the network.
One could modify the scheme to force a new admission phase every two minutes or so.
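The admission test itself is a simple count. The Python sketch below shows one way the end devices might apply it; the function name `admit_call`, the representation of observed packets as (time, marked) pairs, and the sample threshold are assumptions made for this illustration, not part of any VoIP stack.

```python
def admit_call(packets, threshold, probe_seconds=1.0):
    """Decide whether to keep a call after the probing phase.

    packets       -- iterable of (time_since_call_start_s, ecn_marked) pairs
    threshold     -- maximum number of marked packets tolerated while probing
    probe_seconds -- length of the probing phase (one second in the text)
    Returns True to keep the call, False to abort it.
    """
    marked = sum(1 for t, ecn in packets if t <= probe_seconds and ecn)
    # The call is aborted only if the number of marks exceeds the threshold.
    return marked <= threshold


if __name__ == "__main__":
    observed = [(0.02 * i, i % 10 == 0) for i in range(100)]  # 50 packets in 1 s
    print(admit_call(observed, threshold=3))
    print(admit_call(observed, threshold=8))
```

Repeating the same test periodically would correspond to the modification suggested above, in which accepted calls go through a new admission phase from time to time.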
10.7 NET NEUTRALITY
Network Neutrality is a contentious subject. Strict neutrality prescribes that the network must treat
all the packets equally. That is, routers cannot classify packets or apply any type of differentiated
service.
Advocates of an open Internet argue that neutrality is essential to prevent big corporations
from controlling the services that the Internet provides. They claim that a lack of neutrality would limit
freedom of speech and the public good value of the Internet for users who cannot pay for better
services. For instance, one can imagine that some ISPs that provide Internet access and phone services
might want to disable or delay VoIP traffic to force the users to pay for the phone service. One can
also imagine that some ISPs might provide preferential access to content providers with which they
have special business relationships.
Opponents of neutrality observe that it prohibits service differentiation, which is clearly beneficial
for users. Moreover, being unable to charge for better services, the ISPs have a reduced incentive
to upgrade their network. Similarly, content providers may not provide high-bandwidth content if
the ISPs cannot deliver it reliably. It may be that most customers end up losing because of neutrality.
The dynamics of markets for Internet services are complex. No one can forecast the ultimate
consequences of neutrality regulations. One reasonable approach might be to try to design regulations
that enable service differentiation while guaranteeing that some fraction of the network capacity
remains open and neutral.
10.8 SUMMARY
The chapter discusses quality of service in networks.
• Applications have widely different delay and throughput requirements. So, treating packets
differently should improve some applications with little effect on others.
• One approach to guarantee delays to packet streams in the network is to shape the traffic with
leaky buckets and to guarantee a minimum service rate with an appropriate scheduler such as
WFQ.
• WFQ serves packets in the order they would depart under an idealized generalized processor
sharing scheduler. We derive the delay properties of WFQ.
• An approach to end-to-end QoS would consist in charging for marked ACKs (either explicitly
or implicitly by slowing down).
• An end-to-end admission control scheme can guarantee the quality of some connections with
tight requirements.
• Should QoS be authorized? This question is the topic of network neutrality debates and
proposed legislation.
10.9 PROBLEMS
P10.1 For the purpose of this problem, you can consider the idealized Generalized Processor Sharing
(GPS) as being equivalent to Weighted Fair Queueing (WFQ).
Figure 10.5: Figure for QoS Problem 1.
Consider a system of four queues being serviced according to a WFQ scheduling policy as
shown in the figure. The weights given to the four queues (A, B, C, D) are 4, 1, 3, and 2,
respectively. They are being serviced by a processor at the rate of 10 Mbps.
The table below gives a list of different input traffic rates (in Mbps) at the four input queues.
Fill in the resultant output rates for each of these four queues. We have filled in the first two rows
to get you started!
Table 10.2: Table for QoS Problem 2.
   INPUT RATES (Mbps)      OUTPUT RATES (Mbps)
    A    B    C    D        A    B    C    D
    1    1    1    1        1    1    1    1
   10   10   10   10        4    1    3    2
    6    6    2    2
    8    0    0    8
    1    5    3    5
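As an aid for checking answers, the idealized GPS allocation can be sketched as a progressive-filling computation: queues that ask for less than their weighted share get what they ask for, and the leftover capacity is split among the remaining queues in proportion to their weights. The function name `gps_output_rates` is an assumption made for this sketch, and the usage lines reproduce only the two rows already filled in.

```python
def gps_output_rates(inputs, weights, capacity):
    """Output rate of each queue under idealized GPS (fluid model)."""
    out = [0.0] * len(inputs)
    active = [i for i in range(len(inputs)) if inputs[i] > 0]
    remaining = float(capacity)
    while active and remaining > 1e-12:
        total_w = sum(weights[i] for i in active)
        # Queues whose demand fits within their weighted share of what is left.
        satisfied = [i for i in active
                     if inputs[i] <= remaining * weights[i] / total_w + 1e-12]
        if not satisfied:
            # Every remaining queue is backlogged: split by weight and stop.
            for i in active:
                out[i] = remaining * weights[i] / total_w
            break
        for i in satisfied:
            out[i] = float(inputs[i])
            remaining -= inputs[i]
        active = [i for i in active if i not in satisfied]
    return out


weights = [4, 1, 3, 2]  # queues A, B, C, D
print(gps_output_rates([1, 1, 1, 1], weights, 10))      # first row of the table
print(gps_output_rates([10, 10, 10, 10], weights, 10))  # second row of the table
```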
P10.2 Here we analyze the case of two applications sharing a 2 Mbps link as shown in the figure. All
packets are 1000 bits long. Application A is video traffic which arrives at the link at a constant
rate of 1 packet every millisecond. Application B is highly bursty. Its packets arrive as a burst
of 10 Mbits worth of packets every 10 seconds. The incoming rate of the links carrying traffic
from A and B is 10 Mbps each (this limits the peak rate at which packets enter the buffer for
the link).
(a) The scheduler at the link is First Come First Serve (FCFS). If a packet of each of the
applications arrives simultaneously, the packet of application A goes into the queue first
(i.e., it is served first under FCFS). Calculate the following for Application A:
Figure 10.6: Figure for QoS Problem 2.
i. The maximum packet delay.
ii. Assume that this is the only link traversed by application A. What should the size
of the playout buffer be at the receiver?
(b) Calculate the quantities above (in (a)i and (a)ii) for the case when the scheduler is round
robin.
10.10 REFERENCES
The generalized processor sharing scheduling was invented by Kleinrock and Muntz (28) and
Weighted Fair Queuing by Nagle (36). The relationship between GPS and WFQ was derived
by A. Parekh and Gallager (39) who also analyzed the delays of traffic regulated by leaky buckets.
The possibility of bounding delays by regulating flows and using WFQ scheduling suggests a
protocol where users request a minimum bandwidth along a path and promise to regulate their flows.
Such protocols were developed under the name of RSVP (for reservation protocol) and IntServ (for
integrated services). The readers should consult the RFCs for descriptions of these protocols.
End-to-end admission control was proposed by Gibbens and Kelly in (18). Wikipedia is an informative
source for net neutrality and the debates surrounding that issue.
CHAPTER 11
Physical Layer
The most basic operation of a network is to transport bits. Communication links perform that task
and form the Physical Layer of the network.
In this chapter, we describe the most common communication links that networks use and
explain briefly how they work. Our goal is to provide a minimum understanding of the characteristics
of these links. In particular, we discuss the following topics:
• Wired links such as DSL and Ethernet;
• Wireless links such as those of Wi-Fi networks;
• Optical links.
11.1 HOW TO TRANSPORT BITS?
A digital communication link is a device that encodes bits into signals that propagate as electromagnetic
waves and recovers the bits at the other end. (Acoustic waves are also used in specialized situations,
such as under water.) A simple digital link consists of a flashlight that sends messages encoded with
the Morse code: a few short pulses of light (dashes and dots) encode each letter of the message
and these light pulses propagate. (A similar scheme, known as the optical semaphore, was used in
the late 18th and early 19th centuries and was the precursor of the electrical telegraph.) The
light travels at about $c = 3 \times 10^8$ m/s, so the first pulse gets to a destination that is L meters away
from the source after L/c seconds. A message that has B letters takes about B/R seconds to be
transmitted, where R is the average number of letters that you can send per second. Accordingly,
the message reaches the destination after L/c + B/R seconds. The rate R depends on how fast you
are with the flashlight switch; this rate R is independent of the other parameters of the system.
There is a practical limit to how large R can be, due to the time it takes to manipulate the switch.
There is also a theoretical limit: if one were to go too fast, it would become difficult to distinguish
dashes, dots and the gaps that separate them because the amount of light in each dot and dash
would be too faint to distinguish from background light. That is, the system would become prone to
errors if one were to increase R beyond a certain limit. Although one could improve the reliability
by adding some redundancy to the message, this redundancy would tend to reduce R. Accordingly,
one may suspect that there is some limit to how fast one can transmit reliably using this optical
communication scheme.
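The delay formula L/c + B/R is easy to evaluate numerically. The sketch below does so for made-up values (a 3 km link, a 100-letter message, 2 letters per second); the function name and the sample numbers are assumptions chosen only for illustration.

```python
SPEED_OF_LIGHT = 3e8  # m/s, the approximate value used in the text


def message_delay(distance_m, letters, letters_per_second, speed=SPEED_OF_LIGHT):
    """Time until the last letter arrives: propagation time plus transmission time."""
    propagation = distance_m / speed              # L / c
    transmission = letters / letters_per_second   # B / R
    return propagation + transmission


# 3 km away, 100 letters at 2 letters per second.
print(message_delay(3000, 100, 2))
```

With these numbers the transmission time (50 s) dominates the propagation time (10 microseconds) by many orders of magnitude, which is typical of slow links over short distances.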
Similar observations apply to every digital link: 1) The link is characterized by a transmission
rate of R bits per second; 2) the signal propagates through the physical medium at some speed v
that is typically a fraction of the speed of light c. Moreover, the link has some bit error rate BER
equal to the fraction of bits that do not reach the destination correctly. Finally, the link has a speed
limit: it cannot transmit bits reliably faster than its Shannon capacity C, even using the fanciest error
correction scheme conceivable. (See Section 2.2.2.)
11.2 LINK CHARACTERISTICS
Figure 11.1 summarizes the best characteristics of three link technologies: wired (wire pairs or cable),
wireless, and optical fibers. The characteristics shown are the maximum distance for a given bit rate,
for practical systems. As the figure shows, optical fibers are capable of very large transmission rates
over long distances. The Internet uses optical links at 10 Gbps (1 Gbps = 1 gigabit per second =
$10^9$ bits per second) over 80 km.
Wireless links such as those used by Wi-Fi achieve tens of Mbps over up to a hundred meters. A
cellular phone link transmits at about 10 Mbps over a few kilometers, and similarly for a WiMAX
link.
Wired links of a fast Ethernet network transmit at 100 Mbps over up to 110 m. A DSL link
can achieve a few Mbps over up to 5 km and about 10 Mbps over shorter distances. A cable link can
transmit at about 10 Mbps over 1 km.
Figure 11.1: Approximate characteristics of links. (Axes: rate in Mbps, from 0.1 to 10,000; distance, from 1 m to 100 km. Regions: wired, wireless, optical.)
11.3 WIRED AND WIRELESS LINKS
A wireless link uses a radio transmitter that modulates the bits into electrical signals that an
antenna radiates as electromagnetic waves. The receiver is equipped with an antenna that captures the
electromagnetic waves. An amplifier increases the voltage of the received signals, and some circuitry
detects the most likely bits that the transmitter sent based on the received signals. The signal must
have a frequency high enough for the antenna to be able to radiate the wave.
A wired link, over coaxial cables or wire pairs, uses very similar principles. The difference is
that the link can transmit much lower frequencies than a wireless link. For instance, the power grid
carries signals at 60Hz quite effectively. On the other hand, a wired link cannot transmit very high
frequencies because of the skin effect in conductors: at high frequencies, the energy concentrates at
the periphery of the conductor so that only a small section carries the current, which increases the
effective resistivity and attenuates the signal quickly as it propagates. Another important difference
between wired and wireless links is that wireless transmissions may interfere with one another if they
are at the same frequency and are received by the same antenna. Links made up of non-twisted wire
pairs interfere with each other if they are in close proximity, as in a bundle of telephone lines. Twisted
pairs interfere less because electromagnetic waves induce opposite currents in alternate loops.
11.3.1 MODULATION SCHEMES: BPSK, QPSK, QAM
Consider a transmitter that sends a sine wave with frequency 100 MHz of the form
$S_0(t) = \cos(2\pi f_0 t)$ for a short interval of T seconds when it sends a 0 and the signal
$S_1(t) = -\cos(2\pi f_0 t)$ for T seconds when it sends a 1, where $f_0 = 100$ MHz. The receiver
gets the signal and must decide whether it is more likely to be $S_0(t)$ or $S_1(t)$. The problem
would be fairly simple if the received signal were not corrupted by noise. However, the signal that
the receiver gets may be so noisy that it can be difficult to recognize the original signal.
The transmitter's antenna is effective at radiating electromagnetic waves only if their
wavelength is approximately four times the length of the antenna (this is the case for a standard rod
antenna). Recall that the wavelength of an electromagnetic wave with frequency f is equal to the
speed of light c divided by f. For instance, for f = 100 MHz, one finds that the wavelength is
$3 \times 10^8 \text{ (m/s)} / 10^8 \text{ s}^{-1} = 3$ meters. Thus, the antenna should have a length of about 0.75 meter, or
two and a half feet. A cell phone with an antenna that is ten times smaller should transmit sine waves
with a frequency ten times larger, or about 1 GHz.
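The antenna sizing rule above (wavelength = c/f, antenna length about a quarter of the wavelength) can be checked with a short computation. The function name is an assumption made for this sketch.

```python
def quarter_wave_antenna_length(frequency_hz, c=3e8):
    """Approximate rod antenna length in meters for a carrier at frequency_hz."""
    wavelength = c / frequency_hz   # lambda = c / f
    return wavelength / 4           # wavelength is about 4x the antenna length


print(quarter_wave_antenna_length(100e6))  # 100 MHz carrier
print(quarter_wave_antenna_length(1e9))    # 1 GHz carrier
```

The two calls correspond to the two cases in the text: about 0.75 m at 100 MHz, and an antenna ten times smaller at 1 GHz.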
How does the receiver tell whether it received $S_0(t)$ or $S_1(t) = -S_0(t)$? The standard method
is to multiply the received signal by a locally generated sine wave $L(t) = \cos(2\pi f_0 t)$ during the
interval of T seconds and to see if the average value of the product is positive or negative. Indeed,
$L(t) S_0(t) \geq 0$ whereas $L(t) S_1(t) \leq 0$. Thus, if the average value of the product is positive, it
is more likely, even in the presence of noise, that the receiver was listening to $S_0(t)$ than to $S_1(t)$.
It is essential for the locally generated sine wave $\cos(2\pi f_0 t)$ to have the same frequency $f_0$ as the
transmitted signal. Moreover, the timing of that sine wave must agree with that of the received signal.
To match the frequency, the receiver uses a special circuit called a phase-locked loop. To match the
timing (the phase), the receiver uses the preamble in the bit stream that the transmitter sends at
the start of a packet. This preamble is a known bit pattern so that the receiver can tell whether it is
mistaking zeros for ones.
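The correlation test described above can be tried on sampled signals. The following sketch assumes a receiver whose local sine wave is already perfectly synchronized (frequency and phase), a sampling rate of 20 samples per carrier cycle, and additive Gaussian noise; all of these choices and the function names are illustrative assumptions.

```python
import math
import random


def bpsk_signal(bit, f0, dt, n, noise_std=0.0, rng=random):
    """n noisy samples of S0(t) = cos(2*pi*f0*t) for bit 0, or S1(t) = -S0(t) for bit 1."""
    sign = 1.0 if bit == 0 else -1.0
    return [sign * math.cos(2 * math.pi * f0 * k * dt) + rng.gauss(0.0, noise_std)
            for k in range(n)]


def bpsk_detect(samples, f0, dt):
    """Multiply by the local wave L(t) = cos(2*pi*f0*t) and look at the sign of the average."""
    avg = sum(s * math.cos(2 * math.pi * f0 * k * dt)
              for k, s in enumerate(samples)) / len(samples)
    return 0 if avg > 0 else 1


f0 = 100e6               # carrier frequency from the text
dt = 1 / (20 * f0)       # assumed sampling interval: 20 samples per cycle
rng = random.Random(1)   # fixed seed so that the run is repeatable
received = bpsk_signal(1, f0, dt, 2000, noise_std=2.0, rng=rng)
print(bpsk_detect(received, f0, dt))
```

Even though each noisy sample is individually unreliable here (the noise standard deviation is twice the signal amplitude), averaging over 100 carrier cycles makes the sign of the correlation a dependable indicator.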
Figure 11.2: Top: BPSK (left), QPSK (middle), and QAM (right); Bottom: Demodulation.
Using this approach, called Binary Phase Shift Keying (BPSK) and illustrated in Figure 11.2,
the transmitter sends one bit (0 or 1) every T seconds. Another approach, called
Quadrature Phase Shift Keying (QPSK), enables the transmitter to send two bits every T seconds.
QPSK works as follows. During T seconds, the transmitter sends a signal
$a \cos(2\pi f_0 t) + b \sin(2\pi f_0 t)$. The transmitter chooses the coefficients a and b among the four
possible pairs of values $(\pm 1, \pm 1)$, one for each of the four possible pairs of bits {00, 01, 10, 11}
(see Figure 11.2). Thus, to send the bits 00, the transmitter sends the signal
$-\cos(2\pi f_0 t) - \sin(2\pi f_0 t)$ for T seconds, and similarly for the other pairs of bits. To recover
the coefficient a, the receiver multiplies the received signal by $2 \cos(2\pi f_0 t)$ and computes the
average value of the product during T seconds. This average value is equal to a. Indeed,
$$
\begin{aligned}
M(T) &:= \int_0^T 2 \cos(2\pi f_0 t) \, [a \cos(2\pi f_0 t) + b \sin(2\pi f_0 t)] \, dt \\
&= a \int_0^T 2 \cos^2(2\pi f_0 t) \, dt + b \int_0^T 2 \sin(2\pi f_0 t) \cos(2\pi f_0 t) \, dt \\
&= a \int_0^T [1 + \cos(4\pi f_0 t)] \, dt + b \int_0^T \sin(4\pi f_0 t) \, dt.
\end{aligned}
$$
It follows that the average value of the product is given by
$$
\frac{1}{T} M(T) \approx a
$$
for T not too small. This demodulation procedure is shown in Figure 11.2. The value of the coefficient
b is recovered in a similar way.
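The approximation (1/T) M(T) ≈ a can be checked numerically. The sketch below evaluates the two averages with a midpoint rule, for a symbol time T that is deliberately not a whole number of carrier periods; the function name, the step count, and the sample values are assumptions made for illustration.

```python
import math


def qpsk_demodulate(a, b, f0, T, steps=10000):
    """Average 2*cos(2*pi*f0*t)*s(t) and 2*sin(2*pi*f0*t)*s(t) over [0, T],
    where s(t) = a*cos(2*pi*f0*t) + b*sin(2*pi*f0*t) is the noise-free signal."""
    dt = T / steps
    m_a = m_b = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt                  # midpoint of the k-th sub-interval
        w = 2 * math.pi * f0 * t
        s = a * math.cos(w) + b * math.sin(w)
        m_a += 2 * math.cos(w) * s * dt     # contributes to M(T)
        m_b += 2 * math.sin(w) * s * dt     # same computation for coefficient b
    return m_a / T, m_b / T


# One coefficient pair, 100 MHz carrier, T = 100.25 carrier periods.
a_hat, b_hat = qpsk_demodulate(-1, 1, 100e6, 1.0025e-6)
print(round(a_hat, 3), round(b_hat, 3))
```

The recovered values differ from the transmitted coefficients only by the small leftover of the double-frequency terms, which shrinks as T grows relative to the carrier period.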
One can extend this idea to a larger set of coefficients (a, b). The quadrature amplitude
modulation (QAM) scheme shown in Figure 11.2 chooses a set of $2^n$ coefficients (a, b) evenly
spaced in $[-1, 1]^2$. Each choice is then transmitted for T seconds and corresponds to transmitting n
bits. Of course, as one increases n, the spacing between the coefficients gets smaller and the system
becomes more susceptible to errors. Thus, there is a tradeoff between the transmission rate (n bits
every T seconds) and the probability of error. The basic idea is that one should use a larger value
of n when the ratio of the power of the received signal over the power of the noise is larger. For
instance, in the DSL (digital subscriber loop) system, the set of frequencies that the telephone line
can transmit is divided into small intervals. In each interval, the system uses QAM with a value of
n that depends on the measured noise power.
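To make the tradeoff concrete, the sketch below builds a square grid of $2^n$ coefficient pairs in $[-1, 1]^2$ and reports the spacing between neighboring points. It assumes n is even so that the grid is square; the function names are assumptions, and practical systems may use other constellation shapes.

```python
def qam_constellation(n):
    """Return 2**n points (a, b) evenly spaced on a square grid in [-1, 1]^2."""
    if n < 2 or n % 2:
        raise ValueError("this sketch handles even n >= 2 only")
    side = 2 ** (n // 2)                       # points per axis
    levels = [-1 + 2 * i / (side - 1) for i in range(side)]
    return [(a, b) for a in levels for b in levels]


def min_spacing(n):
    """Distance between adjacent levels on one axis of the square grid."""
    side = 2 ** (n // 2)
    return 2 / (side - 1)


for n in (2, 4, 6):
    print(n, len(qam_constellation(n)), round(min_spacing(n), 3))
```

For n = 2 the grid reduces to the four QPSK points; each increase of n by 2 roughly halves the spacing, which is why larger constellations need a better signal-to-noise ratio.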
11.3.2 INTER-CELL INTERFERENCE AND OFDM
In a cellular system, space is divided into cells. The users in one cell communicate via the base station
of that cell. Assume that the operator has bought the license for a given range of frequencies, say
from 1GHz to 1.01GHz. There are a number of ways that the operator can allocate that spectrum
to different cells. One extreme way is to split the 10MHz spectrum into disjoint subsets, say seven
subsets of about 1.4MHz each, and to allocate the subsets to the cells in a reuse pattern as shown
in Figure 11.3. The advantage of this approach is that the cells that use the same frequencies are far
apart and do not interfere with each other. A disadvantage is that the spectrum allocated to each cell
is small, which is wasteful if some cells are not very busy.
Figure 11.3: Cellular frequency reuse. (Cells are labeled 1 to 7 according to the subset of frequencies they use.)
A different approach is to let all the cells use the full spectrum. However, to limit interference,
one has to use a special scheme. We describe one such scheme: Orthogonal Frequency Division
Multiplexing (OFDM). In OFDM, the spectrum is divided into many narrow subcarriers, as shown
in Figure 11.4.
Figure 11.4: OFDM subcarriers and symbol times. (Frequency axis: subcarriers F1 to F5, separated by the subcarrier spacing; time axis: symbol times T1 to T4.)
During each symbol time, the nodes modulate the subcarriers separately to transmit bits.
A careful design of the subcarrier spacing and the symbol time duration makes the subcarriers
orthogonal to each other. This ensures that the subcarriers do not interfere with each other, and at
the same time, for higher system efficiency, they are packed densely in the spectrum without any
frequency guard bands. For example, an OFDM system with 10 MHz spectrum can be designed
to have 1000 subcarriers 10 kHz apart and a symbol time of 100 μs. The symbol times are long
compared to the propagation time between the base station and the users in the cell. Consequently,
even though some signals may be reflected and take a bit longer to reach a user than other signals,
the energy of the signal in one symbol time hardly spills into a subsequent symbol time. That is,
there is very little inter-symbol interference.
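The orthogonality in the numerical example (10 kHz spacing, 100 μs symbol time) can be verified by averaging the product of two subcarriers over one symbol time. The sketch below works at baseband, with subcarrier k represented as a cosine at k times the spacing; this simplification and the function name are assumptions made for illustration.

```python
import math


def subcarrier_inner_product(k, m, spacing_hz, symbol_time, steps=2000):
    """Average of cos(2*pi*k*df*t) * cos(2*pi*m*df*t) over one symbol time."""
    dt = symbol_time / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt   # midpoint rule
        total += (math.cos(2 * math.pi * k * spacing_hz * t)
                  * math.cos(2 * math.pi * m * spacing_hz * t))
    return total / steps


spacing = 10e3        # 10 kHz subcarrier spacing
symbol_time = 100e-6  # 100 microseconds, i.e. 1 / spacing
print(round(10e6 / spacing))  # number of subcarriers in 10 MHz
print(round(subcarrier_inner_product(3, 5, spacing, symbol_time), 6))  # different subcarriers
print(round(subcarrier_inner_product(3, 3, spacing, symbol_time), 6))  # same subcarrier
```

The product of two different subcarriers averages to zero because each subcarrier completes a whole number of cycles in one symbol time, which is exactly what the choice symbol time = 1/spacing guarantees.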
OFDM requires a single user to utilize the whole spectrum at any given time. It
accommodates sharing between different users by time division multiplexing, i.e., allowing them to use
the full spectrum during different symbol times. Orthogonal Frequency Division Multiple Access
(OFDMA) is an evolution of OFDM where different users are allowed to share the spectrum at the
same time. For this, OFDMA creates subchannels by grouping a number of subcarriers, and makes
use of subchannels and symbol times to share the spectrum among the users by deploying both
frequency and time division multiplexing. Note that the frequency division multiplexing is done in
the granularity of subchannels. As discussed in the WiMAX-LTE chapter, both WiMAX and LTE
make use of the OFDMA technology. To fix ideas, in the numerical example above, there can be
30 subchannels of 24 subcarriers each for user data with some subcarriers left over for pilots, etc.¹
¹ The terminology we have used to explain the OFDM and OFDMA concepts is from mobile WiMAX, and the numerical
example here is close to one of its downlink configurations. LTE is based on similar concepts, but uses a different terminology.
See the WiMAX-LTE chapter for further information.
What about interference between different cells? The trick to limit such interference is to
allocate the different subcarriers to users in a pseudo-random way in time and frequency with the
latter being possible only if OFDMA is used. With such a scheme, one user interferes with another
only a small fraction of the time, and the effect of such interference can be mitigated by using some
error correction codes. One advantage is that if a cell is not very busy, it automatically creates less
interference for its neighboring cells.
11.4 OPTICAL LINKS
The magic of optical fibers is that they can carry light pulses over enormous distances (say 100 km)
with very little attenuation and little dispersion. Low attenuation means that the power of light after
a long distance is still strong enough to detect the pulses. Low dispersion means that the pulses do
not spread much as they travel down the fiber. Thus, pulses of light that are separated by gaps of a
fraction of a nanosecond are still separated after a long distance.
11.4.1 OPERATION OF FIBER
The simplest optical fiber is called a step-index fiber, as shown in Figure 11.5. Such a fiber is made
of a cylindrical core surrounded by a material with a lower refractive index. If the incident angle of
light at the boundary between the two materials is shallow enough (as in the figure), the light beam
is totally reflected by the boundary. Thus, light rays propagate along the fiber in a zig-zag path. Note
that rays with different angles with the axis of the fiber propagate with different speeds along the
fiber because their paths have different lengths. Consequently, some of the energy of light travels
faster than the rest of the energy, which results in a dispersion of the pulses. This dispersion limits
the rate at which one can generate pulses if one wishes to detect them after they go through the
fiber. Since the dispersion increases linearly with the length of the fiber, one finds that the duration
of pulses must be proportional to the length of the fiber. Equivalently, the rate at which one sends
the pulses must decrease in inverse proportion with the length of the fiber.
Figure 11.5: Step-index fiber (top, showing total reflection) and single-mode fiber (bottom).
The bottom part of Figure 11.5 shows a single-mode fiber. The geometry is similar to that
of the step-index fiber. The difference is that the core of the single-mode fiber has a very small
diameter (less than 8 microns). One can show that, quite remarkably, only the rays parallel to the
axis of the fiber can propagate. The other rays that are even only slightly askew self-interfere and
disappear. The net result is that all the energy in a light pulse now travels at the same speed and faces
a negligible dispersion. Consequently, one can send much shorter pulses through a single-mode fiber
than through a step-index fiber.
11.4.2 OOK MODULATION
The basic scheme, called On-Off Keying (OOK), to send bits through an optical fiber is to turn a laser
on and off to encode the bits 1 and 0. The receiver has a photodetector that measures the intensity
of the light it receives. If that intensity is larger than some threshold, the receiver declares that it got
a 1; otherwise, it declares a 0. The scheme is shown in Figure 11.6.
To keep the receiver clock synchronized, the transmitter needs to send enough ones. To
appreciate this fact, imagine that the transmitter sends a string of 100 zeroes by turning its laser off
for $100 \times T$ seconds, where T is the duration of every symbol. If the receiver clock is 1% faster
than that of the transmitter, it will think that it got 101 zeroes. If its clock is 1% slower, it will only
get 99 zeroes. To prevent such sensitivity to the clock speeds, actual schemes insert ones in the bit
stream.
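The clock-drift example can be reproduced with a few lines of arithmetic. The sketch assumes a receiver that gets no transitions to resynchronize on during the silent run and simply counts symbol periods on its own clock; the function name and the 1 ns symbol time are assumptions made for illustration.

```python
def zeros_counted(n_sent, symbol_time, clock_error):
    """Number of zeros a free-running receiver counts during a silent run.

    clock_error: +0.01 means the receiver clock runs 1% fast, -0.01 means 1% slow.
    """
    silent_duration = n_sent * symbol_time
    # A fast clock makes the receiver's symbol period shorter than the sender's.
    receiver_symbol_time = symbol_time / (1 + clock_error)
    return round(silent_duration / receiver_symbol_time)


print(zeros_counted(100, 1e-9, +0.01))  # receiver clock 1% fast
print(zeros_counted(100, 1e-9, -0.01))  # receiver clock 1% slow
```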
The minimum duration of a pulse of light should be such that the energy in the pulse is large
enough to distinguish it from a background noise with a high probability. That is, even in the
absence of light, the receiver is subject to some noise that appears as if some light had impinged on the
photodetector. The source of this noise is the thermal noise in the receiver circuitry. Thus, in the
absence of light, the receiver gets a current that is equivalent to receiving a random amount of light.
In the presence of a light pulse, the receiver gets a random amount of light whose average value is
larger than in the absence of a light pulse. The problem is to detect whether the received random
amount of light has mean $\lambda_0$ or mean $\lambda_1 > \lambda_0$. This decision can be made with a small probability
of error only if $\lambda_1 - \lambda_0$ is large enough. Since $\lambda_1$ decays exponentially with the length of the fiber
as it gets divided by 2 every K km, this condition places a limit on the length of the fiber for a
specified probability of error. Note also that $\lambda_1 - \lambda_0$ is the average amount of light that reaches the
receiver from the transmitter and that this quantity is proportional to the duration of a light pulse.
This discussion suggests that the constraint on the length L of the fiber and the duration T of a
light pulse must have the following form, for some positive constants $\alpha$ and $\beta$:
$$
T \geq \alpha \exp\{\beta L\}.
$$
11.4.3 WAVELENGTH DIVISION MULTIPLEXING
Wavelength Division Multiplexing (WDM) is the transmission of multiple signals encoded at
different wavelengths on the same optical fiber. The basic scheme is shown in Figure 11.6. Each bit stream
is encoded as a signal using OOK. Each signal modulates a different laser. The light outputs of the
lasers are sent through the fiber. At the output of the fiber, a prism separates the different wavelengths
and sends them to distinct photodetectors. The photodetectors produce a current proportional to
the received light signal. Some detection circuitry then reconstructs the bit stream.
Figure 11.6: Wavelength Division Multiplexing. Different signals are encoded at different wavelengths. (Bit streams drive laser diodes; a prism combines the wavelengths into the fiber, and a second prism separates them toward the photodetectors.)
Practical systems can have up to about 100 different wavelengths. The challenge in building
such systems is to have lasers whose light wavelengths do not change with the temperature of the
lasers. If each laser can be modulated with a 10 Gbps bit stream, such a system can carry 1 Tbps
(= $10^{12}$ bps), for up to about 100 km.
11.4.4 OPTICAL SWITCHING
It is possible to switch optical signals without converting them to electrical signals. Figure 11.7
illustrates a micro-electro-mechanical (MEMS) optical switch. This switch consists of small mirrors that can
be steered so that the incident light can be guided to a suitable output fiber.
Figure 11.7: MEMS optical switch. (Steerable mirrors guide light from the IN-fibers to the OUT-fibers.)
Such a switch modifies the paths that light rays follow in the network. The network can use
such switches to recover after a fiber cut or some failure. It can also adjust the capacity of different
links to adapt to changes in the traffic, say during the course of the day.
Researchers are exploring systems to optically switch bursts of packets. In such a scheme, the
burst is preceded by a packet that indicates the suitable output port of the burst. The switch then
has time to configure itself before the burst arrives. The challenge in such systems is to deal
with conflicts that arise when different incoming bursts want to go out on the same switch port. An
electronic switch stores packets until the output port is free. In an optical switch, buffering is not
easy and may require sending the burst in a fiber loop. Another approach is hot-potato routing, where
the switch sends the burst to any available output port, hoping that other switches will eventually
route the burst to the correct destination. Packet erasure codes can also be used to make the network
resilient to losses that such contentions create.
11.4.5 PASSIVE OPTICAL NETWORK
A passive optical network (PON) exploits the broadcast capability of optical systems to eliminate the
need for switching. PONs are used as broadband access networks either directly to the user or to a
neighborhood closet from which the distribution uses wire pairs.
Figure 11.8: Passive Optical Network. (Downlink signals A, B, C are broadcast to all subscribers; uplink signals a, b, c are merged toward the distribution center.)
Figure 11.8 illustrates a passive optical network. The box with the star is a passive optical
splitter that sends the optical signals coming from its left to all the fibers to the different subscribers
on the right. Conversely, the splitter merges all the signals that come from the subscribers into the
fiber going to the distribution center on the left.
The signals of the different subscribers are separated in time. On the downlink, from the left
to the right, the distribution system sends the signals A, B, C destined to the different subscribers
in a time division multiplexed way. For the uplinks, the subscriber systems time their transmissions
to make sure they do not collide when they get to the distribution center. The network uses special
signals from the distribution center to time the transmissions of the subscriber systems.
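The uplink timing can be sketched as follows: each subscriber starts transmitting early by its own propagation delay, so that its burst reaches the distribution center exactly at the start of its assigned slot. The sketch assumes that the subscribers' clocks are already aligned with the distribution center (the role of the special timing signals), that the slot assignments are given, and that light propagates in the fiber at about $2 \times 10^8$ m/s; the function name and the sample values are assumptions.

```python
FIBER_SPEED = 2e8  # m/s, an assumed typical propagation speed in fiber


def uplink_send_times(slot_starts, distances_m, speed=FIBER_SPEED):
    """Send time for each subscriber so that its burst arrives at its slot start.

    slot_starts: arrival time (s) assigned to each subscriber at the distribution center.
    distances_m: fiber length (m) from each subscriber to the distribution center.
    """
    return {name: slot_starts[name] - distances_m[name] / speed
            for name in slot_starts}


slots = {"a": 1.0e-3, "b": 1.1e-3, "c": 1.2e-3}   # hypothetical slot assignment
distances = {"a": 2000, "b": 10000, "c": 500}     # hypothetical fiber lengths
for name, t in sorted(uplink_send_times(slots, distances).items()):
    print(name, round(t * 1e6, 2), "microseconds")
```

Because all bursts share the fiber between the splitter and the distribution center, bursts that arrive in disjoint slots at the center also do not overlap anywhere on the shared segment.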
11.5 SUMMARY
This chapter is a brief glimpse into the topic of digital communication.
• A digital communication link converts bits into signals that propagate as electromagnetic
waves through some medium (free space, wires, cable, optical fiber) and converts the signals
back into bits.
• The main characteristics of a link are its bit rate, length, and bit error rate. A wireless or wired
link has a speed limit: its capacity, which depends on the range of frequencies it can transmit
and on the signal-to-noise ratio at the receiver.
• Attenuation limits the length of an optical link and dispersion limits the product of its
length by the bit rate. WDM enables sending different bit streams using different wavelengths.
• Optical switching is much simpler than electronic switching. However, because optical
buffering is difficult, it requires new mechanisms to handle contention.
• The modulation schemes such as BPSK, QPSK, and QAM achieve a tradeoff between rate
and error protection.
• A cellular network can use a frequency reuse pattern. OFDM coupled with interference
mitigation techniques allows all the cells to use the full spectrum.
11.6 REFERENCES
Any book on digital communication covers the basic material of this section. See, e.g., (40). For
optical networks, see (42).
CHAPTER 12
Additional Topics
The previous chapters explained the main operation and organization principles of networks. In this
chapter, we discuss a number of additional topics.
12.1 OVERLAY NETWORKS
By definition, an overlay network is a set of nodes that are connected by some network technologies
and that cooperate to provide some services. In the Internet, the overlay nodes communicate via
tunnels. Figure 12.1 illustrates such an overlay network. The four nodes A-D are attached to the
network with nodes 1-12. In the figure, nodes u, x, y, z are other hosts.
Node A can send a packet to Node D by first sending it to node C and having node C forward
the packet to node D. The procedure is for A to send a packet of the form [A | C | A | D | Packet].
Here, [A | C] designates the IP header for a packet that goes from A to C. Also, [A | D] are the
overlay network source (A) and destination (D). When node C gets this packet, it removes the IP
header and sees the packet [A | D | Packet]. Node C then forwards the packet to D as [C | D | A
| D | Packet] by adding the IP header [C | D]. We say that A sends packets to C by an IP tunnel.
Note that the overlay nodes A-D implement their own routing algorithm among themselves. Thus,
the network 1-12 decides how to route the tunnels between pairs of overlay nodes A-D, but the
overlay nodes decide how a packet should go from A to D across the overlay network (A-C-D in
our example). This additional layer of decision enables the overlay network to implement functions
that the underlying network 1-12 cannot.
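The encapsulation steps of the A-C-D example can be mimicked with plain lists. In this sketch a packet is the list [IP source, IP destination, overlay source, overlay destination, payload], and `next_hop` is a hypothetical overlay routing table introduced only for the illustration.

```python
def encapsulate(ip_src, ip_dst, overlay_src, overlay_dst, payload):
    """Build [IP src | IP dst | overlay src | overlay dst | payload]."""
    return [ip_src, ip_dst, overlay_src, overlay_dst, payload]


def forward(packet, me, next_hop):
    """At overlay node `me`: strip the IP header, then re-encapsulate toward
    the next overlay hop, or return None if the packet has reached its destination."""
    _ip_src, _ip_dst, overlay_src, overlay_dst, payload = packet
    if overlay_dst == me:
        return None  # delivered locally
    return encapsulate(me, next_hop[me][overlay_dst], overlay_src, overlay_dst, payload)


# Hypothetical overlay routing table: A reaches D via C, and C reaches D directly.
next_hop = {"A": {"D": "C"}, "C": {"D": "D"}}

p1 = encapsulate("A", next_hop["A"]["D"], "A", "D", "Packet")
print(p1)                      # tunnel from A to C
p2 = forward(p1, "C", next_hop)
print(p2)                      # tunnel from C to D
print(forward(p2, "D", next_hop))
```

The underlying network only ever looks at the first two fields; the overlay nodes only look at the third and fourth, which is what lets the two routing decisions stay independent.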
For instance, the overlay network can implement multicast. Say that x multicasts packets to
u, y, z. One possibility is for x to send the packets to A which sends them to C that then replicates
them and sends one copy to u, one to y, and one to z.
Another possibility is for the overlay network to perform performance-based routing. For
instance, to send packets from y to u, node y can send the packets to B which monitors two paths to
u: one that goes through A and one through C. The overlay network then selects the fastest path.
One benefit is that such a scheme can react quickly to problems in the underlying network.
The overlay network can also provide storage or computing services.
12.1.1 APPLICATIONS: CDN AND P2P
We examine two specific examples: Content Distribution Networks (CDN) and Peer-to-Peer Networks
(P2P Networks).
Figure 12.1: Overlay Network: Nodes A, B, C, D form an overlay network. (The underlying network consists of nodes 1-12; u, x, y, z are other hosts.)
Content Distribution Network (CDN)
A widely used CDN is Akamai. It is an overlay network of servers. The idea is to serve users from
a nearby CDN server to improve the response time of the system.
Consider once again the network in Figure 12.1. Assume that the nodes A-D are servers that
store copies of some popular web site. When node x wants access to the content of that web site, it
connects to the main server of that site that then redirects it to the closest content server (say A).
Such redirection can happen in a number of ways. In one approach, the DNS server might be able
to provide an IP address (of A) based on the source address of the DNS request (x). In an indirect
approach, x contacts the main server of the web site which then replies with a web page whose links
correspond to server A.
Peer-to-Peer Network (P2P Network)
A P2P network is an overlay network of hosts. That is, end-user devices form a network typically
for the purpose of exchanging files.
The first example of a P2P network was Napster (1999-2001). This network used a central
directory that indicated which hosts had copies of different files. To retrieve a particular file, a user
would contact the directory and get a list of possible hosts to download the file from. After legal
challenges from music producers that Napster was an accomplice to theft, Napster was ordered to
stop its operation.
Subsequent P2P networks did not use a central server. Instead, the P2P hosts participate in a
distributed directory system. Each node has a list of neighbors. To find a file, a user asks his neighbors,
which in turn ask their neighbors. When a host receives a request for a file it has, it informs the
host that originated the request and does not propagate the request any further. Thus, the host that
originates the request eventually gets a list of hosts that have the file and it can choose from which
one to get the file. BitTorrent, a popular P2P network, arranges for the requesting host to download
in parallel from up to five different hosts.
Once a host has downloaded a file, it is supposed to make it available for uploads by other hosts.
In this way, popular files can be downloaded from many hosts, which makes the system automatically
scalable.
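The neighbor-to-neighbor search described above can be sketched as a breadth-first flood. This is a minimal sketch under our own assumptions: the overlay is given as a dictionary of neighbor lists, the function name `flood_search` and the example topology are hypothetical, and a host that has the file answers without forwarding the query.

```python
from collections import deque

def flood_search(neighbors, have_file, origin):
    """Return the set of hosts that answer a query flooded from `origin`.

    neighbors: dict mapping each host to a list of neighbor hosts.
    have_file: set of hosts that hold the requested file.
    A host that has the file answers and does not forward the query.
    """
    seen = {origin}
    queue = deque([origin])
    answers = set()
    while queue:
        host = queue.popleft()
        if host != origin and host in have_file:
            answers.add(host)      # reply to the originator, stop propagating
            continue
        for nb in neighbors[host]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return answers

# Hypothetical overlay with links a-b, a-c, b-d, c-d, d-e.
overlay = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}
found = flood_search(overlay, have_file={"b", "e"}, origin="a")
```

In this example the query reaches host e through c and d. If d also had the file, the query would stop at d and e would never be asked.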
12.1.2 ROUTING IN OVERLAY NETWORKS
The nodes of an overlay network typically have no knowledge of the details of the underlying network
topology or capacities. This lack of information may result in serious inefficiencies in the routing
across an overlay network.
12.2 HOW POPULAR P2P PROTOCOLS WORK
As of year 2009, statistics (from www.ipoque.com) show that roughly 70% of the Internet traffic is from
P2P networks and only 20% is from web browsing. Although there are big variations from continent
to continent, 70% of the P2P traffic uses the BitTorrent protocol and eDonkey contributes
about 20% of the P2P traffic. Gnutella, Ares, and DirectConnect are the next most popular P2P contributors.
Although web based multimedia streaming and file hosting appear to be gaining popularity, overall
BitTorrent is still the king of the application layer traffic of the Internet!
Since a substantial amount of traffic is P2P related, a network engineer needs to understand
how P2P protocols work in overlay networks. Proper understanding will help us design a better
network architecture, traffic shaping policy, and pricing policy. In this section, we study how several
popular P2P protocols work from the architectural point of view, but refrain from delineating the
protocol specifications.
12.2.1 1ST GENERATION: SERVER-CLIENT BASED
UC Berkeley released a project in 1999 that enables distributed computing at networked home
computers. The user side software performs time-frequency signal processing of extraterrestrial
radio signals that are downloaded from a central SETI@HOME server, in the hope of detecting the
existence of intelligent life outside the Earth. The SETI@HOME server distributes chunks of data to
participating users to make use of their idle processing power, and the users return the computation
results back to the server. The message path of SETI@HOME is of the Client-Server form. However,
some classify this as the 1st generation of P2P. Strictly speaking, this is distributed grid computing.
12.2.2 2ND GENERATION: CENTRALIZED DIRECTORY BASED
The 2nd generation of P2P protocol is centralized directory based. Napster received a great deal of
attention as an MP3 file sharing service. Upon startup, a Napster client logs on to the central directory
server and notifies the server of the list of its shared files. Clients maintain the connection with the
server and use it as a control message path. When a client needs to search for and download a file, it
issues a query to the server. The server looks up its directory and returns the IP address of the owner
of that file. Then the client initiates a TCP connection to the file owner and starts downloading. In
this generation, the central server played the critical role of providing the search results. Therefore, it
was easy to shut down the entire service by just shutting down the central server. A later version of
Napster allowed client-client browsing.
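The role of the central directory can be illustrated with a small sketch. The class name `CentralDirectory`, its methods, and the addresses below are hypothetical and only mirror the steps described above (log on with a list of shared files, query, then download directly from an owner); they are not Napster's actual protocol.

```python
from collections import defaultdict

class CentralDirectory:
    """Toy central directory: maps file names to the hosts that share them."""

    def __init__(self):
        self._owners = defaultdict(set)

    def login(self, host_ip, shared_files):
        """A client logs on and reports the list of files it shares."""
        for name in shared_files:
            self._owners[name].add(host_ip)

    def logout(self, host_ip):
        """Remove a departing client from every entry."""
        for hosts in self._owners.values():
            hosts.discard(host_ip)

    def query(self, name):
        """Return the sorted addresses of the hosts that share `name`."""
        return sorted(self._owners.get(name, set()))

directory = CentralDirectory()
directory.login("10.0.0.1", ["song.mp3", "talk.mp3"])
directory.login("10.0.0.2", ["song.mp3"])
owners = directory.query("song.mp3")   # the client then downloads directly from an owner
```

The sketch also makes the weakness visible: every search goes through the single `directory` object, so the service stops when that server stops.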
12.2.3 3RD GENERATION: PURELY DISTRIBUTED
The primary feature of the 3rd generation P2P protocol is its fully distributed operation; it does not
require the presence of a centralized server. Instead, every participating host works both as a client
and a server. Gnutella operates in this manner, and has been quite successful. Before starting up,
a client should have a list of candidate IP addresses. Upon startup, it tries to connect to them and
learns the true list of working IP addresses. A search query is flooded over the Gnutella network
over the active connections. A host that has the answer to the query responds along with its IP
address. This scheme can suffer from scalability related issues. The number of connections and the
way of flooding can easily overwhelm the available bandwidth and the processing power. The later
versions of Gnutella thus adopt the notion of super nodes.
12.2.4 ADVENT OF HIERARCHICAL OVERLAY - SUPER NODES
A P2P protocol in the pure form assumes that all hosts are equal. However, from the bandwidth and
processing power considerations, some hosts are superior. By using them as relays for the network,
the scalability issues can be mitigated. They are called super nodes in contrast to the normal leaf
nodes. The introduction of the super nodes brings a hierarchy to the P2P overlay networks. Upon startup,
a client connects to the super nodes instead of connecting to other clients directly. A search query is
propagated via super nodes. Skype, a popular VoIP client, and the later version of Gnutella make use
of the super nodes. Any Skype node with proper configuration can automatically be a super node.
There have been reports that many university based networked computers are serving as super nodes
because of their higher speed Internet connections and openness.
There is the issue of being able to connect to the clients behind the NAT devices. If only the
receiver client is behind a NAT device, the sender client is able to connect to it by issuing a call back
command. When both clients are behind the NAT devices, they use another Skype client residing
in the public IP domain as a relay between the sender and the receiver.
12.2.5 ADVANCED DISTRIBUTED FILE SHARING: BITTORRENT
BitTorrent contributes the lion's share of the entire Internet traffic: it constitutes roughly 50% of
it. Compared to other protocols explained above, the primary difference for BitTorrent lies in the
greater redundancy in file distribution. A specialized web server, named tracker, maintains the list
of the peers that are currently transferring a given file. (We do not differentiate between a Torrent
Tracker and Indexer in our discussion.) Any peer that wishes to download the
file first connects to the tracker and obtains the list of active peers. A single file is split into many
pieces, typically of 16 KBytes, and is exchanged between the peers using these pieces. This enables a
downloader to connect to multiple peers and transfer several pieces simultaneously. A unique policy
called rarest-first is used to select the pieces. The downloader first downloads the least shared
piece, increasing the redundancy of that piece over the network. After that, the downloader is able to
serve as an uploader at the same time. So there is no pure downloader. All participating clients become
peers. In order to discourage free riders who do not contribute to uploading, BitTorrent adopts the
policy known as Tit-for-Tat, in which a peer provides uploading to the peers who reciprocally upload
to it.
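The rarest-first policy can be sketched as follows. This is an illustration under our own assumptions, not the BitTorrent specification: the function name `rarest_first`, the peer names, and the deterministic tie-breaking rule (lowest piece index) are hypothetical.

```python
from collections import Counter

def rarest_first(my_pieces, peer_pieces):
    """Pick the missing piece that is held by the fewest peers.

    my_pieces: set of piece indices already downloaded.
    peer_pieces: dict mapping each peer to the set of pieces it holds.
    Returns a piece index, or None if no peer has a piece we still need.
    """
    counts = Counter()
    for pieces in peer_pieces.values():
        counts.update(pieces - my_pieces)   # count only the pieces we lack
    if not counts:
        return None
    # Fewest copies first; ties are broken by the lowest piece index (assumed).
    return min(counts, key=lambda piece: (counts[piece], piece))

peers = {
    "p1": {0, 1, 2},
    "p2": {0, 1},
    "p3": {0, 3},
}
choice = rarest_first(my_pieces={0}, peer_pieces=peers)
```

Here pieces 2 and 3 are each held by a single peer, so one of them is selected before piece 1, which two peers already hold.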
12.3 SENSOR NETWORKS
A wireless sensor network is a collection of sensor nodes that are equipped with a wireless transceiver.
Depending on the application, the nodes can sense temperature, humidity, acceleration, intensity of
light, sound, the presence of chemicals or biological agents, or other aspects of the environment. The
nodes can relay packets to each other and to a gateway to the Internet or to some supervisor host.
Potential applications of this technology include environment monitoring to detect pollution
or hazardous leaks, the health of a forest, the watering of a vineyard, the motion of animals for
scientific investigations, the vital signs of patients, the structural integrity of a building after an
earthquake, the noise around an urban environment, ground motions as indicators of seismic activities,
avalanches, vehicular traffic on highways or city streets, and so on.
This technology has been under development for a dozen years. Although there are relatively
few major deployments of wireless sensor networks to date, the potential of the technology warrants
paying attention to it. In this section, we discuss some of the major issues faced by wireless sensor
networks.
12.3.1 DESIGN ISSUES
The first observation is that which issues are important depends strongly on the specific application.
It might be tempting to develop generic protocols for sensor networks as was done for the Internet,
hoping that these protocols would be flexible enough to support most applications. However,
experience has shown that this is a misguided approach. As the discussion below shows, it is critical to
be aware of the specific features of the application at hand when developing the solution.
Energy
Imagine thousands of wireless sensor nodes deployed to measure traffic on a highway system. It is
important for the batteries of these nodes to last a long time. For instance, say that there are ten
thousand nodes with a battery life of 1,000 days. On average, ten batteries must be replaced every
day. Obviously, if the sensors have access to a power source, this issue does not matter; this is the
case for sensors on board a vehicle or sensors that can be connected to the power grid.
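As a quick check of this arithmetic, assume (as an idealization) that battery failures are spread evenly over time. Then $N$ nodes with a battery life of $T$ days require about $N/T$ replacements per day:

```latex
\frac{N}{T} = \frac{10{,}000 \text{ nodes}}{1{,}000 \text{ days}} = 10 \text{ replacements per day}.
```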
Measurements of typical sensor nodes show that the radio system consumes much more energy
than the other components of the nodes. Moreover, the radios consume about the same amount
of energy in the receive, idle, and transmit modes. Consequently, the only effective way to reduce the energy
consumption of such nodes is to make the radio system sleep as much as possible. In the case of the
highway sensors, it would be efficient to keep the sensors active for counting the traffic and turn
on the radios periodically just long enough for the nodes to relay their observations. The nodes can
synchronize their sleep/wakeup patterns by using the reception time of packets to resynchronize
their clocks.
Researchers have explored techniques for scavenging energy using solar cells, motion-activated
piezoelectric materials, thermocouples, or other schemes.
Location
Imagine hundreds of sensor nodes dropped from a helicopter on the desert floor with the goal of
detecting and tracking the motion of army vehicles. For the measurements to be useful, the nodes
must be able to detect their location. One possibility is to equip each node with a GPS subsystem
that identifies its precise location. However, the cost of this approach may be excessive. Researchers
have designed schemes for the nodes to measure the distances to their neighboring nodes. One
method is to add an ultrasonic transceiver to each node. Say that node A sends a chirp and that its
neighbor B replies with its own chirp. By measuring the delay, node A can estimate its distance to
B. The chirps can include the identification of the chirping nodes. A similar approach is possible
using radio transmitters, as used by airplane transponders.
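The distance estimate from the chirp delay can be sketched as follows. The function name `distance_from_chirp`, the speed of sound (343 m/s, roughly the value in dry air at 20 degrees Celsius), and the assumption that node B's reply delay is known to node A are ours and are not part of the schemes cited in the text.

```python
SPEED_OF_SOUND_M_PER_S = 343.0   # assumed: dry air at about 20 degrees Celsius

def distance_from_chirp(round_trip_s, reply_delay_s=0.0, speed=SPEED_OF_SOUND_M_PER_S):
    """Estimate the A-B distance from the round-trip delay measured at A.

    round_trip_s: time between A's chirp and A hearing B's reply.
    reply_delay_s: processing delay at B before it replies (assumed known).
    """
    time_of_flight = (round_trip_s - reply_delay_s) / 2.0   # one-way travel time
    if time_of_flight < 0:
        raise ValueError("reply delay exceeds the measured round trip")
    return speed * time_of_flight

# 30 ms round trip, of which 10 ms is B's reply delay: about 3.43 m.
d = distance_from_chirp(round_trip_s=0.030, reply_delay_s=0.010)
```

The slow propagation of sound is what makes the measurement practical: a timing error of 1 ms corresponds to about 17 cm of one-way distance under these assumptions.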
The problem of determining locations from pairwise distances between nodes is nontrivial
and has received considerable attention. The first observation is that if one builds a polygon with
sides of given lengths, the object may not be rigid. For instance, a square can be deformed
into a rhombus which has different node locations. Thus, a basic requirement to locate the nodes is
that they must gather enough pairwise distances for the corresponding object to be rigid. A second
observation is that a rigid object may have multiple versions that correspond to rotations, flips,
or centrally symmetric modifications. The nodes need enough information to disambiguate such
possibilities. A third observation is that even if the locations are unique given the measurements,
finding an efficient algorithm to calculate these locations is also nontrivial and a number of clever
schemes have been developed for this purpose.
A simpler version of the location problem arises in some applications. Say that every light
fixture and every thermostat in a building has a unique identification number. However, the installer
of these devices does not keep track of the specific location of each device. That is, the locations of
all the nodes are known but the mapping of ID to location has to be determined after the fact. We
let you think of viable approaches to solve this problem.
Addresses and Routing
Is it necessary for every sensor node to have a unique IP address? In the Internet, NAT devices enable
the reuse of IP addresses. A similar approach can be used for sensor nodes. However, the nodes then need
to use UDP/IP protocols, and simpler addressing schemes are conceivable. For instance, in some
applications, the location may be a suitable address. If the goal is to find the temperature in a room
of a building, the specific ID of the corresponding sensor does not matter. If the nodes are mobile,
then a suitable addressing based on location is more challenging.
Routing in a sensor network may be quite challenging, depending on the application. If the
nodes are fixed and always send information to a specific gateway, then the network can run some
shortest path algorithm once or rarely. If the nodes are moving, the routing algorithm can either run
in the background or nodes can discover routes when needed. As for the other protocols, routing
should be designed with a good understanding of the characteristics of the application.
In-Network Processing and Queries
Say that the goal of a sensor network is to measure the highest temperature to which its nodes
are exposed. One approach is for all the nodes to periodically report their temperature and for the
supervising host to calculate the largest value. A different approach is for each node to compare its
own temperature with the value that its neighbors report; the node then forwards only the maximum
value. A similar scheme can be designed to measure the average value of the temperatures. More
generally, one can design a communication scheme together with processing rules by the individual
nodes to calculate a given function of the node measurements. The goal may be to minimize the
number of messages that the nodes must transmit.
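The second approach can be sketched on a routing tree rooted at the gateway. The tree, the node names, the readings, and the function name `aggregate_max` are hypothetical; the sketch only illustrates that each node forwards a single value (the maximum of its subtree) instead of relaying every reading.

```python
def aggregate_max(children, readings, node):
    """Return (maximum in the subtree of `node`, messages sent in that subtree).

    children: dict mapping a node to the list of its children in a routing
              tree rooted at the gateway (the tree itself is an assumption).
    readings: dict mapping each node to its own temperature.
    Each non-root node sends exactly one message, its subtree maximum, to its parent.
    """
    best = readings[node]
    messages = 0
    for child in children.get(node, []):
        child_max, child_msgs = aggregate_max(children, readings, child)
        best = max(best, child_max)
        messages += child_msgs + 1      # +1 for the child's report to `node`
    return best, messages

tree = {"gw": ["n1", "n2"], "n1": ["n3", "n4"]}
temps = {"gw": 20.0, "n1": 21.5, "n2": 19.0, "n3": 25.0, "n4": 22.0}
highest, msgs = aggregate_max(tree, temps, "gw")
```

In this example the in-network scheme uses 4 transmissions, one per non-gateway node. Relaying every raw reading to the gateway over the same tree would take 6 transmissions, since the readings of n3 and n4 must each travel two hops.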
For a given sensor network, one may want to design a query language together with an
automatic way of generating the processing rules and the messages to be exchanged to answer the
query.
12.4 DISTRIBUTED APPLICATIONS
Networks execute distributed applications to implement many different protocols including BGP,
OSPF, RIP and TCP. More generally, nodes on a network can be used to implement distributed
applications for users. The properties of distributed applications are of interest to designers of the
applications and the protocols. This section explores a number of representative applications and
their properties.
12.4.1 BELLMAN-FORD ROUTING ALGORITHM
In the Routing chapter, we explained that the Bellman-Ford algorithm converges after a finite
number of steps if the network topology does not change during that time. Recall that when running
this algorithm, each node $i = 1, 2, \ldots, J$ maintains an estimate $x_n(i)$ of its shortest distance to a
given destination node, say node $D$. Here, $n = 0, 1, \ldots$ denotes the algorithm steps. In the basic
version of the algorithm, if node $i$ gets a new message $x_n(j)$ from a neighbor $j$, the node updates
its estimate to $x_{n+1}(i)$ that it calculates as follows:

$$x_{n+1}(i) = \min\{x_n(i), d(i, j) + x_n(j)\}.$$

In this expression, $d(i, j)$ is the length of the link from $i$ to $j$. The initial values are $x_0(i) = \infty$
for $i \neq D$ and $x_0(D) = 0$. If the network topology does not change, this algorithm results in
non-increasing values of $x_n(i)$. Let the shortest path from $i$ to $D$ be the path $(i, i_1, i_2, \ldots, i_{k-1}, i_k, D)$.
Eventually, there is one message from $D$ to $i_k$, then one from $i_k$ to $i_{k-1}$, and so on, then one from
$i_1$ to $i$. After those messages, $x_n(i)$ is the shortest distance from $i$ to $D$.
To make the algorithm converge when the topology changes, one has to modify the update
rules. One modification is to let a node reset its estimate to $\infty$ if it gets a higher estimate from a
neighbor. Such an increase indicates that the length of a link increased and that one should restart
the algorithm. To implement this modification, the nodes must remember the estimates they got
from their neighbors.
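The basic version of the algorithm (for a fixed topology) can be simulated with a queue of messages. This is a minimal sketch: the function name `bellman_ford`, the message queue, and the example topology are our own assumptions, and the reset rule for topology changes is not modeled.

```python
import math
from collections import deque

def bellman_ford(links, dest):
    """Distributed Bellman-Ford toward `dest`, simulated with a message queue.

    links: dict mapping (i, j) to the length d(i, j) of the link from i to j.
    Returns the dict of final estimates x(i).
    """
    nodes = {i for link in links for i in link}
    x = {i: math.inf for i in nodes}     # x_0(i) = infinity for i != D
    x[dest] = 0.0                        # x_0(D) = 0
    # A message (j, x_j) carries node j's current estimate to its neighbors.
    queue = deque([(dest, x[dest])])
    while queue:
        j, x_j = queue.popleft()
        for (i, k), d in links.items():
            if k != j:
                continue
            new = min(x[i], d + x_j)     # x_{n+1}(i) = min{x_n(i), d(i, j) + x_n(j)}
            if new < x[i]:
                x[i] = new
                queue.append((i, new))   # i informs its own neighbors of the improvement
    return x

# Hypothetical topology with symmetric link lengths.
lengths = {("A", "B"): 1, ("B", "D"): 2, ("A", "C"): 5, ("C", "D"): 1, ("B", "C"): 1}
links = {}
for (i, j), d in lengths.items():
    links[(i, j)] = d
    links[(j, i)] = d
dist = bellman_ford(links, "D")
```

A node sends a new message only when its estimate decreases, so the estimates are non-increasing and the simulation stops once no message changes any estimate.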
12.4.2 TCP
In the Models chapter, we briefly discussed how TCP can be viewed as a primal-dual algorithm to
maximize the sum of the utilities of the flows in the network subject to the capacity constraints. This
primal-dual algorithm is distributed because of the particular structure of the primal problem. Let
us review how this works.
The primal problem is as follows:

$$\text{Maximize } \sum_i u_i(x_i)$$
$$\text{Subject to } Ax \leq C.$$

In this formulation, $u_i(x_i)$ is the utility that the user of connection $i$ derives from the rate $x_i$
of the connection. Also, $A$ is the incidence matrix whose entry $A(k, i)$ is equal to one if connection
$i$ goes through link $k$. Finally, $C$ is the column vector of link capacities $C_k$. The constraints $Ax \leq C$
mean that the total rate of the connections that go through link $k$ cannot exceed the capacity $C_k$ of
the link, for every $k$.
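The dual version of this problem is as follows:

$$\text{Let } L(x, \lambda) = \sum_i u_i(x_i) - \lambda^T [Ax - C]$$
$$D(\lambda) := L(x(\lambda), \lambda) = \max_x L(x, \lambda)$$
$$\lambda^* = \arg\min_{\lambda \geq 0} D(\lambda).$$

Here, $\lambda$ is the column vector whose components $\lambda_k$ are the Lagrange multipliers that
correspond to the capacity constraints. One knows that $x(\lambda^*)$ solves the primal problem. The algorithm is

$$x_i(n+1) = \Big[x_i(n) + \alpha \frac{\partial}{\partial x_i} L(x(n), \lambda(n))\Big]^+ = \big[x_i(n) + \alpha u_i'(x_i(n)) - \alpha [\lambda^T(n) A]_i\big]^+$$
$$\lambda_k(n+1) = \Big[\lambda_k(n) - \beta \frac{\partial}{\partial \lambda_k} L(x(n), \lambda(n))\Big]^+ = \big[\lambda_k(n) + \beta [Ax(n) - C]_k\big]^+.$$

In these expressions, $\alpha$ and $\beta$ are step sizes and $[z]^+ = \max\{z, 0\}$.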
One key observation is that the update of $x_i$ depends only on the sum $[\lambda^T A]_i$ of the Lagrange
multipliers along the path that connection $i$ uses. In particular, this update does not depend explicitly
on the utility functions of the other connections, nor on the rates of these connections or even on
the number of such connections. All this dependency on the factors not known to connection $i$ is
summarized in the sum of the Lagrange multipliers. In TCP, one can think of the loss rate (or total
delay) that the connection sees as a measure of this sum of the Lagrange multipliers. The reason for
this limited dependency is twofold. Firstly, the objective function in the primal problem is the sum
of the utilities of the connections, so that the derivative with respect to $x_i$ does not depend on $u_j(x_j)$
for $j \neq i$. Secondly, $x_i$ enters in the capacity constraint for link $k$ as $\lambda_k x_i$ if connection $i$ uses link $k$.
The gradient is then the sum of the $\lambda_k$ and does not depend on the rates of the other connections
or even on which other connections use link $k$.
Similarly, the update of $\lambda_k$ is governed by the derivative of $L$ with respect to $\lambda_k$. This is the
derivative of the inequality constraint term giving the difference between the rate through link $k$
and the capacity $C_k$ of that link. Thus, the update of $\lambda_k$ is that difference and it does not depend on
the utilities nor on the details of which connections contribute to the rate through the link.
Our discussion justifies the fact that the primal-dual algorithm is decomposed into separate
updates for the routers that adjust the multipliers $\lambda_k$ and the hosts that adjust their rates $x_i$. The fact
that the algorithm converges (for suitable step sizes) is due to the convexity of the problem.
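The decomposition into host updates and router updates can be illustrated numerically. The sketch below is not TCP: it assumes logarithmic utilities $u_i(x) = \log x$, fixed step sizes, a small positive floor `x_min` on the rates to keep the derivative $1/x$ finite, and a hypothetical two-link network; the function name `primal_dual` is ours.

```python
def primal_dual(routes, capacity, alpha=0.01, beta=0.01, steps=20000, x_min=1e-6):
    """Primal-dual iteration for utilities u_i(x) = log(x) (an assumed choice).

    routes[i]: list of the links that connection i uses (column i of A).
    capacity[k]: capacity C_k of link k.
    Returns (rates x, multipliers lam) after `steps` iterations.
    """
    x = [0.5] * len(routes)
    lam = [0.0] * len(capacity)
    for _ in range(steps):
        # Host i only needs the sum of the multipliers along its own path.
        path_price = [sum(lam[k] for k in route) for route in routes]
        # Router k only needs the total rate through its own link.
        load = [0.0] * len(capacity)
        for i, route in enumerate(routes):
            for k in route:
                load[k] += x[i]
        # u_i'(x) = 1/x for the log utility; x_min keeps the derivative finite.
        x = [max(x_min, xi + alpha * (1.0 / xi - q)) for xi, q in zip(x, path_price)]
        lam = [max(0.0, lk + beta * (yk - ck)) for lk, yk, ck in zip(lam, load, capacity)]
    return x, lam

# Connection 0 uses links 0 and 1; connections 1 and 2 use one link each.
rates, prices = primal_dual(routes=[[0, 1], [0], [1]], capacity=[1.0, 1.0])
```

For this example, solving the optimality conditions by hand gives the rates (1/3, 2/3, 2/3) and the multipliers (1.5, 1.5), which the iteration should approach for these step sizes. Note that each host uses only its own path price and each router only its own load.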
12.4.3 POWER ADJUSTMENT
Another distributed algorithm in networks is that used by wireless CDMA nodes to adjust their
power. CDMA, for code division multiple access, is a mechanism that allows different nodes to transmit
at the same time in some range of frequencies by assigning them different codes, which makes their
transmissions almost orthogonal. The details of this mechanism are not essential for our discussion.
What is important for us is the observation that a transmission from one node can cause interference
at the other nodes. The power adjustment problem is to find a scheme for the nodes to adjust
their transmission power so that the communications between nodes are successful. The idea is
that a transmitter may want to increase its power so that its receiver gets a more powerful signal.
However, by increasing its power, the transmitter generates more interference for other receivers and
deteriorates their operation. How can the nodes figure out a good tradeoff in a distributed way? This
situation is not unlike TCP where an increase in the rate of one connection increases the losses of
other connections.
To formulate the problem, imagine that there are pairs of (transmitter, receiver) nodes. Suppose
for the pair $i$, the associated transmitter $i$ sends packets to the associated receiver $i$ with power $P_i$.
Let $G(i, j)$ be the fraction of the power $P_i$ that reaches receiver $j$. In addition, assume that receiver $j$
also hears some noise with power $\eta_j$ due to the thermal noise and the sources external to the nodes of
the network under consideration. Thus, receiver $j$ receives the power $P_j G(j, j)$ from the associated
transmitter, noise power $\eta_j$, and interference power $\sum_{i \neq j} P_i G(i, j)$ from the other transmitters.
The key quantity that determines the quality of the operations of node $j$ is the
signal-to-interference-plus-noise ratio, SINR. For node $j$, this ratio is $R_j$ given by

$$R_j = \frac{P_j G(j, j)}{\sum_{i \neq j} P_i G(i, j) + \eta_j}.$$

That is, $R_j$ measures the power of the signal $P_j G(j, j)$ from its transmitter divided by the sum of
the power of the noise plus that of the interference from the other transmitters.
The communication from transmitter $j$ to receiver $j$ is successful if $R_j \geq \gamma_j$. Intuitively, if
the signal is sufficiently powerful compared with the noise and interference, then the receiver is able
to detect the bits in the signal with a high enough probability. Mathematically, the power adjustment
problem is then to find the vector $P = (P_i)$ of minimum powers such that $R_j \geq \gamma_j$ for all $j$. We
can write the constraint $R_j \geq \gamma_j$ as follows:

$$P_j G(j, j) \geq \gamma_j \Big[\sum_{i \neq j} P_i G(i, j) + \eta_j\Big].$$

Equivalently,

$$P_j \geq \sum_{i \neq j} P_i A(i, j) + \beta_j$$

where $A(i, j) = \gamma_j G(j, j)^{-1} G(i, j)$ for $i \neq j$ and $\beta_j = \gamma_j G(j, j)^{-1} \eta_j$. Defining $A(i, i) = 0$, we
can write these inequalities in the vector form as

$$P \geq PA + \beta.$$

A simple adjustment scheme is then

$$P(n+1) = P(n)A + \beta, \quad n \geq 0$$

where $P(n)$ is the vector of powers that the transmitters use at step $n$ of the algorithm. To explore
the convergence of the algorithm, assume that

$$P^* = P^* A + \beta.$$

Then we find that

$$P(n+1) - P^* = [P(n) - P^*]A,$$

so that, by induction,

$$P(n) - P^* = [P(0) - P^*]A^n, \text{ for } n \geq 0.$$

It can be shown that if the eigenvalues of matrix $A$ have magnitude less than 1, then $P(n)$
converges to $P^*$. Written for component $j$, the adjustment scheme is

$$P_j(n+1) = \sum_{i \neq j} P_i(n) A(i, j) + \beta_j = \gamma_j G(j, j)^{-1} \Big[\sum_{i \neq j} P_i(n) G(i, j) + \eta_j\Big].$$

We can write this update rule as follows:

$$P_j(n+1) = \gamma_j G(j, j)^{-1} N_j(n)$$

where $N_j(n) = \sum_{i \neq j} P_i(n) G(i, j) + \eta_j$ is the total interference plus noise power that receiver $j$
hears. Thus, if transmitter $j$ knows $G(j, j)$ and $N_j(n)$, it adjusts its transmission power so that its
receiver gets a SINR exactly equal to the target $\gamma_j$. To implement this algorithm, receiver $j$ must
indicate to its transmitter the value of $N_j(n)$ and also that of $P_j(n) G(j, j)$, so that the transmitter
can determine $G(j, j)$.
Thus, the power adjustment of CDMA nodes has a simple distributed solution that requires
a minimum amount of exchange of control information between the transmitters and receivers. The
solution is simple because a receiver only needs to indicate to its transmitter the power of the signal
it receives and the total power of interference and noise.
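The power adjustment iteration can be tried on a small example. In this sketch, the names `power_control` and `sinr`, the variable names `gamma` (target SINR of each pair) and `eta` (noise power at each receiver), and the numerical gains are our own assumptions; the update itself is the rule that sets each power to the target times the measured interference plus noise, divided by the direct gain.

```python
def power_control(G, eta, gamma, steps=200):
    """Iterate P_j(n+1) = gamma_j * N_j(n) / G(j, j), starting from P(0) = 0.

    G[i][j]: fraction of the power of transmitter i that reaches receiver j.
    eta[j]: noise power at receiver j; gamma[j]: target SINR of pair j.
    """
    n = len(eta)
    P = [0.0] * n
    for _ in range(steps):
        # N_j(n): interference plus noise that receiver j reports to its transmitter.
        N = [sum(P[i] * G[i][j] for i in range(n) if i != j) + eta[j] for j in range(n)]
        P = [gamma[j] * N[j] / G[j][j] for j in range(n)]
    return P

def sinr(G, eta, P, j):
    """Signal-to-interference-plus-noise ratio at receiver j."""
    interference = sum(P[i] * G[i][j] for i in range(len(P)) if i != j)
    return P[j] * G[j][j] / (interference + eta[j])

G = [[1.0, 0.1], [0.2, 1.0]]      # hypothetical gains
eta = [0.1, 0.1]
gamma = [2.0, 2.0]
P = power_control(G, eta, gamma)
```

For these hypothetical values, the matrix A has entries A(0, 1) = 0.2 and A(1, 0) = 0.4, so its eigenvalues have magnitude about 0.28 and the iteration converges. At the fixed point, both receivers see an SINR equal to their target of 2.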
12.5 BYZANTINE AGREEMENT
Many applications in a network require nodes to coordinate their actions by exchanging messages.
A fundamental question concerns the reliability of such coordination schemes when the network
may fail to deliver messages. We consider some examples.
12.5.1 AGREEING OVER AN UNRELIABLE CHANNEL
Consider two generals (A and B) who want to agree whether to attack a common enemy tomorrow
at noon. If they fail to attack jointly, the consequences are disastrous. They agree that general A
will make the decision whether to attack or not and will then send a messenger with the decision
to general B. General B can then send the messenger back to general A to confirm, and so on.
However, the messenger has a small probability of getting captured when it travels between the
generals. Interestingly, there is no protocol that guarantees success for generals A and B.
To appreciate the problem, say that general A sends a message with the decision to attack.
General A cannot attack unless it knows that B got the message. To be sure, B sends the
messenger back to A with an acknowledgment. However, B cannot attack until it knows that A got the
acknowledgment. For that reason, A sends back the messenger to B to inform him that he got the
acknowledgment. However, A cannot attack until it knows that B got his acknowledgment, and so
on.
Let us prove formally that no algorithm can solve this agreement problem. By solving the
problem, we mean that if general A decides not to attack, both A and B should eventually agree not
to attack. Also, if A decides to attack, then A and B should eventually agree to attack. The proof is
by contradiction. Say that some algorithm solves the problem. Consider the sequence of steps of this
algorithm when general A decides to attack and all the messages get delivered. Assume that A and
B agree to attack after $n$ messages (the first one from A to B, the second one from B to A, and so on).
Say that $n$ is even, so that the last message is from B to A. Consider what happens if that message
is not delivered. In that case, B still decides to attack. Thus, A knows that B will decide to attack
whether A gets message $n$ or not. Accordingly, A must decide to attack after it gets message $n - 2$.
By symmetry, one concludes that B must decide to attack after it gets message $n - 3$. Continuing in
this way, we conclude that A and B must agree to attack even if no message is exchanged. But then,
they would also agree to attack even if A had decided not to attack, a contradiction. Hence, there
cannot be an algorithm that solves the problem. A similar argument can be constructed for an odd
$n$.
This problem has been analyzed when there is a lower bound on the probability that each
message is lost and one can show that there is an algorithm that solves the problem with a high
probability.
12.5.2 CONSENSUS IN THE PRESENCE OF ADVERSARIES
Consider that there are three agents A, B, C. One of them is dishonest but the other two do not
know who he is. We want to design an algorithm where the honest agents agree on a common value
in $\{0, 1\}$. In particular, if the two honest agents start with the same preferred choice $X \in \{0, 1\}$,
then they should eventually agree on the choice $X$. It turns out that no such algorithm exists. More
generally, in a network with $N$ agents, no consensus protocol exists if there are at least $N/3$ dishonest
agents.
We do not prove that no algorithm exists. Instead, we illustrate the difficulty by considering a
sensible algorithm and showing that it does not solve the problem. In the first round of the algorithm,
the honest agents report their choice and in the second round report what they heard from the other
agents. We show that this algorithm does not solve the consensus problem. To show this, we consider
three different executions of the algorithm and show that two honest agents may settle on different
values (0 for one and 1 for the other), which violates the objective of the consensus algorithm.
Table 12.1 shows the case when A and B are honest and start with the choice 1 while C is
dishonest and starts with the choice 0. The table shows the messages that the nodes send to each
other in the two rounds. Note that in the second round C tells A that B told him that his choice
was 0 (while in fact B reported his choice to be 1 in the first round).

Table 12.1: Case A(1), B(1), C(0) when C lies to A in the second round.
            Round 1, sent to      Round 2, sent to
Sender      A     B     C         A       B       C
A(1)        -     1     1         -       C = 0   B = 1
B(1)        1     -     1         C = 0   -       A = 1
C(0)        0     0     -         B = 0   A = 1   -
Table 12.2 considers the case when A is dishonest and starts with the choice 1 while B and
C are honest and start with the choice 0. In the second round, A lies to C by reporting that B told
him that his choice was 1 in the first round (instead of 0).

Table 12.2: Case A(1), B(0), C(0) when A lies to C in the second round.
            Round 1, sent to      Round 2, sent to
Sender      A     B     C         A       B       C
A(1)        -     1     1         -       C = 0   B = 1
B(0)        0     -     0         C = 0   -       A = 1
C(0)        0     0     -         B = 0   A = 1   -
Finally, Table 12.3 considers the case where B is dishonest and starts with the choice 1 while
A and C are honest and start with the choices 1 and 0, respectively. In this case, B lies to C in the
first round and all other messages are truthful.

Table 12.3: Case A(1), B(1), C(0) when B lies to C in the first round.
            Round 1, sent to      Round 2, sent to
Sender      A     B     C         A       B       C
A(1)        -     1     1         -       C = 0   B = 1
B(1)        1     -     0         C = 0   -       A = 1
C(0)        0     0     -         B = 0   A = 1   -
To see why this algorithm cannot solve the consensus problem, note that in the first case A
and B should end up agreeing with the choice 1, in the second case, B and C should agree with 0.
Since A has the same information in the first and the third cases, he cannot distinguish between
them. Similarly, C cannot distinguish between the second and the third cases. Since A must settle
on 1 in the first case, he would settle on 1 in the third case as well. Similarly, since C must settle on
0 in the second case, he would settle on 0 in the third case as well. Thus, in the third case, A and C
end up settling on different values.
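The indistinguishability argument can be checked mechanically by encoding the three tables and comparing what each agent receives. The encoding below (rows are senders, columns are receivers, second-round entries written as strings such as "B=0") and the helper names `table`, `received`, and `view` are our own.

```python
AGENTS = ("A", "B", "C")

def table(rows):
    """Turn {sender: [msg to A, msg to B, msg to C]} into {(sender, receiver): msg}."""
    return {(s, r): m for s, row in rows.items()
            for r, m in zip(AGENTS, row) if m is not None}

# Round-2 entries read "X=v": "X told me v". They are identical in all three tables.
ROUND2 = table({"A": [None, "C=0", "B=1"],
                "B": ["C=0", None, "A=1"],
                "C": ["B=0", "A=1", None]})

ROUND1 = {
    "12.1": table({"A": [None, 1, 1], "B": [1, None, 1], "C": [0, 0, None]}),
    "12.2": table({"A": [None, 1, 1], "B": [0, None, 0], "C": [0, 0, None]}),
    "12.3": table({"A": [None, 1, 1], "B": [1, None, 0], "C": [0, 0, None]}),
}

def received(msgs, agent):
    """Messages addressed to `agent`, indexed by sender."""
    return {s: m for (s, r), m in msgs.items() if r == agent}

def view(case, agent):
    """Everything `agent` receives in the two rounds of the given case."""
    return received(ROUND1[case], agent), received(ROUND2, agent)

a_same = view("12.1", "A") == view("12.3", "A")   # A cannot tell case 1 from case 3
c_same = view("12.2", "C") == view("12.3", "C")   # C cannot tell case 2 from case 3
```

Both comparisons hold, and A and C also start with the same own choice (1 and 0, respectively) in the cases being compared, which is exactly what the argument requires.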
12.6 SOURCE COMPRESSION
Source compression is a method that reduces the number of bits required to encode a data file. One
commonly used method is based on the Lempel-Ziv algorithm. To explain how the algorithm works,
imagine that you want to compress a book. As you read the book, you make a list of the sentences
that you find in the book, say of up to ten words. As you go along, you replace a sentence by a pointer
to that sentence in your list. Eventually, it may be that you end up mostly with pointers. The idea is
that the pointers take fewer bits than the sentences they point to.
The effectiveness of the scheme depends on the redundancy in the text. To appreciate this,
consider using the same method for a binary file. Assume that you maintain a list of strings of 20
bits. If all the strings appear in the file, then you need 20-bit pointers to identify the strings. As a
result, you replace a string of 20 bits by a 20-bit pointer, which does not achieve any compression.
Now assume that there is quite a bit of redundancy so that you need only 8 bits to encode the $2^8$
20-bit strings that show up in the file. In that case, each 20-bit string gets replaced by an 8-bit
pointer, which achieves some significant compression.
This method is effective on binary strings corresponding to files with redundancy, such as
text. For video and audio, different compression methods are used that exploit the specific features
of such files.
12.7 SUMMARY
This chapter reviews a few of the many additional topics related to networks.
An Overlay Network is built on top of another network. Examples include peer-to-peer networks
such as BitTorrent and content distribution networks such as Akamai. Two main issues
are how to find resources and how to route.
A wireless sensor network is a collection of nodes equipped with sensors and radio transceivers.
The nodes communicate in an ad hoc way. Issues include low-energy MAC protocols, location,
routing, and reliability.
Routers in the Internet and nodes in ad hoc networks implement distributed routing protocols.
Hosts in various networks may also implement various distributed algorithms. We discussed
convergence properties of distributed applications such as Bellman-Ford, TCP, and power
control.
We discussed the impossibility results for Byzantine agreement problems.
Source compression enables economical storage and delivery of files, audio, and video streams.
The chapter explains the Lempel-Ziv algorithm for file compression.
12.8 REFERENCES
For an overview of overlay networks, see (3). Distributed algorithms are analyzed in (32). See also (9)
for recent results and (7) for a survey of some key issues. An animation of the Lempel-Ziv
algorithm can be found at (29). The distributed power control algorithm is from (16).
Bibliography
[1] N. Abramson and F.F. Kuo (Editors). Computer communication networks (Prentice-Hall, 1973)
44
[2] R. Ahlswede, N. Cai, S. R. Li, and R. W. Yeung, Network Information Flow, IEEE Trans-
actions on Information Theory, 2000. DOI: 10.1109/18.850663 76
[3] D.G. Andersen, Overlay Networks: Networking on Top of the Network, Computing Reviews,
2004. http://www.reviews.com/hottopic/hottopic_essay_01.cfm 161
[4] J. Andrews, A. Ghosh and R. Muhamed. Fundamentals of WiMAX: Understanding Broadband
Wireless Networking. Prentice Hall, 2007. 118, 123
[5] Paul Baran, On Distributed Communications Networks, IEEE Transactions on Communications
Systems, March 1964.
http://ieeexplore.ieee.org/search/wrapper.jsp?arnumber=1088883.
DOI: 10.1109/TCOM.1964.1088883 7
[6] Richard Bellman, On a Routing Problem, Quarterly of Applied Mathematics, 16(1), pp. 87-90,
1958. 76
[7] Dimitri P. Bertsekas and John N. Tsitsiklis, Some Aspects of Parallel and Dis-
tributed Iterative Algorithms - A Survey, IFAC/IMACS Symposium on Distributed
Intelligence Systems, 1988. http://www.mit.edu:8001/people/dimitrib/Some_Aspec
DOI: 10.1016/0005-1098(91)90003-K 161
[8] G. Bianchi, Performance Analysis of the IEEE 802.11 Distributed Coordination Function,
IEEE Journal on Selected Areas in Communications, Vol. 18, No. 3, March 2000. 49, 50, 51, 52,
53, 57, 58
[9] Vincent D. Blondel, Julien M. Hendrickx, Alex Olshevsky, and John N. Tsitsiklis, Convergence
in multiagent coordination, consensus, and flocking, Proceedings of the Joint
44th IEEE Conference on Decision and Control and European Control Conference, 2005.
DOI: 10.1109/CDC.2005.1582620 161
[10] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. 111,
112
[11] D.-M. Chiu and R. Jain, Networks with a connectionless network layer; part iii: Analysis of the
increase and decrease algorithms, Tech. Rep. DEC-TR-509, Digital Equipment Corporation,
Stanford, CA, Aug. 1987. 7, 100
[12] D. Chiu and R. Jain, Analysis of the Increase/Decrease Algorithms for Congestion Avoidance
in Computer Networks, Journal of Computer Networks and ISDN, Vol. 17, No. 1,
June 1989, pp. 1-14. http://www.cse.wustl.edu/jain/papers/ftp/cong_av.pdf
DOI: 10.1016/0169-7552(89)90019-6 7, 100
[13] G. Di Caro and M. Dorigo, AntNet: Distributed Stigmergetic Control for Communications
Networks, Journal of Artificial Intelligence Research, 1998. 76
[14] E. W. Dijkstra, A note on two problems in connexion with graphs, Numerische Mathematik,
1 (1959), S. 269-271. DOI: 10.1007/BF01386390 76
[15] H. Ekstrom, QoS Control in the 3GPP Evolved Packet System, IEEE Communications
Magazine, Vol. 47, No. 2, February 2009. DOI: 10.1109/MCOM.2009.4785383 122, 123
[16] G Foschini and Z Miljanic, A simple distributed autonomous power control algorithm and
its convergence, IEEE Trans. Vehicular Technology, 1993. DOI: 10.1109/25.260747 161
[17] M. Gast, 802.11 Wireless Networks: The Definitive Guide, Second Edition. O'Reilly, 2005. 48,
49, 58
[18] R. Gibbens and F.P. Kelly, Measurement-Based Connection Admission Control, 5th Int.
Teletraffic Congr., June 1997 134
[19] Timothy G. Griffin and Gordon Wilfong, An Analysis of BGP Convergence Properties,
SIGCOMM '99. DOI: 10.1145/316194.316231 76
[20] P. Hoel, S. Port, and C. Stone. Introduction to Probability Theory. Houghton-Mifflin Harcourt,
1977. 44
[21] P. Hoel, S. Port, and C. Stone. Introduction to Stochastic Processes. Houghton-Mifflin Harcourt,
1972. 55
[22] Van Jacobson, Congestion Avoidance and Control, SIGCOMM '88. Later rewritten with M.
Karels. http://ee.lbl.gov/papers/congavoid.pdf 7, 100
[23] Libin Jiang and J. Walrand, A Distributed CSMA Algorithm for Throughput and Utility Maximization
in Wireless Networks, Allerton 2008. DOI: 10.1109/ALLERTON.2008.4797741
[24] Kelly, F.P., Maulloo, A. and Tan, D. (1998) Rate control for communication networks: shadow
prices, proportional fairness and stability, Journal of the Operational Research Society, Vol.49,
No.3, pp.237-252.
112
[25] Leonard Kleinrock. Message Delay in Communication Nets with Stor-
age (PhD thesis). Cambridge: Massachusetts Institute of Technology, 1964.
http://dspace.mit.edu/bitstream/handle/1721.1/11562/33840535.pdf 7
[26] Bob Kahn and Vint Cerf, A Protocol for Packet Network Intercommunication, IEEE Transactions
of Communications Technology, May 1974. 7
[27] S. Katti, H. Rahul, W. Hu, D. Katabi, M. Medard, J. Crowcroft, XORs in The Air: Practical
Wireless Network Coding, SIGCOMM '06, September 11-15, 2006, Pisa, Italy.
DOI: 10.1145/1151659.1159942 76
[28] Kleinrock, L. and R. Muntz, Multilevel Processor-Sharing Queueing Models for Time-
Shared Systems, Proceedings of the Sixth International Teletraffic Congress, Munich, Germany,
pp. 341/1-341/8, September 1970. 134
[29] Animation of the Lempel-Ziv Algorithm. Data-Compression.com.
http://www.data-compression.com/lempelziv.html 161
[30] Steven H. Low and David E. Lapsley, Optimization Flow Control, I: Basic Algorithm and
Convergence, IEEE/ACM Transactions on Networking, 1999. DOI: 10.1109/90.811451 112
[31] Little, J. D. C., A Proof of the Queueing Formula L = λW, Operations Research, 9, 383-387
(1961).
http://www.doc.ic.ac.uk/uh/for-paul/
A%20PROOF%20FOR%20THE%20QUEUING%20FORMULA%20L=lW.%207689045.pdf
DOI: 10.1287/opre.9.3.383 25
[32] Nancy Lynch. Distributed Algorithms. Morgan Kaufmann, 1996. 161
[33] Robert M. Metcalfe and David R. Boggs, Ethernet: Distributed Packet Switching for
Local Computer Networks, Xerox PARC Report CSL-75-7, May 1975, reprinted Febru-
ary 1980. A version of this paper appeared in Communications of the ACM, vol. 19
no. 7, July 1976. http://ethernethistory.typepad.com/papers/EthernetPaper.pdf
DOI: 10.1145/360248.360253 44
[34] J. Mo and J. Walrand, Fair End-to-End Window-based Congestion Control, IEEE/ACM
Trans. Networking 8, 5, Pages 556 - 567, October 2000. DOI: 10.1109/90.879343 112
[35] H. Myung, J. Lim, and D. Goodman, Single Carrier FDMA for Uplink
Wireless Transmission, IEEE Vehicular Technology Magazine, September 2006.
DOI: 10.1109/MVT.2006.307304 120, 123
[36] J. Nagle, On packet switches with infinite storage, IEEE Trans. on Comm., 35(4), April 1987.
DOI: 10.1109/TCOM.1987.1096782 134
[37] M.J. Neely and E. Modiano and C-P. Li, Fairness and Optimal Stochastic
Control for Heterogeneous Networks, Proceedings of IEEE INFOCOM, 2005.
DOI: 10.1109/INFCOM.2005.1498453 112
[38] W. B. Norton, Internet Service Providers and Peering, 2000.
http://www.cs.ucsd.edu/classes/wi01/cse222/papers/norton-isp-draft00.pdf
76
[39] A. K. Parekh and R. G. Gallager. A generalized processor sharing approach to flow control in
integrated service networks: The single node case. IEEE/ACM Transactions on Networking,
1(3), June 1993. DOI: 10.1109/90.234856 134
[40] J. Proakis. Digital Communications. McGraw-Hill, 2000. 145
[41] A. Puri and S. Tripakis, Algorithms for Routing with Multiple Constraints, In AIPS02
Workshop on Planning and Scheduling using Multiple Criteria, 2002 76
[42] R. Ramaswami and K. N. Sivarajan. Optical Networks - A Practical Perspective. Second Edition.
Morgan Kaufmann, 2000. 145
[43] Saltzer, J., Reed, D., and Clark, D.D., End-to-End Arguments in System Design, Sec-
ond International Conference on Distributed Computing Systems (April 1981) pages 509-
512, ACM Transactions on Computer Systems, 1984, Vol. 2, No. 4, November, pp. 277-288
DOI: 10.1145/357401.357402 25
[44] Shannon, C.E. (1948), A Mathematical Theory of Communication,
Bell System Technical Journal, 27, pp. 379-423, 623-656.
http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf
DOI: 10.1145/584091.584093 25
[45] A. Shokrollahi, Raptor Codes, IEEE Transactions on Information Theory, vol. 52, pp. 2551-
2567, 2006 DOI: 10.1109/TIT.2006.874390 76
[46] T. Slater, Queuing Theory Tutor,
http://www.dcs.ed.ac.uk/home/jeh/Simjava/queueing/ 25
[47] Chakchai So-In, Raj Jain, and Abdel-Karim Tamimi, Scheduling in IEEE 802.16e Mobile
WiMAX Networks: Key Issues and a Survey, IEEE Journal on Selected Areas in Communications
(JSAC), Vol. 27, No. 2, February 2009. DOI: 10.1109/JSAC.2009.090207 117, 123
[48] R. Srikant, The Mathematics of Internet Congestion Control. Birkhäuser, 2004. 112
[49] I. Stojmenovic, Position based routing in ad hoc networks, IEEE Communications Magazine
40 (7): 128-134, 2002. DOI: 10.1109/MCOM.2002.1018018 76
[50] L. Tassiulas and A. Ephremides, Stability properties of constrained queueing systems and
scheduling policies for maximum throughput in multihop radio networks, IEEE Transactions
on Automatic Control, 37; 1936-48, 1992. DOI: 10.1109/9.182479 112
[51] Mobile WiMAX Part I: A Technical Overview and Performance Evaluation, WiMAX Fo-
rum Whitepaper.
http://www.wimaxforum.org/technology/downloads/
Mobile_WiMAX_Part1_Overview_and_Performance.pdf 118, 123
[52] http://www.3gpp.org/ftp/Specs/html-info/36-series.htm. 119, 123
[53] 3GPP TS 36.300 V8.8.0, "Overall Description, Stage 2 (Release 8)," March 2009. 119, 120,
121, 123
[54] 3GPP TS 36.211 V8.6.0, "Physical Channels and Modulation (Release 8)," March 2009. 120,
121, 123
[55] 3GPP TS 23.203 V9.0.0, "Policy and Charging Control Architecture (Release 9)," March
2009. 122, 123
[56] IEEE 802.11-1999 Specification. 48, 58
[57] IEEE 802.11a-1999 Specification. 49, 58
[58] IEEE 802.11b-1999 Specification. 49, 58
[59] IEEE 802.16-2004 Standard. 114, 123
[60] IEEE 802.16-2005 Standard. 114, 123
[61] IEEE P802.16Rev2/D3, February 2008. 115, 123
[62] David C. Plummer, An Ethernet Address Resolution Protocol, RFC 826, 1982.
http://www.ietf.org/rfc/rfc826.txt 85
[63] J. Postel (Editor), Transmission Control Protocol, RFC 793, 1981.
http://www.ietf.org/rfc/rfc0793.txt 100
[64] Mogul, J., and J. Postel, Internet Standard Subnetting Procedure, RFC 950, 1985.
http://www.ietf.org/rfc/rfc950.txt 85
[65] P. Mockapetris, Domain names - Concepts and Facilities, RFC 1034, 1987.
http://www.ietf.org/rfc/rfc1034.txt 25
[66] R. Braden (Editor), Requirements for Internet Hosts - Communication Layers, RFC 1122.
http://tools.ietf.org/html/rfc1122 25
[67] V. Fuller, T. Li, J. Yu, and K. Varadhan, Classless Inter-Domain Routing
(CIDR): an Address Assignment and Aggregation Strategy, RFC 1519, 1993.
http://www.ietf.org/rfc/rfc1519.txt 7, 25
[68] Y. Rekhter et al, A Border Gateway Protocol 4 (BGP-4), RFC 1771, 1995.
http://www.ietf.org/rfc/rfc1771.txt 76
[69] M. Mathis et al., TCP Selective Acknowledgment Options, RFC 2018, 1996.
http://www.ietf.org/rfc/rfc2018.txt 92
[70] R. Droms, Dynamic Host Configuration Protocol, RFC 2131, 1997.
http://www.ietf.org/rfc/rfc2131.txt 85
[71] M. Allman et al., TCP Congestion Control, RFC 2581, 1999.
http://www.ietf.org/rfc/rfc2581.txt 100
[72] G. Tsirtsis et al., Network Address Translation - Protocol Translation (NAT-PT), RFC 2766,
2000. http://www.ietf.org/rfc/rfc2766.txt 85
[73] C. Perkins et al., Ad hoc On-Demand Distance Vector (AODV) Routing, RFC 3561, 2003.
http://www.ietf.org/rfc/rfc3561.txt 72, 76
[74] T. Clausen et al., Optimized Link State Routing Protocol (OLSR), RFC 3626, 2003.
http://www.ietf.org/rfc/rfc3626.txt 72, 76
[75] G. Huston et al., Textual Representation of Autonomous System (AS) Numbers, RFC 5396,
2008. http://www.ietf.org/rfc/rfc5396.txt
Authors' Biographies
JEAN WALRAND
Jean Walrand received his Ph.D. in EECS from UC Berkeley, and has been on the faculty of
that department since 1982. He is the author of An Introduction to Queueing Networks (Prentice
Hall, 1988) and of Communication Networks: A First Course (2nd ed., McGraw-Hill, 1998), and
co-author of High-Performance Communication Networks (2nd ed., Morgan Kaufmann, 2000) and of
Scheduling and Congestion Control for Communication and Processing Networks (Morgan & Claypool,
2010). His research interests include stochastic processes, queuing theory, communication networks,
game theory and the economics of the Internet. Prof. Walrand is a Fellow of the Belgian American
Education Foundation and of the IEEE, and a recipient of the Lanchester Prize and of the Stephen
O. Rice Prize.
SHYAM PAREKH
Shyam Parekh received his Ph.D. in EECS from UC Berkeley in 1986, and is currently a Distin-
guished Member of Technical Staff in the Network Performance & Reliability department at Bell
Labs, Alcatel-Lucent. He has previously worked at AT&T Labs, TeraBlaze and Tidal Networks, and
has an ongoing affiliation for research and teaching with the EECS department at UC Berkeley. He
is co-editor of Quality of Service Architectures for Wireless Networks (Information Science Reference,
2010). His research interests include architecture, modeling and analysis of both wired and wireless
networks. He was co-chair of the Application Working Group of the WiMAX Forum during 2008.
Index
CW_max, 46
CW_min, 46
α-Fair, 103
3GPP, 113
Access Point (AP), 45
Ad Hoc Network, 72
Address
    Broadcast, 31
    Ethernet, 31
    Group, 31
    IP, 2
    MAC, 31
    Private, 82
    Translation, 82
    WiFi, 48
Address Resolution Protocol (ARP), 79
AIMD, 93
Aloha, 35
    Non-Slotted, 35
    Pure, 35
    Time-Slotted, 35
Aloha Network, 28
Ant Routing, 73
Antenna Length, 137
Anycast, 22, 67
AODV, 72
ARQ, 4
Automatic Retransmission Request, 4
Autonomous Systems, 59
Backoff Time, 29
Backpressure Algorithm, 109
Backpressure Routing, 73
Bandwidth, 11
Basic Service Set (BSS), 45
Bellman-Ford Algorithm, 65
BGP, 61
    Oscillations, 62
Border Gateway Protocol (BGP), 61
Bottleneck Link, 12
BPSK, 138
Broadband, 10
Broadcast, 67
Broadcast Address, 31
Broadcast MAC Address, 79
Byzantine Agreement, 157
Carrier Sensing, 29
CDMA, 155
Cellular System, 139
CIDR, 3
Circuit Switching, 1
Client/Server Model, 21
Cloud Computing, 22
Collision, 28
Collision Detection, 29
Complementary Slackness Conditions, 105
Congestion Avoidance, 96
Congestion Control, 5
Congestion Loss, 4
Content Distribution Network (CDN), 147
Content Distribution System, 22
Cross Layer Designs, 101
CSMA/CA, 47
CTS, 45
Delay Jitter, 13
DHCP, 82
DIFS, 46
Digital Communication Link, 135
Dijkstra's Algorithm, 64
Discriminator, 63
Distributed Coordination Function (DCF), 45
DNS, 5, 19
DNS Servers, 5
Domains, 59
DSL, 139
Dual Problem, 105
EIFS, 46
Encapsulation, 80
End-to-End Admission Control, 131
End-to-End Principle, 19
Error Checksum, 4
Error Detection, 4
Ethernet, 29
Ethernet Switch, 31
Exponential backoff, 29
Exposed Terminal Problem, 47
Extended Service Set, 49
Fairness, 102
Fast Recovery, 94
Fast Retransmit, 94
Fiber
    Single Mode, 141
    Step Index, 141
Flow Control, 5, 97
Forward Error Correction, 69
Forwarding Table, 2
Frequency, 10
Gateway Router, 79
Generalized Processor Sharing (GPS), 127
Geographic Routing, 73
Go Back N, 90
Gradient Algorithm, 105
Group Multicast Address, 31
Handovers, 118
Hertz, 10
Hidden Terminal Problem, 47
Host Names, 5
HTTP, 6
Hub, 27
Hub Ethernet, 30
    Analysis, 36, 41
Independent Events, 38
Inter-Domain Routing, 59
Internet Hosts, 1
Intra-Domain Routing, 59
IP Address, 2
IP Fragmentation, 82
ISPs, 60
Lagrange Multipliers, 105
Lagrangian, 105
Layers, 20
Leaky Bucket, 126
Learning (Ethernet Switch), 32
Lempel-Ziv Algorithm, 160
Link, 1
Link Characteristics, 136
Link State Algorithm, 64
Little's Result, 15
Longest Prefix Match, 3
LTE, 113
M/M/1 Queue, 13
MAC, 31
    Address, 31
Markov Chain, 54
Max-Min Fair, 103
Multicast, 22, 67
Multiplex, 88
Multiplexing Gain, 10
Narrowband, 10
NAT, 82
Network Allocation Vector (NAV), 48
Network Coding, 70
Network Neutrality, 131
OFDM, 49, 114, 140
OFDMA, 114, 140
OLSR, 72
OOK, 142
Optical Burst Switching, 143
Optical Link, 141
Optical Switching, 143
Overlay Network, 147
P2P, 21
P2P Network, 147
Packet, 1
Packet Erasure Code, 69
Packet Switching, 1
Parity Bit, 4
Path Vector Algorithm, 61
Peer-to-Peer Network, 147
Peering, 60
Physical Layer (Ethernet), 32
PLCP, 50
PON, 144
Port
    TCP, 82
Ports, 27
Preamble (Ethernet), 32
Prefix, 3
Primal Problem, 104
Private Addresses, 82
Probability, 37
    Additivity, 38
Pull, 23
Pure Aloha
    Analysis, 39
Push, 23
QAM, 139
QPSK, 138
Queuing Time, 14
Randomized Multiple Access, 28
Rate, 10
Receiver Advertised Window, 97
Round-Trip Time (RTT), 12
Router Port, 2
Routers, 1
Routing, 2
Routing Tables, 2
RTS, 45
Scalability, 59
Selective Acknowledgment, 91
Sensor Network, 151
Server Farm, 21
Shannon Capacity, 11
Shared Multicast Tree, 68
Shortest-Path Algorithm, 18
SIFS, 46
Signal-to-Noise Ratio, 11
Single Mode Fiber, 141
SINR, 156
Skin Effect, 137
Slot
    WiFi, 46
Slotted Aloha
    Analysis, 39
Slow Start, 96
Small World Topology, 59
SNR, 11
Source Compression, 160
Spanning Tree Protocol, 33
Start of Frame Delimiter, 32
Stateless Routers, 19
Steiner Tree, 68
Step Index Fiber, 141
Step Size, 106
Stop and Wait, 90
Subnet Mask, 79
Subnets, 78
Switch, 27
    Ethernet, 31
Switched Ethernet, 31, 32
TCP, 88
    New Reno, 96
    Reno, 96
    SACK, 96
    Tahoe, 96
TCP Header, 88
TCP Port, 82
TCP SACK, 91
TCP Timers, 92
Throughput, 11
Time Slot (Ethernet), 29
Tit-for-Tat, 151
Token Counter, 126
Transit, 60
Transmission Error, 4
Transport Layer Ports, 87
Tunnels, 147
Two-Level Routing, 59
UDP, 88
UDP Header, 88
Unicast, 67
Unlicensed Band, 45
URL, 6
Utilization, 13
Virtual Carrier Sensing, 48
VLAN, 33
Waiting for Success, 40
WDM, 142
Weighted Fair Queuing (WFQ), 127
WiFi, 45
WiMAX, 113
WWW - World Wide Web, 6
Zones, 5