Computer Networks Basics
Olivier Bonaventure
Contents
Table of Contents
1.1 Preface
Part 1: Principles
2.1 Connecting two hosts
2.2 Building a network
2.3 Applications
2.4 The transport layer
2.5 Naming and addressing
2.6 Sharing resources
2.7 The reference models
Part 2: Protocols
3.1 The application layer
3.2 The Domain Name System
3.3 Electronic mail
3.4 The HyperText Transfer Protocol
3.5 Remote Procedure Calls
3.6 Internet transport protocols
3.7 The User Datagram Protocol
3.8 The Transmission Control Protocol
3.9 The Stream Control Transmission Protocol
3.10 Congestion control
3.11 The network layer
3.12 The IPv6 subnet
3.13 Routing in IP networks
3.14 Intradomain routing
3.15 Interdomain routing
3.16 Datalink layer technologies
Part 3: Practice
Appendices
5.1 Glossary
5.2 Bibliography
5.3 Indices and tables
Bibliography
Index
CHAPTER 1
Table of Contents
1.1 Preface
This is the current draft of the second edition of Computer Networking : Principles, Protocols and Practice.
The document is updated every week.
The first edition of this ebook was written by Olivier Bonaventure. Laurent Vanbever, Virginie Van den
Schriek, Damien Saucez and Mickael Hoerdt have contributed to exercises. Pierre Reinbold designed the icons
used to represent switches and Nipaul Long has redrawn many figures in the SVG format. Stephane Bortzmeyer
sent many suggestions and corrections to the text. Additional information about the textbook is available at
http://inl.info.ucl.ac.be/CNP3
Note: Computer Networking : Principles, Protocols and Practice, (c) 2011, Olivier Bonaventure, Université
catholique de Louvain (Belgium) and the collaborators listed above, used under a Creative Commons Attribution
(CC BY) license made possible by funding from The Saylor Foundation's Open Textbook Challenge in order to
be incorporated into Saylor.org's collection of open courses available at http://www.saylor.org. Full license terms
may be viewed at : http://creativecommons.org/licenses/by/3.0/
CHAPTER 2
Part 1: Principles
To enable the two hosts to exchange information, they need to be linked together by some kind of physical media.
Computer networks have used various types of physical media to exchange information, notably :
• electrical cable. Information can be transmitted over different types of electrical cables. The most common
ones are the twisted pairs (that are used in the telephone network, but also in enterprise networks) and the
coaxial cables (that are still used in cable TV networks, but are no longer used in enterprise networks).
Some networking technologies operate over the classical electrical cable.
• optical fiber. Optical fibers are frequently used in public and enterprise networks when the distance between
the communication devices is larger than one kilometer. There are two main types of optical fibers :
multimode and monomode. Multimode fiber is much cheaper than monomode fiber because an LED can be
used to send a signal over a multimode fiber while a monomode fiber must be driven by a laser. Due to the
different modes of propagation of light, multimode fibers are limited to distances of a few kilometers while
monomode fibers can be used over distances greater than several tens of kilometers. In both cases, repeaters
can be used to regenerate the optical signal at one endpoint of a fiber to send it over another fiber.
• wireless. In this case, a radio signal is used to encode the information exchanged between the communicating
devices. Many types of modulation techniques are used to send information over a wireless channel
and there is a lot of innovation in this field with new techniques appearing every year. While most wireless
networks rely on radio signals, some use a laser that sends light pulses to a remote detector. These optical
techniques allow point-to-point links to be created while radio-based techniques, depending on the directionality
of the antennas, can be used to build networks containing devices spread over a small geographical area.
To understand some of the principles behind the physical transmission of information, let us consider the simple
case of an electrical wire that is used to transmit bits. Assume that the two communicating hosts want to transmit
one thousand bits per second. To transmit these bits, the two hosts can agree on the following rules :
• On the sender side :
  - set the voltage on the electrical wire at +5V for one millisecond to transmit a bit set to 1
  - set the voltage on the electrical wire at -5V for one millisecond to transmit a bit set to 0
• On the receiver side :
  - every millisecond, record the voltage applied on the electrical wire. If the voltage is set to +5V,
    record the reception of bit 1. Otherwise, record the reception of bit 0
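The rules above can be sketched in a few lines of Python. This is only an illustration: the names HIGH, LOW, encode and decode are hypothetical, and each list element stands for the voltage held on the wire during one millisecond.

```python
HIGH, LOW = +5.0, -5.0  # voltages agreed on by the two hosts (from the rules above)

def encode(bits):
    """Sender side: one voltage level per millisecond, +5V for 1 and -5V for 0."""
    return [HIGH if bit == 1 else LOW for bit in bits]

def decode(voltages):
    """Receiver side: +5V is recorded as bit 1, anything else as bit 0."""
    return [1 if voltage == HIGH else 0 for voltage in voltages]

if __name__ == "__main__":
    print(decode(encode([1, 0, 1, 1])))
    # A level disturbed on the wire (here 4.2V instead of +5V) is read as 0.
    print(decode([4.2, LOW]))
```

Applying the receiver rule literally means that any disturbed voltage is decoded as 0, which is one reason why this simple scheme is fragile.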
This transmission scheme has been used in some early networks. We use it as a basis to understand how hosts communicate. From a Computer Science viewpoint, dealing with voltages is unusual. Computer scientists frequently
rely on models that enable them to reason about the issues that they face without having to consider all implementation details. The physical transmission scheme described above can be represented by using a time-sequence
diagram.
A time-sequence diagram describes the interactions between communicating hosts. By convention, the communicating hosts are represented in the left and right parts of the diagram while the electrical link occupies the middle
of the diagram. In such a time-sequence diagram, time flows from the top to the bottom of the diagram. The transmission of one bit of information is represented by three arrows. Starting from the left, the first horizontal arrow
represents the request to transmit one bit of information. This request is represented by using a primitive which can
be considered as a kind of procedure call. This primitive has one parameter (the bit being transmitted) and a name
(DATA.request in this example). By convention, all primitives that are named something.request correspond to a
request to transmit some information. The dashed arrow indicates the transmission of the corresponding electrical
signal on the wire. Electrical and optical signals do not travel instantaneously. The diagonal dashed arrow indicates that it takes some time for the electrical signal to be transmitted from Host A to Host B. Upon reception of the
electrical signal, the electronics on Host B's network interface detects the voltage and converts it into a bit. This
bit is delivered as a DATA.indication primitive. All primitives that are named something.indication correspond
to the reception of some information. The dashed lines also represent the relationship between two (or more)
primitives. Such a time-sequence diagram provides information about the ordering of the different primitives, but
the distance between two primitives does not represent a precise amount of time.
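[Figure: time-sequence diagram with Host A on the left, the physical link in the middle and Host B on the right. Host A issues DATA.req(0), the bit 0 travels over the link and Host B receives DATA.ind(0).]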
Time-sequence diagrams are useful when trying to understand the characteristics of a given communication
scheme. When considering the above transmission scheme, it is useful to evaluate whether this scheme allows
the two communicating hosts to reliably exchange information. A digital transmission will be considered as
reliable when a sequence of bits that is transmitted by a host is received correctly at the other end of the wire. In
practice, achieving perfect reliability when transmitting information using the above scheme is difficult. Several
problems can occur with such a transmission scheme.
The first problem is that electrical transmission can be affected by electromagnetic interference. This interference
can have various sources, including natural phenomena such as thunderstorms and variations of the magnetic
field, but it can also be caused by other electrical signals, such as those carried by neighboring
cables or emitted by neighboring antennas. Due to all this interference, there is unfortunately no guarantee
that when a host transmits one bit on a wire, the same bit is received at the other end. This is illustrated in the
figure below where a DATA.request(0) on the left host leads to a DATA.indication(1) on the right host.
[Figure: time-sequence diagram with Host A, the physical link and Host B. Host A issues DATA.req(0) but Host B receives DATA.ind(1).]
With the above transmission scheme, a bit is transmitted by setting the voltage on the electrical cable to a specific
value during some period of time. We have seen that due to electromagnetic interference, the voltage measured
by the receiver can differ from the voltage set by the transmitter. This is the main cause of transmission errors.
However, this is not the only type of problem that can occur. Besides defining the voltages for bits 0 and 1, the
above transmission scheme also specifies the duration of each bit. If one million bits are sent every second, then
each bit lasts 1 microsecond. On each host, the transmission (resp. the reception) of each bit is triggered by a local
clock having a 1 MHz frequency. These clocks are the second source of problems when transmitting bits over
a wire. Although the two clocks have the same specification, they run on different hosts, possibly at a different
temperature and with a different source of energy. In practice, it is possible that the two clocks do not operate at
exactly the same frequency. Assume that the clock of the transmitting host operates at exactly 1000000 Hz while
the receiving clock operates at 999999 Hz. This is a very small difference between the two clocks. However,
when using the clock to transmit bits, this difference is important. With its 1000000 Hz clock, the transmitting
host will generate one million bits during a period of one second. During the same period, the receiving host
will sense the wire 999999 times and thus will receive one bit less than the bits originally transmitted. This small
difference in clock frequencies implies that bits can disappear during their transmission on an electrical cable.
This is illustrated in the figure below.
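[Figure: time-sequence diagram with Host A, the physical link and Host B. Host A issues DATA.req(0), DATA.req(0) and DATA.req(1), but Host B only receives DATA.ind(0) and DATA.ind(1): the second bit is lost.]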
A similar reasoning applies when the clock of the sending host is slower than the clock of the receiving host. In
this case, the receiver will sense more bits than the bits that have been transmitted by the sender. This is illustrated
in the figure below where the second bit received on the right was not transmitted by the left host.
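[Figure: time-sequence diagram with Host A, the physical link and Host B. Host A issues DATA.req(0) and DATA.req(1), but Host B receives DATA.ind(0), DATA.ind(0) and DATA.ind(1): the second bit received was never transmitted.]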
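The effect of two slightly different clocks can be reproduced with a small simulation. The sketch below is a simplified model, not a description of real hardware: the function name sampled_bit_indices is hypothetical, the receiver is assumed to sample the wire in the middle of each of its own clock periods, and small frequencies (10 Hz and 9 Hz) are used instead of 1 MHz so that the result is easy to read.

```python
from fractions import Fraction

def sampled_bit_indices(tx_hz, rx_hz, duration_s=1):
    """For each receiver sample taken during duration_s seconds, return the
    index of the sender bit that is on the wire at that instant."""
    indices = []
    for k in range(rx_hz * duration_s):
        # Middle of the receiver's k-th clock period (exact rational arithmetic).
        t = Fraction(2 * k + 1, 2 * rx_hz)
        # The sender puts bit number floor(t * tx_hz) on the wire at time t.
        indices.append(int(t * tx_hz))
    return indices

if __name__ == "__main__":
    # Sender faster than receiver: one sender bit is never sampled.
    print(sampled_bit_indices(10, 9))
    # Sender slower than receiver: one sender bit is sampled twice.
    print(sampled_bit_indices(9, 10))
```

With a 10 Hz sender and a 9 Hz receiver, bit number 4 is skipped. With the clocks swapped, bit number 4 is seen twice.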
From a Computer Science viewpoint, the physical transmission of information through a wire is often considered
as a black box that allows bits to be transmitted. This black box is often referred to as the physical layer service
and is represented by using the DATA.request and DATA.indication primitives introduced earlier. This physical
layer service facilitates the sending and receiving of bits. This service abstracts the technological details that are
involved in the actual transmission of the bits as an electromagnetic signal. However, it is important to remember
that the physical layer service is imperfect and has the following characteristics :
• the Physical layer service may change, e.g. due to electromagnetic interference, the value of a bit being
transmitted
• the Physical layer service may deliver more bits to the receiver than the bits sent by the sender
• the Physical layer service may deliver fewer bits to the receiver than the bits sent by the sender
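These three characteristics can be captured by a toy model of the physical layer service. The sketch below is purely illustrative: the function imperfect_channel and its probabilities p_flip, p_drop and p_insert are assumptions introduced here, not parameters of any real technology.

```python
import random

def imperfect_channel(bits, rng, p_flip=0.0, p_drop=0.0, p_insert=0.0):
    """Deliver bits through a channel that may flip, drop or insert bits."""
    received = []
    for bit in bits:
        if rng.random() < p_drop:
            continue                      # fewer bits delivered than sent
        if rng.random() < p_flip:
            bit ^= 1                      # value of the bit changed in transit
        received.append(bit)
        if rng.random() < p_insert:
            received.append(rng.randint(0, 1))  # more bits delivered than sent
    return received

if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so that runs are repeatable
    print(imperfect_channel([0, 1, 1, 0, 1, 0, 0, 1], rng,
                            p_flip=0.1, p_drop=0.1, p_insert=0.1))
```

Such a model is convenient for testing the framing and reliability mechanisms discussed later without any hardware.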
Many other types of encodings have been defined to transmit information over an electrical cable. All physical
layers are able to send and receive physical symbols that represent values 0 and 1. However, for various reasons
that are outside the scope of this chapter, several physical layers exchange other physical symbols as well. For
example, the Manchester encoding used in several physical layers can send four different symbols. The Manchester
encoding is a differential encoding scheme in which time is divided into fixed-length periods. Each period is
divided into two halves and two different voltage levels can be applied. To send a symbol, the sender must set one
of these two voltage levels during each half period. To send a 1 (resp. 0), the sender must set a high (resp. low)
voltage during the first half of the period and a low (resp. high) voltage during the second half. This encoding
ensures that there will be a transition at the middle of each period and allows the receiver to synchronise its clock
to the sender's clock. Apart from the encodings for 0 and 1, the Manchester encoding also supports two additional
symbols : InvH and InvB where the same voltage level is used for the two half periods. By definition, these two
symbols cannot appear inside a frame which is only composed of 0 and 1. Some technologies use these special
symbols as markers for the beginning or end of frames.
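As an illustration, the Manchester rules described above could be written as follows. The function names are hypothetical, voltage levels are represented by the letters "H" and "L", and the mapping of InvH to two high halves and InvB to two low halves is an assumption, since the text only states that both halves use the same level.

```python
HIGH, LOW = "H", "L"  # the two voltage levels, one letter per half period

def manchester_encode(bits):
    """1 is sent as high then low, 0 as low then high."""
    halves = []
    for bit in bits:
        halves += [HIGH, LOW] if bit == 1 else [LOW, HIGH]
    return halves

def manchester_decode(halves):
    """Decode pairs of half periods into 0, 1, "InvH" or "InvB"."""
    symbols = []
    for first, second in zip(halves[0::2], halves[1::2]):
        if (first, second) == (HIGH, LOW):
            symbols.append(1)
        elif (first, second) == (LOW, HIGH):
            symbols.append(0)
        elif first == HIGH:
            symbols.append("InvH")  # assumption: same high level in both halves
        else:
            symbols.append("InvB")  # assumption: same low level in both halves
    return symbols

if __name__ == "__main__":
    print(manchester_encode([1, 0, 0]))
    print(manchester_decode(["H", "H", "H", "L", "L", "L"]))
```

Every data symbol contains a transition in the middle of its period, which is what lets the receiver realign its clock.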
same transmission medium to exchange bits. Being able to exchange bits is important as virtually any information
can be encoded as a sequence of bits. Electrical engineers are used to processing streams of bits, but computer
scientists usually prefer to deal with higher level concepts. A similar issue arises with file storage. Storage devices
such as hard-disks also store streams of bits. There are hardware devices that process the bit stream produced by
a hard-disk, but computer scientists have designed filesystems to allow applications to easily access such storage
devices. These filesystems are typically divided into several layers as well. Hard-disks store sectors of 512 bytes
or more. Unix filesystems group sectors in larger blocks that can contain data or inodes representing the structure
of the filesystem. Finally, applications manipulate files and directories that are translated into blocks, sectors and
eventually bits by the operating system.
Computer networks use a similar approach. Each layer provides a service that is built above the underlying layer
and is closer to the needs of the applications. The datalink layer builds upon the service provided by the physical
layer. We will see that it also contains several functions.
Bit stuffing reserves the 01111110 bit string as the frame boundary marker and ensures that there will never be
six consecutive 1 symbols transmitted by the physical layer inside a frame. With bit stuffing, a frame is sent as
follows. First, the sender transmits the marker, i.e. 01111110. Then, it sends all the bits of the frame and inserts
an additional bit set to 0 after each sequence of five consecutive 1 bits. This ensures that the sent frame never
contains a sequence of six consecutive bits set to 1. As a consequence, the marker pattern cannot appear inside the
frame sent. The marker is also sent to mark the end of the frame. The receiver performs the opposite to decode a
received frame. It first detects the beginning of the frame thanks to the 01111110 marker. Then, it processes the
received bits and counts the number of consecutive bits set to 1. If a 0 follows five consecutive bits set to 1, this bit
is removed since it was inserted by the sender. If a 1 follows five consecutive bits set to 1, it indicates a marker if
it is followed by a bit set to 0. The table below illustrates the application of bit stuffing to some frames.
Original frame               Transmitted frame
0001001001001001001000011    01111110000100100100100100100001101111110
0110111111111111111110010    01111110011011111011111011111011001001111110
01111110                     0111111001111101001111110
For example, consider the transmission of 0110111111111111111110010. The sender will first send the 01111110
marker followed by 011011111. After these five consecutive bits set to 1, it inserts a bit set to 0 followed by 11111.
A new 0 is inserted, followed by 11111. A new 0 is inserted followed by the end of the frame 110010 and the
01111110 marker.
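The bit stuffing procedure can be sketched in Python as shown below. Frames are represented as strings of "0" and "1" characters for readability, the names bit_stuff and bit_unstuff are hypothetical, and the decoder assumes that it is given exactly one frame that was received without errors.

```python
MARKER = "01111110"  # frame boundary marker

def bit_stuff(frame):
    """Insert a 0 after each run of five 1s and surround the frame with markers."""
    out = []
    ones = 0
    for bit in frame:
        out.append(bit)
        if bit == "1":
            ones += 1
            if ones == 5:
                out.append("0")  # stuffed bit
                ones = 0
        else:
            ones = 0
    return MARKER + "".join(out) + MARKER

def bit_unstuff(transmitted):
    """Remove the markers and the stuffed bits from one error-free frame."""
    if not (transmitted.startswith(MARKER) and transmitted.endswith(MARKER)):
        raise ValueError("missing frame marker")
    body = transmitted[len(MARKER):-len(MARKER)]
    out = []
    ones = 0
    skip_next = False
    for bit in body:
        if skip_next:            # this 0 was inserted by the sender
            skip_next = False
            ones = 0
            continue
        out.append(bit)
        if bit == "1":
            ones += 1
            skip_next = ones == 5
        else:
            ones = 0
    return "".join(out)

if __name__ == "__main__":
    sent = bit_stuff("0110111111111111111110010")
    print(sent)
    print(bit_unstuff(sent))
```

Applied to the three original frames of the table above, bit_stuff produces the three transmitted frames listed there.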
Bit stuffing increases the number of bits required to transmit each frame. The worst case for bit stuffing is of course
a long sequence of bits set to 1 inside the frame. If transmission errors occur, stuffed bits or markers can be in
error. In these cases, the frame affected by the error and possibly the next frame will not be correctly decoded by
the receiver, but it will be able to resynchronize itself at the next valid marker.
Bit stuffing can be easily implemented in hardware. However, implementing it in software is difficult given the
complexity of performing bit manipulations in software. Since software implementations prefer to process characters
rather than bits, software-based datalink layers usually use character stuffing. This technique operates on frames that
contain an integer number of characters. In computer networks, characters are usually encoded by relying on
the ASCII table. This table defines the encoding of various alphanumeric characters as a sequence of bits. RFC
20 provides the ASCII table that is used by many protocols on the Internet. For example, the table defines the
following binary representations :
• A : 1000001 b
• 0 : 0110000 b
• z : 1111010 b
• @ : 1000000 b
• space : 0100000 b
In addition, the ASCII table also defines several non-printable or control characters. These characters were designed to allow an application to control a printer or a terminal. These control characters include CR and LF, that
are used to terminate a line, and the BEL character which causes the terminal to emit a sound.
• NUL: 0000000 b
• BEL: 0000111 b
• CR : 0001101 b
• LF : 0001010 b
• DLE: 0010000 b
• STX: 0000010 b
• ETX: 0000011 b
Some characters are used as markers to delineate the frame boundaries. Many character stuffing techniques use
the DLE, STX and ETX characters of the ASCII character set. DLE STX (resp. DLE ETX) is used to mark the
beginning (resp. end) of a frame. When transmitting a frame, the sender adds a DLE character after each transmitted
DLE character. This ensures that none of the markers can appear inside the transmitted frame. The receiver
detects the frame boundaries and removes the second DLE when it receives two consecutive DLE characters. For
example, to transmit frame 1 2 3 DLE STX 4, a sender will first send DLE STX as a marker, followed by 1 2 3
DLE. Then, the sender transmits an additional DLE character followed by STX 4 and the DLE ETX marker.
Original frame       Transmitted frame
1 2 3 4              DLE STX 1 2 3 4 DLE ETX
1 2 3 DLE STX 4      DLE STX 1 2 3 DLE DLE STX 4 DLE ETX
DLE STX DLE ETX      DLE STX DLE DLE STX DLE DLE ETX DLE ETX
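Character stuffing is easy to express with Python byte strings. In the sketch below, the function names char_stuff and char_unstuff are hypothetical, the control characters use their ASCII values, and the decoder assumes a single frame received without errors.

```python
DLE, STX, ETX = b"\x10", b"\x02", b"\x03"  # ASCII control characters

def char_stuff(frame):
    """Double each DLE of the frame and add the DLE STX / DLE ETX markers."""
    return DLE + STX + frame.replace(DLE, DLE + DLE) + DLE + ETX

def char_unstuff(transmitted):
    """Remove the markers and the duplicated DLE characters of one frame."""
    if not (transmitted.startswith(DLE + STX) and transmitted.endswith(DLE + ETX)):
        raise ValueError("missing frame marker")
    body = transmitted[2:-2]
    # Inside a correctly stuffed body, DLE characters always come in pairs.
    return body.replace(DLE + DLE, DLE)

if __name__ == "__main__":
    frame = b"123" + DLE + STX + b"4"
    sent = char_stuff(frame)
    print(sent)
    print(char_unstuff(sent) == frame)
```

The three rows of the table above correspond to the frames b"1234", b"123" + DLE + STX + b"4" and DLE + STX + DLE + ETX.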
Character stuffing, like bit stuffing, increases the length of the transmitted frames. For character stuffing, the worst
frame is a frame containing many DLE characters. When transmission errors occur, the receiver may incorrectly
decode one or two frames (e.g. if the errors occur in the markers). However, it will be able to resynchronise itself
with the next correctly received markers.
Bit stuffing and character stuffing allow frames to be recovered from a stream of bits or bytes. This framing mechanism
provides a richer service than the physical layer. Through the framing service, one can send and receive complete
frames. This framing service can also be represented by using the DATA.request and DATA.indication primitives.
This is illustrated in the figure below, assuming hypothetical frames containing four useful bits and one bit of
framing for graphical reasons.
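[Figure: time-sequence diagram with four columns : Framing-A, Phys-A, Phys-B and Framing-B. A DATA.req(1...1) issued at Framing-A is sent bit by bit through the physical layer as DATA.req(0)/DATA.ind(0), DATA.req(1)/DATA.ind(1), DATA.req(1)/DATA.ind(1) and DATA.req(0)/DATA.ind(0), and is delivered at Framing-B as DATA.ind(1...1).]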
We can now build upon the framing mechanism to allow the hosts to exchange frames containing an integer
number of bits or bytes. Once the framing problem has been solved, we can focus on designing a technique that
allows frames to be exchanged reliably.
Recovering from transmission errors
In this section, we develop a reliable datalink protocol running above the physical layer service. To design this
protocol, we first assume that the physical layer provides a perfect service. We will then develop solutions to
recover from the transmission errors.
The datalink layer is designed to send and receive frames on behalf of a user. We model these interactions by using
the DATA.req and DATA.ind primitives. However, to simplify the presentation and to avoid confusion between a
DATA.req primitive issued by the user of the datalink layer entity, and a DATA.req issued by the datalink layer
entity itself, we will use the following terminology :
• the interactions between the user and the datalink layer entity are represented by using the classical
DATA.req and the DATA.ind primitives
• the interactions between the datalink layer entity and the framing sublayer are represented by using send
instead of DATA.req and recvd instead of DATA.ind
When running on top of a perfect framing sublayer, a datalink entity can simply issue a send(SDU) upon arrival of
a DATA.req(SDU). Similarly, the receiver issues a DATA.ind(SDU) upon receipt of a recvd(SDU). Such a simple
protocol is sufficient when a single SDU is sent. This is illustrated in the figure below.
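[Figure: time-sequence diagram between Host A and Host B. A DATA.req(SDU) on Host A causes the transmission of Frame(SDU), which is delivered on Host B as DATA.ind(SDU).]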
Unfortunately, this is not always sufficient to ensure a reliable delivery of the SDUs. Consider the case where a
client sends tens of SDUs to a server. If the server is faster than the client, it will be able to receive and process all
the frames sent by the client and deliver their content to its user. However, if the server is slower than the client,
problems may arise. The datalink entity contains buffers to store SDUs that have been received as a DATA.request
but have not yet been sent. If the application is faster than the physical link, the buffer may become full. At this
point, the operating system suspends the application to let the datalink entity empty its transmission queue. The
datalink entity also uses a buffer to store the received frames that have not yet been processed by the application.
If the application is slow to process the data, this buffer may overflow and the datalink entity will not be able to
accept any additional frame. The buffers of the datalink entity have a limited size and if they overflow, the arriving
frames will be discarded, even if they are correct.
To solve this problem, a reliable protocol must include a feedback mechanism that allows the receiver to inform
the sender that it has processed a frame and that another one can be sent. This feedback is required even when
there are no transmission errors. To include such a feedback, our reliable protocol must process two types of
frames :
• data frames carrying an SDU
• control frames carrying an acknowledgement indicating that the previous frame was processed correctly
These two types of frames can be distinguished by dividing the frame in two parts :
• the header that contains one bit set to 0 in data frames and set to 1 in control frames
• the payload that contains the SDU supplied by the application
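A minimal sketch of this frame layout is given below. Frames are modelled as strings of "0" and "1" characters, the helper names are hypothetical, and a control frame is assumed to carry an empty payload since the acknowledgement needs no additional information at this stage.

```python
DATA, CONTROL = "0", "1"  # value of the one-bit header

def make_data_frame(sdu_bits):
    """Data frame: header bit 0 followed by the SDU."""
    return DATA + sdu_bits

def make_control_frame():
    """Control frame (acknowledgement): header bit 1, no payload (assumption)."""
    return CONTROL

def parse_frame(frame):
    """Return the frame type and its payload."""
    kind = "data" if frame[0] == DATA else "control"
    return kind, frame[1:]

if __name__ == "__main__":
    print(parse_frame(make_data_frame("1011")))
    print(parse_frame(make_control_frame()))
```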
The datalink entity can then be modelled as a finite state machine, containing two states for the receiver and two
states for the sender. The figure below provides a graphical representation of this state machine with the sender
above and the receiver below.
The above FSM shows that the sender has to wait for an acknowledgement from the receiver before being able to
transmit the next SDU. The figure below illustrates the exchange of a few frames between two hosts.
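[Figure: time-sequence diagram between Host A and Host B. DATA.req(a) on Host A causes the transmission of D(a), which is delivered as DATA.ind(a) and acknowledged by C(OK). Then DATA.req(b) causes the transmission of D(b), which is delivered as DATA.ind(b) and acknowledged by C(OK).]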
error detection code matches the computed error detection code. If they match, the frame is considered to be valid.
Many error detection schemes exist and entire books have been written on the subject. A detailed discussion of
these techniques is outside the scope of this book, and we will only discuss some examples to illustrate the key
principles.
To understand error detection codes, let us consider two devices that exchange bit strings containing N bits. To
allow the receiver to detect a transmission error, the sender converts each string of N bits into a string of N+r
bits. Usually, the r redundant bits are added at the beginning or the end of the transmitted bit string, but some
techniques interleave redundant bits with the original bits. An error detection code can be defined as a function
that computes the r redundant bits corresponding to each string of N bits. The simplest error detection code is the
parity bit. There are two types of parity schemes : even and odd parity. With the even (resp. odd) parity scheme,
the redundant bit is chosen so that an even (resp. odd) number of bits are set to 1 in the transmitted bit string of
N+r bits. The receiver can easily recompute the parity of each received bit string and discard the strings with an
invalid parity. The parity scheme is often used when 7-bit characters are exchanged. In this case, the eighth bit is
often a parity bit. The table below shows the parity bits that are computed for bit strings containing three bits.
3 bits string    Odd parity    Even parity
000              1             0
001              0             1
010              0             1
100              0             1
111              0             1
110              1             0
101              1             0
011              1             0
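The computation of the parity bit can be sketched as follows. Bit strings are Python strings of "0" and "1" characters, and the names parity_bit and has_valid_parity are hypothetical.

```python
def parity_bit(bits, scheme="even"):
    """Return the redundant bit that gives the transmitted string the chosen parity."""
    even_bit = bits.count("1") % 2          # 1 if the number of 1s is currently odd
    return str(even_bit if scheme == "even" else 1 - even_bit)

def has_valid_parity(received, scheme="even"):
    """Check a received string of N+1 bits (data bits followed by the parity bit)."""
    expected_remainder = 0 if scheme == "even" else 1
    return received.count("1") % 2 == expected_remainder

if __name__ == "__main__":
    sent = "110" + parity_bit("110", "even")
    print(sent, has_valid_parity(sent))
    print(has_valid_parity("0100"))  # one bit of "1100" flipped: detected
    print(has_valid_parity("0000"))  # two bits of "1100" flipped: not detected
```

The last call shows the limitation discussed next: an even number of bit errors leaves the parity unchanged.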
The parity bit allows a receiver to detect transmission errors that have affected a single bit among the transmitted
N+r bits. If there are two or more bits in error, the receiver may not necessarily be able to detect the transmission
error. More powerful error detection schemes have been defined. Cyclical Redundancy Checks (CRC) are
widely used in datalink layer protocols. An N-bit CRC can detect all transmission errors affecting a burst of
less than N bits in the transmitted frame and, for suitably chosen generator polynomials, all transmission errors that affect an odd number of bits. Additional
details about CRCs may be found in [Williams1993].
It is also possible to design a code that allows the receiver to correct transmission errors. The simplest error
correction code is the triple modular redundancy (TMR). To transmit a bit set to 1 (resp. 0), the sender transmits
111 (resp. 000). When there are no transmission errors, the receiver can decode 111 as 1. If transmission errors
have affected a single bit, the receiver performs majority voting as shown in the table below. This scheme allows
the receiver to correct all transmission errors that affect a single bit.
Received bits    Decoded bit
000              0
001              0
010              0
100              0
111              1
110              1
101              1
011              1
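Triple modular redundancy and its majority vote fit in a few lines. In this sketch, bit strings are Python strings and the names tmr_encode and tmr_decode are hypothetical.

```python
def tmr_encode(bits):
    """Send each bit three times : 1 becomes 111 and 0 becomes 000."""
    return "".join(bit * 3 for bit in bits)

def tmr_decode(received):
    """Decode each group of three received bits by majority voting."""
    decoded = []
    for i in range(0, len(received), 3):
        triple = received[i:i + 3]
        decoded.append("1" if triple.count("1") >= 2 else "0")
    return "".join(decoded)

if __name__ == "__main__":
    print(tmr_encode("10"))
    # One bit in error in each group of three is corrected.
    print(tmr_decode("110001"))
```

The price of this correction capability is that three bits are transmitted for each useful bit.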
Other more powerful error correction codes have been proposed and are used in some applications. The Hamming
Code is a clever combination of parity bits that provides error detection and correction capabilities.
Reliable protocols use error detection schemes, but none of the widely used reliable protocols rely on error correction schemes. To detect errors, a frame is usually divided into two parts :
• a header that contains the fields used by the reliable protocol to ensure reliable delivery. The header contains
a checksum or Cyclical Redundancy Check (CRC) [Williams1993] that is used to detect transmission errors
• a payload that contains the user data
Some headers also include a length field, which indicates the total length of the frame or the length of the payload.
The simplest error detection scheme is the checksum. A checksum is basically an arithmetic sum of all the bytes
that a frame is composed of. There are different types of checksums. For example, an eight bit checksum can be
computed as the arithmetic sum of all the bytes of (both the header and trailer of) the frame. The checksum is
computed by the sender before sending the frame and the receiver verifies the checksum upon frame reception. The
receiver discards frames received with an invalid checksum. Checksums can be easily implemented in software,
but their error detection capabilities are limited. Cyclical Redundancy Checks (CRC) have better error detection
capabilities [SGP98], but require more CPU when implemented in software.
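A simple eight bit checksum can be sketched as follows. Keeping only the low eight bits of the sum (the sum modulo 256) and appending the checksum at the end of the frame are assumptions made for this example, and the function names are hypothetical.

```python
def checksum8(frame):
    """Arithmetic sum of all the bytes of the frame, truncated to eight bits."""
    return sum(frame) % 256

def append_checksum(frame):
    """Sender side: add the checksum byte at the end of the frame (assumption)."""
    return frame + bytes([checksum8(frame)])

def is_valid(received):
    """Receiver side: recompute the checksum and compare it with the received one."""
    return len(received) >= 1 and checksum8(received[:-1]) == received[-1]

if __name__ == "__main__":
    sent = append_checksum(b"hello")
    print(is_valid(sent))
    print(is_valid(b"iello" + sent[-1:]))  # one modified byte: detected
    print(is_valid(b"ehllo" + sent[-1:]))  # two swapped bytes: not detected
```

The last call illustrates why the error detection capabilities of such a checksum are limited: reordering bytes does not change their sum.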
Note: Checksums, CRCs, ...
Most of the protocols in the TCP/IP protocol suite rely on the simple Internet checksum in order to verify that a
received packet has not been affected by transmission errors. Despite its popularity and ease of implementation,
the Internet checksum is not the only available checksum mechanism. Cyclical Redundancy Checks (CRC) are
very powerful error detection schemes that are used notably on disks, by many datalink layer protocols and file
formats such as zip or png. They can easily be implemented efficiently in hardware and have better error-detection
capabilities than the Internet checksum [SGP98]. However, CRCs are sometimes considered to be too CPU-intensive
for software implementations and other checksum mechanisms are preferred. The TCP/IP community
chose the Internet checksum, the OSI community chose the Fletcher checksum [Sklower89]. Nowadays there are
efficient techniques to quickly compute CRCs in software [Feldmeier95].
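[Figure: time-sequence diagram between Host A and Host B. After DATA.req(a), Host A starts a timer and sends D(a), which is delivered as DATA.ind(a) and acknowledged by C(OK); Host A cancels its timer. After DATA.req(b), Host A starts a timer and sends D(b), which is lost. When the timer expires, Host A retransmits D(b), which is delivered as DATA.ind(b) and acknowledged by C(OK).]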
Unfortunately, retransmission timers alone are not sufficient to recover from frame losses. Let us consider, as
an example, the situation depicted below where an acknowledgement is lost. In this case, the sender retransmits
the data frame that has not been acknowledged. Unfortunately, as illustrated in the figure below, the receiver
considers the retransmission as a new frame whose payload must be delivered to its user.
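[Figure: time-sequence diagram between Host A and Host B. D(a) is delivered as DATA.ind(a) and acknowledged by C(OK); Host A cancels its timer. After DATA.req(b), D(b) is delivered as DATA.ind(b) but the acknowledgement C(OK) is lost. When the timer expires, Host A retransmits D(b), which Host B delivers a second time as DATA.ind(b) !!!!! before sending C(OK).]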
To solve this problem, datalink protocols associate a sequence number with each data frame. This sequence number
is one of the fields found in the header of data frames. We use the notation D(x,...) to indicate a data frame whose
sequence number field is set to value x. The acknowledgements also contain a sequence number indicating the data
frame that they acknowledge. We use OKx to indicate an acknowledgement frame that confirms the reception
of D(x,...). The sequence number is encoded as a bit string of fixed length. The simplest reliable protocol is the
Alternating Bit Protocol (ABP).
The Alternating Bit Protocol uses a single bit to encode the sequence number. It can be implemented easily. The
sender and the receiver only require a four-state Finite State Machine.
[Figure: time-sequence diagram of the Alternating Bit Protocol without losses. The sender transmits D(0,a), D(1,b) and D(0,c) in turn; Host B delivers each payload and returns C(OK0), C(OK1) and C(OK0), and the sender cancels its timer after each acknowledgement.]
The Alternating Bit Protocol can recover from the losses of data or control frames. This is illustrated in the two
figures below. The first figure shows the loss of one data segment.
[Figure: time-sequence diagram. D(0,a) is delivered and acknowledged with C(OK0). D(1,b) is lost; the sender's timer expires, D(1,b) is retransmitted, Host B delivers DATA.ind(b) and returns C(OK1).]
[Figure: time-sequence diagram. D(0,a) and D(1,b) are delivered, but C(OK1) is lost. The sender's timer expires and D(1,b) is retransmitted; Host B ignores the duplicate frame and returns C(OK1) again.]
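The recovery behaviour of the Alternating Bit Protocol can be sketched with a small simulation. It is a minimal sketch, not a full implementation: the function run_abp and its lose_data and lose_ack parameters are hypothetical, the link is modelled as a deterministic list of dropped transmissions, and a timer expiry is modelled as an immediate retransmission.

```python
def run_abp(messages, lose_data=(), lose_ack=()):
    """Simulate the Alternating Bit Protocol over a lossy link.

    lose_data / lose_ack hold the 0-based indices of the data or control
    frame transmissions dropped by the link.
    Returns (payloads delivered to the user, number of data frames sent).
    """
    delivered = []
    expected = 0          # receiver: sequence bit of the next new frame
    seq = 0               # sender: sequence bit of the current frame
    data_tx = ack_tx = 0  # counters of transmitted frames
    for payload in messages:
        acked = False
        while not acked:
            # Sender transmits D(seq, payload) and starts its timer.
            data_lost = data_tx in lose_data
            data_tx += 1
            if data_lost:
                continue  # timer expires, the frame is retransmitted
            # Receiver delivers the payload only if the bit is the expected one.
            if seq == expected:
                delivered.append(payload)
                expected ^= 1
            # Receiver acknowledges every frame it receives, duplicate or not.
            ack_lost = ack_tx in lose_ack
            ack_tx += 1
            if ack_lost:
                continue  # timer expires, the frame is retransmitted
            acked = True
        seq ^= 1          # alternate the sequence bit for the next frame
    return delivered, data_tx


print(run_abp(["a", "b", "c"], lose_data={1}))  # (['a', 'b', 'c'], 4)
print(run_abp(["a", "b", "c"], lose_ack={1}))   # (['a', 'b', 'c'], 4)
```

In both runs one extra data frame is sent, and each payload is delivered exactly once.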
The Alternating Bit Protocol can recover from transmission errors and frame losses. However, it has one important
drawback. Consider two hosts that are directly connected by a 50 Kbits/sec satellite link that has a 250-millisecond
propagation delay. If these hosts send 1000-bit frames, then the maximum throughput that can be achieved by the
alternating bit protocol is one frame every 20 + 250 + 250 = 520 milliseconds if we ignore the transmission time
of the acknowledgement. This is less than 2 Kbits/sec!
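The arithmetic behind this figure can be checked with a few lines. The variable names are illustrative; the values are those of the example above.

```python
link_rate = 50_000         # bits per second
frame_size = 1000          # bits
propagation_delay = 0.250  # seconds, one way

# Time to push one frame onto the link: 1000 / 50000 = 20 ms.
transmission_time = frame_size / link_rate
# One frame per cycle: transmission, propagation of the frame,
# propagation of the acknowledgement (its transmission time is ignored).
cycle_time = transmission_time + 2 * propagation_delay
throughput = frame_size / cycle_time  # bits per second

print(round(cycle_time * 1000), "ms per frame")  # 520 ms per frame
print(round(throughput), "bits/sec")             # 1923 bits/sec
```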
Go-back-n and selective repeat
To overcome the performance limitations of the alternating bit protocol, reliable protocols rely on pipelining. This
technique allows a sender to transmit several consecutive frames without being forced to wait for an acknowledgement after each frame. Each data frame contains a sequence number encoded in an n-bit field.
Pipelining allows the sender to transmit frames at a higher rate. However this higher transmission rate may
[...]
in the returned acknowledgement the list of the sequence numbers of all frames that have already been received.
Such acknowledgements are sometimes called selective acknowledgements. This is illustrated in the figure above.
In the figure above, when the sender receives C(OK,0,[2]), it knows that all frames up to and including D(0,...)
have been correctly received. It also knows that frame D(2,...) has been received and can cancel the retransmission
timer associated with this frame. However, this frame should not be removed from the sending buffer before the
reception of a cumulative acknowledgement (C(OK,2) in the figure above) that covers this frame.
Note: Maximum window size with go-back-n and selective repeat
A reliable protocol that uses n bits to encode its sequence number can send up to $2^n$ successive frames. However, to
ensure a reliable delivery of the frames, go-back-n and selective repeat cannot use a sending window of $2^n$ frames.
Consider first go-back-n and assume that a sender sends $2^n$ frames. These frames are received in-sequence by the
destination, but all the returned acknowledgements are lost. The sender will retransmit all frames. These frames
will all be accepted by the receiver and delivered a second time to the user. It is easy to see that this problem
can be avoided if the maximum size of the sending window is $2^n - 1$ frames. A similar problem occurs with
selective repeat. However, as the receiver accepts out-of-sequence frames, a sending window of $2^n - 1$ frames
is not sufficient to ensure a reliable delivery. It can be easily shown that to avoid this problem, a selective repeat
sender cannot use a window that is larger than $2^n / 2$ frames.
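The go-back-n scenario described in the note on the maximum window size can be replayed with a short sketch. It assumes a receiver that only accepts the frame whose sequence number is the one it expects next, and a sender whose acknowledgements are all lost. The function name duplicates_delivered is hypothetical.

```python
def duplicates_delivered(n_bits: int, window: int) -> int:
    """Count the frames delivered twice when every acknowledgement is lost.

    The sender transmits `window` frames (window <= 2**n_bits), receives no
    acknowledgement and retransmits all of them.
    """
    modulus = 2 ** n_bits
    expected = 0   # receiver: next sequence number that it accepts
    accepted = 0
    for _ in range(2):              # first transmission, then retransmission
        for i in range(window):
            seq = i % modulus
            if seq == expected:     # in-sequence frame: deliver it
                accepted += 1
                expected = (expected + 1) % modulus
    return accepted - window        # frames accepted beyond the originals


print(duplicates_delivered(3, 8))  # 8: every retransmission is accepted again
print(duplicates_delivered(3, 7))  # 0: the retransmissions are all rejected
```

With a window of $2^n$ frames the expected sequence number wraps around to 0, so the receiver cannot distinguish the retransmissions from new frames.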
Before explaining the network layer in detail, it is useful to remember the characteristics of the service provided
by the datalink layer. There are many variants of the datalink layer. Some provide a reliable service while others
do not provide any guarantee of delivery. The reliable datalink layer services are popular in environments such
as wireless networks where transmission errors are frequent. On the other hand, unreliable services are usually
used when the physical layer provides an almost reliable service (i.e. only a negligible fraction of the frames are
affected by transmission errors). Such almost reliable services are frequent in wired and optical networks. In
this chapter, we will assume that the datalink layer service provides an almost reliable service since this is both
the most general one and also the most widely deployed one.
Even if we only consider the point-to-point datalink layers, there is an important characteristic of these layers that
we cannot ignore. No datalink layer is able to send frames of unlimited size. Each datalink layer is characterized
by a maximum frame size. There are more than a dozen different datalink layers and unfortunately most of them
use a different maximum frame size. This heterogeneity in the maximum frame sizes will cause problems when
we need to exchange data between hosts attached to different types of datalink layers.
As a first step, let us assume that we only need to exchange a small amount of data. In this case, there is no issue
with the maximum length of the frames. However, there are other more interesting problems that we need to
tackle. To understand these problems, let us consider the network represented in the figure below.
[Figure: example network in which end hosts such as A and C are interconnected by routers R1 to R5.]
This network contains two types of devices: the end hosts, represented as small workstations, and the routers,
represented as boxes with three arrows. An endhost is a device which is able to send and receive data for its own
usage, in contrast with routers that most of the time forward data towards their final destination. Routers have
multiple links to neighboring routers or endhosts. Endhosts are usually attached via a single link to the network.
Nowadays, with the growth of wireless networks, more and more endhosts are equipped with several physical
interfaces. These endhosts are often called multihomed. Still, using several interfaces at the same time often leads
to practical issues that are beyond the scope of this document. For this reason, we will only consider single-homed
hosts in this ebook.
To understand the key principles behind the operation of a network, let us analyse all the operations that need to
be performed to allow host A in the above network to send one byte to host B. Thanks to the datalink layer used
above the A-R1 link, host A can easily send a byte to router R1 inside a frame. However, upon reception of this
frame, router R1 needs to understand that the byte is destined to host B and not to itself. This is the objective of
the network layer.
The network layer enables the transmission of information, through intermediate routers, between hosts that are
not directly connected. This transmission is carried out by putting the information to be transmitted inside a data
structure which is called a packet. Like a frame, a packet contains both useful data and control information.
An important issue in the network layer is the ability to identify a
node (host or router) inside the network. This identification is performed by associating an address to each node.
An address is usually represented as a sequence of bits. Most networks use fixed-length addresses. At this stage,
let us simply assume that each of the nodes in the above network has an address which corresponds to the binary
representation of its name on the figure.
To send one byte of information to host B, host A needs to place this information inside a packet. In addition to the
data being transmitted, the packet must also contain either the addresses of the source and the destination nodes
or information that indicates the path that needs to be followed to reach the destination.
There are two possible organisations for the network layer:
• datagram
• virtual circuits
[...]
will never reach its final destination. As in the black hole case, the destination is not reachable from all sources in
the network. However, in practice the loop problem is worse than the black hole problem because when a packet is
caught in a forwarding loop, it unnecessarily consumes bandwidth. In the black hole case, the problematic packet
is quickly discarded. We will see later that network layer protocols include techniques to minimize the impact of
such forwarding loops.
Any solution which is used to compute the forwarding tables of a network must ensure that all destinations are
reachable from any source. This implies that it must guarantee the absence of black holes and forwarding loops.
The forwarding tables and the precise format of the packets that are exchanged inside the network are part of
the data plane of the network. This data plane contains all the protocols and algorithms that are used by hosts
and routers to create and process the packets that contain user data. On high-end routers, the data plane is often
implemented in hardware for performance reasons.
Besides the data plane, a network is also characterized by its control plane. The control plane includes all the
protocols and algorithms (often distributed) that are used to compute the forwarding tables that are installed on
all routers inside the network. While there is only one possible data plane for a given networking technology,
different networks using the same technology may use different control planes. The simplest control plane for
a network is to compute the forwarding tables of all routers inside the network manually. This simple
control plane is sufficient when the network is (very) small, usually up to a few routers.
In most networks, manual forwarding tables are not a solution for two reasons. First, most networks are too large
to enable a manual computation of the forwarding tables. Second, with manually computed forwarding tables,
it is very difficult to deal with link and router failures. Networks need to operate 24 hours a day, 365 days a year.
During the lifetime of a network, many events can affect the routers and links that it contains. Link failures are
regular events in deployed networks. Links can fail for various reasons, including electromagnetic interference,
fiber cuts, and hardware or software problems on the terminating routers. Some links also need to be added to the
network or removed because their utilisation is too low or their cost is too high. Similarly, routers also fail. There
are two types of failures that affect routers. A router may stop forwarding packets due to a hardware or software
problem (e.g. due to a crash of its operating system). A router may also need to be halted from time to time (e.g.
to upgrade its operating system to fix some bugs). These planned and unplanned events affect the set of links and
routers that can be used to forward packets in the network. Still, most network users expect that their network will
continue to correctly forward packets despite all these events. With manually computed forwarding tables, it is
usually impossible to precompute the forwarding tables while taking into account all possible failure scenarios.
An alternative to manually computed forwarding tables is to use a network management platform that tracks the
network status and can push new forwarding tables on the routers when it detects any modification to the network
topology. This solution gives some flexibility to the network managers in computing the paths inside their network.
However, this solution only works if the network management platform is always capable of reaching all routers
even when the network topology changes. This may require a dedicated network that allows the management
platform to push information on the forwarding tables.
Another point worth discussing is when the forwarding tables are computed. A widely
used solution is to compute the entries of the forwarding tables for all destinations on all routers. This ensures that
each router has a valid route towards each destination. These entries can be updated when an event occurs and the
network topology changes. A drawback of this approach is that the forwarding tables can become large in large
networks since each router must maintain one entry for each destination at all times inside its forwarding table.
Some networks use the arrival of packets as the trigger to compute the corresponding entries in the forwarding
tables. Several technologies have been built upon this principle. When a packet arrives, the router consults its
forwarding table to find a path towards the destination. If the destination is present in the forwarding table, the
packet is forwarded. Otherwise, the router needs to find a way to forward the packet and update its forwarding
table.
Computing forwarding tables
Several techniques to update the forwarding tables upon the arrival of a packet have been used in deployed networks. In this section, we briefly present the principles that underlie three of these techniques.
The first technique assumes that the underlying network topology is a tree. A tree is the simplest network to be
considered when forwarding packets. The main advantage of using a tree is that there is only one path between
any pair of nodes inside the network. Since a tree does not contain any cycle, it is impossible to have forwarding
loops in a tree-shaped network.
In a tree-shaped network, it is relatively simple for each node to automatically compute its forwarding table by
inspecting the packets that it receives. For this, each node uses the source and destination addresses present inside
each packet. The source address allows a node to learn the location of the different sources inside the network. Each
source has a unique address. When a node receives a packet over a given interface, it learns that the source
(address) of this packet is reachable via this interface. The node maintains a data structure that maps each known
source address to an incoming interface. This data structure is often called the port-address table since it indicates
the interface (or port) to reach a given address. Learning the location of the sources is not sufficient: nodes also
need to forward packets towards their destination. When a node receives a packet whose destination address is
already present inside its port-address table, it simply forwards the packet on the interface listed in the port-address
table. In this case, the packet will follow the port-address table entries in the downstream nodes and will reach
the destination. If the destination address is not included in the port-address table, the node simply forwards the
packet on all its interfaces, except the interface from which the packet was received. Forwarding a packet over
all interfaces is usually called broadcasting in the terminology of computer networks. Sending the packet over all
interfaces except one is a costly operation since the packet will be sent over links that do not reach the destination.
Given the tree-shape of the network, the packet will explore all downstream branches of the tree and will thus
finally reach its destination. In practice, the broadcasting operation does not occur too often and its cost is limited.
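The learning and forwarding rules described above can be sketched as follows. This is a minimal sketch under stated assumptions: the class name LearningNode, its receive method and the port names are hypothetical, and a node is reduced to its port-address table.

```python
class LearningNode:
    """Node that builds its port-address table from the packets it receives."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.port_address_table = {}  # address -> port

    def receive(self, source, destination, in_port):
        """Return the list of ports on which the packet must be forwarded."""
        # Learn that the source is reachable via the incoming port.
        self.port_address_table[source] = in_port
        if destination in self.port_address_table:
            # Known destination: forward on the listed port only.
            return [self.port_address_table[destination]]
        # Unknown destination: broadcast on all ports except the incoming one.
        return [port for port in self.ports if port != in_port]


node = LearningNode(["North", "South", "East"])
print(node.receive("A", "B", "North"))  # ['South', 'East'] (B is unknown)
print(node.receive("B", "A", "South"))  # ['North'] (A was learned earlier)
```

After these two packets the table maps A to North and B to South, so later packets between A and B are no longer broadcast.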
To understand the operation of the port-address table, let us consider the example network shown in the figure
below. This network contains three hosts, A, B and C, and five nodes, R1 to R5. When the network boots, all the
forwarding tables of the nodes are empty.
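[Figure: tree-shaped network. Host A is attached to R1, R1 is connected to R2 and R3, host C is attached to R2, R3 is connected to R4 and R5, and host B is attached to R5.]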
Host A sends a packet towards B. When receiving this packet, R1 learns that A is reachable via its North interface.
Since it does not have an entry for destination B in its port-address table, it forwards the packet to both R2 and
R3. When R2 receives the packet, it updates its own forwarding table and forwards the packet to C. Since C is not
the intended recipient, it simply discards the received packet. Node R3 also receives the packet. It learns that A is
reachable via its North interface and broadcasts the packet to R4 and R5. R5 also updates its forwarding table and
finally forwards the packet to destination B.
Let us now consider what happens when B sends a reply to A. R5 first learns
that B is attached to its South port. It then consults its port-address table and finds that A is reachable via its North
interface. The packet is then forwarded hop-by-hop to A without any broadcasting. If C sends a packet to B, this
packet will reach R1, which contains a valid entry for B in its forwarding table.
By inspecting the source and destination addresses of packets, network nodes can automatically derive their forwarding tables. As we will discuss later, this technique is used in Ethernet networks. Despite being widely used,
it has two important drawbacks. First, packets sent to unknown destinations are broadcast in the network even
if the destination is not attached to the network. Consider the transmission of ten packets destined to Z in the
network above. When a node receives a packet towards this destination, it can only broadcast the packet. Since
Z is not attached to the network, no node will ever receive a packet whose source is Z to update its forwarding
table. The second and more important problem is that few networks have a tree-shaped topology. It is interesting
to analyze what happens when a port-address table is used in a network that contains a cycle. Consider the simple
network shown below with a single host.
[Figure: network containing a cycle formed by nodes R1, R2 and R3.]
Assume that the network has started and all port-address and forwarding tables are empty. Host A sends a packet
towards B. Upon reception of this packet, R1 updates its port-address table. Since B is not present in the port-address table, the packet is broadcast. Both R2 and R3 receive a copy of the packet sent by A. They both update
their port-address table. Unfortunately, they also both broadcast the received packet. B receives a first copy of the
packet, but R3 and R2 receive it again. R3 will then broadcast this copy of the packet to B and R1 while R2 will
broadcast its copy to R1. Although B has already received two copies of the packet, the packet is still inside the network
and will continue to loop. Due to the presence of the cycle, a single packet towards an unknown destination
generates copies of this packet that loop and will saturate the network bandwidth. Network operators who are
using port-address tables to automatically compute the forwarding tables also use distributed algorithms to ensure
that the network topology is always a tree.
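The looping behaviour can be reproduced with a small simulation. It is a sketch under assumptions: the layout in LINKS (R1, R2 and R3 forming a triangle, host A attached to R1 and host B to R3) is inferred from the description above, B never sends a packet so it remains an unknown destination, and every copy takes one round to cross a link. The function name flood is hypothetical.

```python
# Assumed layout: R1, R2 and R3 form a cycle, A is attached to R1, B to R3.
LINKS = {
    "R1": ["A", "R2", "R3"],
    "R2": ["R1", "R3"],
    "R3": ["R1", "R2", "B"],
}


def flood(rounds):
    """Broadcast one packet from A towards the unknown destination B.

    Returns (copies still in flight, copies received by B) after `rounds` hops.
    """
    in_flight = [("A", "R1")]  # (sender, receiver) of each copy
    copies_at_b = 0
    for _ in range(rounds):
        next_round = []
        for sender, receiver in in_flight:
            if receiver == "B":
                copies_at_b += 1
                continue
            # Hosts do not forward; nodes broadcast on all other interfaces.
            for neighbour in LINKS.get(receiver, []):
                if neighbour != sender:
                    next_round.append((receiver, neighbour))
        in_flight = next_round
    return len(in_flight), copies_at_b


print(flood(4))    # (4, 2): B already holds two copies, four more are in flight
print(flood(100))  # copies are still circulating after 100 rounds
```

In this small triangle the copies circulate forever in both directions, and B keeps receiving duplicates of a packet that was sent once.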
Another technique, called source routing, can be used to automatically compute forwarding tables. It has been used in interconnecting
Token Ring networks and in some wireless networks. Intuitively, source routing enables a destination to automatically discover the paths from a given source towards itself. This technique requires nodes to change some
information inside some packets. For simplicity, let us assume that the data plane supports two types of packets:
• the data packets
• the control packets
Data packets are used to exchange data while control packets are used to discover the paths between endhosts.
With source routing, network nodes can be kept as simple as possible and all the complexity is placed on the
endhosts. This is in contrast with the previous technique where the nodes had to maintain a port-address and
a forwarding table while the hosts simply sent and received packets. Each node is configured with one unique
address and there is one identifier per outgoing link. For simplicity and to avoid cluttering the figures with those
identifiers, we will assume that each node uses as link identifiers north, west, south, ... In practice, a node would
associate one integer to each outgoing link.
[Figure: example network with hosts A and B and nodes R1 to R4; R2 is connected to both R1 and R3.]
In the network above, node R2 is attached to two outgoing links. R2 is connected to both R1 and R3. R2 can
easily determine that it is connected to these two nodes by exchanging packets with them or observing the packets
that it receives over each interface. Assume for example that when a host or node starts, it sends a special control
packet over each of its interfaces to advertise its own address to its neighbors. When a host or node receives such a
packet, it automatically replies with its own address. This exchange can also be used to verify whether a neighbor,
either node or host, is still alive.
With source routing, the data plane packets include a list of identifiers. This list
is called a source route and indicates the path to be followed by the packet as a sequence of link identifiers. When
a node receives such a data plane packet, it first checks whether the packet's destination is a direct neighbor. In this
case, the packet is forwarded to the destination. Otherwise, the node extracts the next address from the list and
forwards the packet to the corresponding neighbor. This allows the source to specify the explicit path to be followed for each packet. For
example, in the figure above there are two possible paths between A and B. To use the path via R2, A would send a
packet that contains R1,R2,R3 as source route. To avoid going via R2, A would place R1,R3 as the source route in
its transmitted packet. If A knows the complete network topology and all link identifiers, it can easily compute the
source route towards each destination. If needed, it could even use different paths, e.g. for redundancy, to reach a
given destination. However, in a real network hosts do not usually have a map of the entire network topology.
In networks that rely on source routing, hosts use control packets to automatically discover the best path(s). In
addition to the source and destination addresses, control packets contain a list that records the intermediate nodes.
This list is often called the record route because it records the route followed by a given packet. When a
node receives a control packet, it first checks whether its address is included in the record route. If yes, the control
packet is silently discarded. Otherwise, it adds its own address to the record route and forwards the packet to all
its interfaces, except the interface over which the packet has been received. Thanks to this, the control packet will
be able to explore all paths between a source and a given destination.
For example, consider again the network topology above. A sends a control packet towards B. The initial record
route is empty. When R1 receives the packet, it adds its own address to the record route and forwards a copy to R2
and another to R3. R2 receives the packet, adds itself to the record route and forwards it to R3. R3 receives two
copies of the packet. The first contains the [R1,R2] record route and the second [R1]. In the end, B will receive
two control packets containing [R1,R2,R3,R4] and [R1,R3,R4] as record routes. B can keep these two paths or
select the best one and discard the second. A popular heuristic is to select the record route of the first received
packet as being the best one since this likely corresponds to the shortest delay path.
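The discovery process of this example can be sketched with a breadth-first flood of the control packet. The topology in links is an assumption inferred from the record routes listed above (A attached to R1, R1 connected to R2 and R3, R2 to R3, R3 to R4, and B attached to R4). The function name discover_routes is hypothetical, and every link is assumed to have the same delay.

```python
from collections import deque


def discover_routes(links, source, destination):
    """Flood a control packet and return the record routes, in arrival order."""
    routes = []
    # Each pending copy: (node receiving it, node it came from, record route).
    pending = deque((n, source, []) for n in links[source])
    while pending:
        node, previous, record = pending.popleft()
        if node == destination:
            routes.append(record)
            continue
        if node in record:
            continue  # already visited: the copy is silently discarded
        new_record = record + [node]
        for neighbour in links[node]:
            if neighbour != previous:  # all interfaces except the incoming one
                pending.append((neighbour, node, new_record))
    return routes


links = {
    "A": ["R1"],
    "R1": ["A", "R2", "R3"],
    "R2": ["R1", "R3"],
    "R3": ["R1", "R2", "R4"],
    "R4": ["R3", "B"],
    "B": ["R4"],
}
print(discover_routes(links, "A", "B"))
# [['R1', 'R3', 'R4'], ['R1', 'R2', 'R3', 'R4']]
```

Under the equal-delay assumption the first record route received by B is also the shortest one, which matches the heuristic mentioned above.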
With the received record route, B can send a data packet to A. For this, it simply reverses the chosen record route.
However, we still need to communicate the chosen path to A. This can be done by putting the record route inside
a control packet which is sent back to A over the reverse path. An alternative is to simply send a data packet back
to A. This packet will travel back to A. To allow A to inspect the entire path followed by the data packet, its source
route must contain all intermediate routers when it is received by A. This can be achieved by encoding the source
route using a data structure that contains an index and the ordered list of node addresses. The index always points
to the next address in the source route. It is initialized at 0 when a packet is created and incremented by each
intermediate node.
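One possible encoding of such a source route is sketched below. The names SourceRoutedPacket and process_at_node are hypothetical, and the exact semantics of the index are an assumption: the sender transmits the packet to the address at position 0, and each intermediate node increments the index before forwarding the packet to the address it now points to.

```python
from dataclasses import dataclass


@dataclass
class SourceRoutedPacket:
    payload: bytes
    route: list     # ordered list of node addresses, never modified in transit
    index: int = 0  # position of the next address in the route


def process_at_node(packet):
    """Run by an intermediate node: advance the index and return the next hop.

    Returns None when the route is exhausted, i.e. when the destination is
    expected to be a direct neighbor of the current node.
    """
    packet.index += 1
    if packet.index < len(packet.route):
        return packet.route[packet.index]
    return None


# B replies to A by reversing the record route [R1, R3, R4].
packet = SourceRoutedPacket(b"reply", ["R4", "R3", "R1"])
visited = [packet.route[packet.index]]  # first hop chosen by B
while (next_hop := process_at_node(packet)) is not None:
    visited.append(next_hop)

print(visited)       # ['R4', 'R3', 'R1']
print(packet.route)  # ['R4', 'R3', 'R1'], still complete when A receives it
```

Because nodes only move the index and never remove addresses, A can read the entire path from the packet that it receives.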
Flat or hierarchical addresses
The last, but important, point to discuss about the data plane of the networks that rely on the datagram mode is
their addressing scheme. In the examples above, we have used letters to represent the addresses of the hosts and
network nodes. In practice, all addresses are encoded as a bit string. Most network technologies use a fixed size
bit string to represent source and destination address. These addresses can be organized in two different ways.
The first organisation, which is the one that we have implicitly assumed until now, is the flat addressing scheme.
Under this scheme, each host and network node has a unique address. The uniqueness of the addresses is important for
the operation of the network. If two hosts have the same address, it can become difficult for the network to forward
packets towards this destination. Flat addresses are typically used in situations where network nodes and hosts
need to be able to communicate immediately with unique addresses. These flat addresses are often embedded
inside the hardware of network interface cards. The network card manufacturer creates one unique address for
each interface and this address is stored in the read-only memory of the interface. An advantage of this addressing
scheme is that it easily supports ad-hoc and mobile networks. When a host moves, it can attach to another network
and remain confident that its address is unique and enables it to communicate inside the new network.
With flat addressing the lookup operation in the forwarding table can be implemented as an exact match. The
forwarding table contains the (sorted) list of all known destination addresses. When a packet arrives, a network
node only needs to check whether its destination address is part of the forwarding table or not. In software, this is an
O(log(n)) operation if the list is sorted. In hardware, Content Addressable Memories can perform this lookup
operation efficiently, but their size is usually limited.
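The exact-match lookup in a sorted table can be sketched with a binary search. The sketch uses bisect_left from Python's standard bisect module; the function name lookup, the sample addresses and the port names are illustrative.

```python
from bisect import bisect_left


def lookup(addresses, ports, destination):
    """Exact-match lookup in a forwarding table sorted by address.

    `addresses` is the sorted list of known addresses and `ports` the
    outgoing port of each of them. Returns the port, or None if unknown.
    """
    i = bisect_left(addresses, destination)  # binary search, O(log n)
    if i < len(addresses) and addresses[i] == destination:
        return ports[i]
    return None


addresses = [0x0A, 0x1F, 0x2B, 0x7C]  # sorted flat addresses
ports = ["North", "South", "East", "West"]
print(lookup(addresses, ports, 0x2B))  # East
print(lookup(addresses, ports, 0x2C))  # None
```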
A drawback of the flat addressing scheme is that the forwarding tables grow linearly with the number of hosts and
nodes in the network. With this addressing scheme, each forwarding table must contain an entry that points to
every address reachable inside the network. Since large networks can contain tens of millions of hosts or more,
this is a major problem on network nodes that need to be able to quickly forward packets. As an illustration, it is
interesting to consider the case of an interface running at 10 Gbps. Such interfaces are found on high-end servers
and in various network nodes today. Assuming a packet size of 1000 bits, a pretty large and conservative number,
such an interface must forward ten million packets every second. This implies that a network node that receives
packets over such a link must forward one 1000-bit packet every 100 nanoseconds. This is the same order of
magnitude as the memory access times of old DRAMs.
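The packet rate and the time budget per packet follow directly from the link rate and the packet size. The variable names are illustrative.

```python
link_rate = 10 * 10**9  # 10 Gbps, in bits per second
packet_size = 1000      # bits

packets_per_second = link_rate // packet_size    # 10 million packets
time_per_packet_ns = 10**9 / packets_per_second  # nanoseconds per packet

print(packets_per_second)  # 10000000
print(time_per_packet_ns)  # 100.0
```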
A widely used alternative to the flat addressing scheme is the hierarchical addressing scheme. This addressing
scheme builds upon the fact that networks usually contain many more hosts than network nodes. In this case, a
first solution to reduce the size of the forwarding tables is to create a hierarchy of addresses. This is the solution
chosen by the post office, where addresses contain a country, sometimes a state or province, a city, a street and
finally a street number. When an envelope is forwarded by a post office in a remote country, it only looks at
the destination country, while a post office in the same province will look at the city information. Only the post
office responsible for a given city will look at the street name and only the postman will use the street number.
Hierarchical addresses provide a similar solution for network addresses. For example, the address of an Internet
host attached to a campus network could contain in the high-order bits an identification of the Internet Service
Provider (ISP) that serves the campus network. Then, a subsequent block of bits identifies the campus network
which is one of the customers of the ISP. Finally, the low-order bits of the address identify the host in the
campus network.
This hierarchical allocation of addresses can be applied in any type of network. In practice, the allocation of
the addresses must follow the network topology. Usually, this is achieved by dividing the addressing space in
consecutive blocks and then allocating these blocks to different parts of the network. In a small network, the
simplest solution is to allocate one block of addresses to each network node and assign the host addresses from
the attached node.
[Figure: network with four routers, R1 to R4; host A is attached to R1 and host B to R4.]
In the above figure, assume that the network uses 16-bit addresses and that the prefix 01001010 has been assigned
to the entire network. Since the network contains four routers, the network operator could assign one block
of sixty-four addresses to each router. R1 would use address 0100101000000000 while A could use address
0100101000000001. R2 could be assigned all addresses from 0100101001000000 to 0100101001111111. R4
could then use 0100101011000000 and assign 0100101011000001 to B. Other allocation schemes are possible.
For example, R3 could be allocated a larger block of addresses than R2 and R4 could use a sub-block from R3's
address block.
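The block boundaries of this allocation can be recomputed with a few lines. The constants come from the example above; the function name block and the numbering of the routers (0 for R1 up to 3 for R4) are illustrative choices.

```python
PREFIX = "01001010"  # 8-bit prefix assigned to the entire network
ADDRESS_BITS = 16
BLOCK_SIZE = 64      # addresses per router


def block(router_index):
    """Return the first and last address of a router's block as bit strings."""
    network_base = int(PREFIX, 2) << (ADDRESS_BITS - len(PREFIX))
    first = network_base + router_index * BLOCK_SIZE
    last = first + BLOCK_SIZE - 1
    return format(first, f"0{ADDRESS_BITS}b"), format(last, f"0{ADDRESS_BITS}b")


for index, name in enumerate(["R1", "R2", "R3", "R4"]):
    print(name, *block(index))
# R1 0100101000000000 0100101000111111
# R2 0100101001000000 0100101001111111
# R3 0100101010000000 0100101010111111
# R4 0100101011000000 0100101011111111
```

The 8 bits that follow the prefix give 256 addresses, which is exactly four blocks of sixty-four.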
The main advantage of hierarchical addresses is that it is possible to significantly reduce the size of the forwarding
tables. In many networks, the number of nodes can be several orders of magnitude smaller than the number of
hosts. A campus network may contain a few dozen network nodes for thousands of hosts. The largest Internet
Service Providers typically contain no more than a few tens of thousands of network nodes but still serve tens or
hundreds of millions of hosts.
Despite their popularity, hierarchical addresses have some drawbacks. Their first drawback is that a lookup in
the forwarding table is more complex than when using flat addresses. For example, on the Internet, network
nodes have to perform a longest-match to forward each packet. This is partially compensated by the reduction in
the size of the forwarding tables, but the additional complexity of the lookup operation has made hardware
support for packet forwarding more difficult to implement. A second drawback of the utilisation of hierarchical addresses
is that when a host connects for the first time to a network, it must contact one network node to determine its own
address. This requires some packet exchanges between the host and some network nodes. Furthermore, if a host
moves and is attached to another network node, its network address will change. This can be an issue with some
mobile hosts.
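A longest-match lookup can be sketched with a linear scan over prefixes written as bit strings. This only illustrates the matching rule and is not an efficient implementation. The function name longest_match and the content of the table are hypothetical; the prefixes reuse the 16-bit example above.

```python
def longest_match(forwarding_table, address):
    """Return the next hop of the longest prefix that matches `address`.

    `forwarding_table` maps bit-string prefixes to next hops and `address`
    is a bit string. Returns None when no prefix matches.
    """
    best = None
    for prefix in forwarding_table:
        if address.startswith(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    return None if best is None else forwarding_table[best]


# Hypothetical table: one entry for the whole network, two more specific ones.
table = {
    "01001010": "R1",
    "0100101001": "R2",
    "0100101011": "R4",
}
print(longest_match(table, "0100101011000001"))  # R4 (10-bit prefix wins)
print(longest_match(table, "0100101000000001"))  # R1 (only the 8-bit prefix matches)
print(longest_match(table, "1111000000000000"))  # None
```

Unlike the exact match used with flat addresses, several entries can match the same address, and the lookup must compare their lengths.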
Dealing with heterogeneous datalink layers
Sometimes, the network layer needs to deal with heterogeneous datalink layers. For example, two hosts connected
to different datalink layers exchange packets via routers that are using other types of datalink layers. Thanks to
the network layer, this exchange of packets is possible provided that each packet can be placed inside a datalink
layer frame before being transmitted. If all datalink layers support the same frame size, this is simple. When a
node receives a frame, it decapsulates the packet that it contains, checks the header and forwards it, encapsulated
inside another frame, to the outgoing interface. Unfortunately, the encapsulation operation is not always possible.
Each datalink layer is characterized by the maximum frame size that it supports. Datalink layers typically support
frames containing up to a few hundreds or a few thousands of bytes. The maximum frame size that a given datalink
layer supports depends on its underlying technology and unfortunately, most datalink layers support a different
maximum frame size. This implies that when a host sends a large packet inside a frame to its nexthop router, there
is a risk that this packet will have to traverse a link that is not capable of forwarding the packet inside a single
frame. In principle, there are three possibilities to solve this problem. We will discuss them by considering a
simpler scenario with two hosts connected to a router as shown in the figure below.
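[Figure: host A, router R1, router R2 and host B connected in a line. The A-R1 and R2-B links support frames of at most 1000 bytes; the R1-R2 link supports frames of at most 500 bytes.]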
Consider, in the network above, that host A wants to send a 900-byte packet (870 bytes of payload and 30 bytes
of header) to host B via router R1. Host A encapsulates this packet inside a single frame. The frame is received by
router R1 which extracts the packet. Router R1 has three possible options to process this packet.
1. The packet is too large and router R1 cannot forward it to router R2. It rejects the packet and
sends a control packet back to the source (host A) to indicate that it cannot forward packets
longer than 500 bytes (minus the packet header). The source will have to react to this control
packet by retransmitting the information in smaller packets.
2. The network layer is able to fragment a packet. In our example, the router could fragment the
packet in two parts. The first part contains the beginning of the payload and the second the end.
There are two possible ways to achieve this fragmentation.
1. Router R1 fragments the packet in two fragments before transmitting them to router R2. Router
R2 reassembles the two fragments into the original packet before transmitting it on the link
towards host B.
2. Each of the packet fragments is a valid packet that contains a header with the source (host A)
and destination (host B) addresses. When router R2 receives a packet fragment, it treats this
packet as a regular packet and forwards it to its final destination (host B). Host B reassembles
the received fragments.
These three solutions have advantages and drawbacks. With the first solution, routers remain simple and do
not need to perform any fragmentation operation. This is important when routers are implemented mainly in
hardware. However, hosts are more complex since they need to store the packets that they produce if they need
to pass through a link that does not support large packets. This increases the buffering required on the end hosts.
Furthermore, a single large packet may potentially need to be retransmitted several times. Consider for example a
network similar to the one shown above but with four routers. Assume that the link R1->R2 supports 1000 bytes
packets, link R2->R3 800 bytes packets and link R3->R4 600 bytes packets. A host attached to R1 that sends a large
packet will have to first try 1000 bytes, then 800 bytes and finally 600 bytes. Fortunately, this scenario does not
occur very often in practice and this is the reason why this solution is used in real networks.
Fragmenting packets on a per-link basis, as presented for the second solution, can minimize the transmission
overhead since a packet is only fragmented on the links where fragmentation is required. Large packets can
continue to be used downstream of a link that only accepts small packets. However, this reduction of the overhead
comes with two drawbacks. First, fragmenting packets, potentially on all links, increases the processing time
and the buffer requirements on the routers. Second, this solution leads to a longer end-to-end delay since the
downstream router has to reassemble all the packet fragments before forwarding the packet.
The last solution is a compromise between the two others. Routers need to perform fragmentation but they do not
need to reassemble packet fragments. Only the hosts need to have buffers to reassemble the received fragments.
This solution has a lower end-to-end delay and requires less processing time and memory on the routers.
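The last solution can be illustrated with the 900-byte packet of the example. The sketch below is a simplified model, not the format of any real protocol: the offset and the "more fragments" flag stored with each fragment are hypothetical header fields introduced for the illustration.

```python
HEADER_SIZE = 30  # bytes of header per packet, as in the example above

def fragment(payload, max_packet_size, header_size=HEADER_SIZE):
    """Split payload into (offset, more_fragments, chunk) tuples.

    Each fragment is a valid packet with its own header, so only
    max_packet_size - header_size payload bytes fit in each fragment.
    """
    room = max_packet_size - header_size
    if room <= 0:
        raise ValueError("link cannot carry any payload")
    fragments = []
    for offset in range(0, len(payload), room):
        chunk = payload[offset:offset + room]
        more = offset + room < len(payload)
        fragments.append((offset, more, chunk))
    return fragments

def reassemble(fragments):
    """Rebuild the payload on the destination host, whatever the arrival order."""
    ordered = sorted(fragments, key=lambda f: f[0])
    return b"".join(chunk for _, _, chunk in ordered)

payload = bytes(870)             # 870 bytes of payload
frags = fragment(payload, 500)   # the R1-R2 link accepts at most 500 bytes
print([len(chunk) + HEADER_SIZE for _, _, chunk in frags])
```

In this model the router only calls `fragment`, while `reassemble` runs on host B, which matches the division of work described for the last solution.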
The first solution to the fragmentation problem presented above suggests the utilization of control packets to
inform the source about the reception of a too long packet. This is only one of the functions that are performed by
the control protocol in the network layer. Other functions include :
• sending a control packet back to the source if a packet is received by a router that does not have a valid entry
in its forwarding table
• sending a control packet back to the source if a router detects that a packet is looping inside the network
• verifying that packets can reach a given destination
We will discuss these functions in more detail when we describe the protocols that are used in the network
layer of the TCP/IP protocol suite.
isation, each data packet contains one label 2 . A label is an integer which is part of the packet header. Network
nodes implement label switching to forward labelled data packets. Upon reception of a packet, a network node
consults its label forwarding table to find the outgoing interface for this packet. In contrast with the datagram
mode, this lookup is very simple. The label forwarding table is an array stored in memory and the label of the
incoming packet is the index to access this array. This implies that the lookup operation has an O(1) complexity
in contrast with other packet forwarding techniques. To ensure that on each node the packet label is an index in
the label forwarding table, each network node that forwards a packet replaces the label of the forwarded packet
with the label found in the label forwarding table. Each entry of the label forwarding table contains two pieces of
information :
• the outgoing interface for the packet
• the label for the outgoing packet
For example, consider the label forwarding table of a network node below.
index   outgoing interface   label
0       South                7
1       none                 none
2       West                 2
3       East                 2
If this node receives a packet with label=2, it forwards the packet on its West interface and sets the label of the
outgoing packet to 2. If the received packet's label is set to 3, then the packet is forwarded over the East interface
and the label of the outgoing packet is set to 2. If a packet is received with a label field set to 1, the packet is
discarded since the corresponding label forwarding table entry is invalid.
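The lookup and label swap described above can be sketched as follows. The table is the one from the example; representing an invalid entry by `None` and discarding labels that fall outside the table are choices made for this illustration.

```python
# Label forwarding table of the example: index -> (outgoing interface, outgoing label).
# None marks an invalid entry.
LABEL_TABLE = [
    ("South", 7),
    None,
    ("West", 2),
    ("East", 2),
]

def forward(label):
    """Return (outgoing interface, new label), or None if the packet is discarded."""
    if not 0 <= label < len(LABEL_TABLE):
        return None            # label outside the table: discard
    return LABEL_TABLE[label]  # direct array access; None means invalid entry

for label in (2, 3, 1):
    print(label, "->", forward(label))
```

The incoming label is used directly as an array index, which is what gives the lookup its O(1) complexity.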
Label switching enables a full control over the path followed by packets inside the network. Consider the network
below and assume that we want to use two virtual circuits : R1->R3->R4->R2->R5 and R2->R1->R3->R4->R5.
2 We will see later a more detailed description of Multiprotocol Label Switching, a networking technology that is capable of using one or
more labels.
[Figure: network of five routers R1, R2, R3, R4 and R5.]
To create these virtual circuits, we need to configure the label forwarding tables of all network nodes. For
simplicity, assume that a label forwarding table only contains two entries. Assume that R5 wants to receive the
packets from the virtual circuit created by R1 (resp. R2) with label=1 (resp. label=0). R4 could use the following label
forwarding table:
index   outgoing interface   label
0       ->R2                 1
1       ->R5                 0
Since a packet received with label=1 must be forwarded to R5 with label=1, R2's label forwarding table could
contain :
index   outgoing interface   label
0       none                 none
1       ->R5                 1
Two virtual circuits pass through R3. They both need to be forwarded to R4, but R4 expects label=1 for packets
belonging to the virtual circuit originated by R2 and label=0 for packets belonging to the other virtual circuit. R3
could choose to leave the labels unchanged.
index   outgoing interface   label
0       ->R4                 0
1       ->R4                 1
With the above label forwarding tables, R1 needs to originate the packets that belong to the R1->R3->R4->R2->R5
virtual circuit with label=0, since R3 leaves the labels unchanged and R4 forwards label=0 towards R2. The packets
received from R2 and belonging to the R2->R1->R3->R4->R5 virtual circuit must then use label=1 on the R1-R3 link.
Assuming that R2 sends these packets to R1 with label=0, R1's label forwarding table could be built as follows :
index   outgoing interface   label
0       ->R3                 1
1       none                 none
The figure below shows in red the path followed by the packets of the R1->R3->R4->R2->R5 virtual circuit, with the
label used in the packets on each arrow.
We will discuss later Multi-Protocol Label Switching (MPLS) as the example of a deployed networking technology
that relies on label switching. MPLS is more complex than the above description because it has been designed
to be easily integrated with datagram technologies. However, the principles remain. Asynchronous Transfer
Mode (ATM) and Frame Relay are other examples of technologies that rely on label switching.
Nowadays, most deployed networks rely on distributed algorithms, called routing protocols, to compute the forwarding tables that are installed on the network nodes. These distributed algorithms are part of the control plane.
They are usually implemented in software and are executed on the main CPU of the network nodes. There are two
main families of routing protocols : distance vector routing and link state routing. Both are capable of
autonomously discovering the network and reacting dynamically to topology changes.
When a router boots, it does not know any destination in the network and its routing table only contains itself. It
thus sends to all its neighbours a distance vector that contains only its address at a distance of 0. When a router
receives a distance vector on link l, it processes it as follows.
# V : received Vector
# l : link over which vector is received
# R : routing table of this router
def received(V, l):
    # received vector from link l
    for d in V:
        if d not in R:
            # new route
            R[d].cost = V[d].cost + l.cost
            R[d].link = l
            R[d].time = now
        else:
            # existing route, is the new better ?
            if ((V[d].cost + l.cost) < R[d].cost) or (R[d].link == l):
                # Better route or change to current route
                R[d].cost = V[d].cost + l.cost
                R[d].link = l
                R[d].time = now
The router iterates over all addresses included in the distance vector. If the distance vector contains an address
that the router does not know, it inserts the destination inside its routing table via link l and at a distance which is
the sum between the distance indicated in the distance vector and the cost associated to link l. If the destination
was already known by the router, it only updates the corresponding entry in its routing table if either :
• the cost of the new route is smaller than the cost of the already known route ( (V[d].cost+l.cost) < R[d].cost)
• the new route was learned over the same link as the current best route towards this destination ( R[d].link
== l)
The first condition ensures that the router discovers the shortest path towards each destination. The second condition is used to take into account the changes of routes that may occur after a link failure or a change of the metric
associated to a link.
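The update rule described above can be tried out with the following runnable sketch. It is one possible transcription, not a reference implementation: the `Route` class, the dictionary-based routing table and the link names are assumptions introduced for the example.

```python
import time
from dataclasses import dataclass

@dataclass
class Route:
    cost: int
    link: str
    time: float

def received(R, V, link, link_cost):
    """Update routing table R with the distance vector V received over link.

    R maps destination -> Route; V maps destination -> advertised cost.
    """
    now = time.time()
    for d, cost in V.items():
        new_cost = cost + link_cost
        if d not in R:
            R[d] = Route(new_cost, link, now)   # new route
        elif new_cost < R[d].cost or R[d].link == link:
            R[d] = Route(new_cost, link, now)   # better route, or change to current route

# Hypothetical router A with two links of cost 1.
R = {"A": Route(0, "local", time.time())}
received(R, {"B": 0, "C": 1}, link="to-B", link_cost=1)
received(R, {"D": 0, "C": 3}, link="to-D", link_cost=1)  # C via D is worse: ignored
received(R, {"B": 0, "C": 5}, link="to-B", link_cost=1)  # same link: cost of C is updated
print({d: (r.cost, r.link) for d, r in R.items()})
```

The third call shows the role of the second condition: the route to C becomes more expensive, yet it is still accepted because it arrives over the link currently used to reach C.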
To understand the operation of a distance vector protocol, let us consider the network of five routers shown below.
• E sends its distance vector [E=0,D=1,A=2,C=1] to D, B and C. B can now reach A, C, D and E
• B sends its distance vector [B=0,A=1,C=1,D=2,E=1] to A, C and E. A, B, C and E can now reach all
destinations.
• A sends its distance vector [A=0,B=1,C=2,D=1,E=2] to B and D.
At this point, all routers can reach all other routers in the network thanks to the routing tables shown in the figure
below.
This technique is called split-horizon. With this technique, the count to infinity problem would not have happened
in the above scenario, as router A would have advertised [A=0], since it learned all its other routes via router
D. Another variant called split-horizon with poison reverse is also possible. Routers using this variant advertise a
cost of ∞ for the destinations that they reach via the router to which they send the distance vector. This can be
implemented by using the pseudo-code below.
Every N seconds:
    for l in interfaces:
        # one vector for each interface
        v = Vector()
        for d in R:
            if R[d].link != l:
                # route not learned over l : advertise its cost
                v = v + Pair(d, R[d].cost)
            else:
                # route learned over l : advertise an infinite cost
                v = v + Pair(d, infinity)
        # end for d in R
        send(v, l)
    # end for l in interfaces
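The pseudo-code above can be turned into a small runnable sketch. The routing table layout (destination mapped to a cost and a link) and the link name are assumptions made for the example.

```python
import math

def build_vectors(R, interfaces):
    """Return one distance vector per interface, poisoning routes learned over it.

    R maps destination -> (cost, link); a vector maps destination -> advertised cost.
    """
    vectors = {}
    for l in interfaces:
        v = {}
        for d, (cost, link) in R.items():
            # Poison reverse: advertise an infinite cost on the link used to reach d.
            v[d] = math.inf if link == l else cost
        vectors[l] = v
    return vectors

# Hypothetical routing table of a router A that reaches all destinations via D.
R = {"A": (0, None), "D": (1, "to-D"), "C": (2, "to-D"), "B": (3, "to-D")}
vectors = build_vectors(R, ["to-D"])
print(vectors["to-D"])
```

Only the router's own address keeps a finite cost in the vector sent towards D, so D can never select this router as a next hop for destinations that this router itself reaches through D.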
Unfortunately, split-horizon is not sufficient to avoid all count to infinity problems with distance vector routing.
Consider the failure of link A-B in the network of four routers below.
• weight proportional to the propagation delay on the link. If all link weights are configured this way, shortest
path routing uses the paths with the smallest propagation delay.
• $weight = \frac{C}{bandwidth}$ where C is a constant larger than the highest link bandwidth in the network. If all link
weights are configured this way, shortest path routing prefers higher bandwidth paths over lower bandwidth
paths
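As a rough numerical illustration of the bandwidth-based setting, assuming the weight is the constant C divided by the link bandwidth, the sketch below computes the weights of three hypothetical links. The link names, the bandwidths and the value of C are invented for the example.

```python
def bandwidth_weights(bandwidths, C):
    """Return weight = C / bandwidth for each link (bandwidths in bit/s)."""
    if C <= max(bandwidths.values()):
        raise ValueError("C must be larger than the highest link bandwidth")
    return {link: C / bw for link, bw in bandwidths.items()}

# Hypothetical links; C is set to 100 Gbit/s.
links = {"R1-R2": 10e9, "R2-R3": 1e9, "R1-R3": 100e6}
print(bandwidth_weights(links, C=100e9))
```

The slowest link receives the largest weight, so a shortest path computation avoids it whenever a path over faster links has a smaller total weight.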
Usually, the same weight is associated to the two directed edges that correspond to a physical link (i.e. R1->R2
and R2->R1). However, nothing in the link state protocols requires this. For example, if the weight is set as a
function of the link bandwidth, then an asymmetric ADSL link could have a different weight for the upstream and
downstream directions. Other variants are possible. Some networks use optimisation algorithms to find the best
set of weights to minimize congestion inside the network for a given traffic demand [FRT2002].
When a link-state router boots, it first needs to discover to which routers it is directly connected. For this, each
router sends a HELLO message every N seconds on all of its interfaces. This message contains the router's
address. Each router has a unique address. As its neighbouring routers also send HELLO messages, the router
automatically discovers to which neighbours it is connected. These HELLO messages are only sent to neighbours
that are directly connected to a router, and a router never forwards the HELLO messages that it receives. HELLO
messages are also used to detect link and router failures. A link is considered to have failed if no HELLO message
has been received from the neighbouring router for a period of k × N seconds.
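Failure detection based on HELLO messages can be sketched as follows, assuming that the timeout is a small multiple k of the HELLO period N. The class, its method names and the values N=10 and k=3 are hypothetical and only serve the illustration.

```python
class NeighbourTable:
    """Track the time of the last HELLO heard from each neighbour (in seconds)."""

    def __init__(self, hello_interval, k):
        self.dead_interval = k * hello_interval
        self.last_hello = {}

    def hello_received(self, neighbour, now):
        self.last_hello[neighbour] = now

    def failed_links(self, now):
        """Neighbours from which no HELLO arrived during the last k*N seconds."""
        return sorted(n for n, t in self.last_hello.items()
                      if now - t > self.dead_interval)

table = NeighbourTable(hello_interval=10, k=3)  # N=10 s and k=3 are assumed values
table.hello_received("B", now=0)
table.hello_received("C", now=0)
table.hello_received("C", now=25)
print(table.failed_links(now=35))
```

A larger k tolerates the loss of a few HELLO messages, at the price of a slower detection of real failures.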
send(LSP,i)
else:
# LSP has already been flooded
In this pseudo-code, LSDB(r) returns the most recent LSP originating from router r that is stored in the LSDB.
newer(lsp1,lsp2) returns true if lsp1 is more recent than lsp2. See the note below for a discussion on how newer
can be implemented.
Note: Which is the most recent LSP ?
A router that implements flooding must be able to detect whether a received LSP is newer than the stored LSP.
This requires a comparison between the sequence number of the received LSP and the sequence number of the
LSP stored in the link state database. The ARPANET routing protocol [MRR1979] used a 6 bits sequence number
and implemented the comparison as follows (RFC 789) :
def newer(lsp1, lsp2):
    return (((lsp1.seq > lsp2.seq) and ((lsp1.seq - lsp2.seq) <= 32)) or
            ((lsp1.seq < lsp2.seq) and ((lsp2.seq - lsp1.seq) > 32)))
This comparison takes into account the modulo 2^6 arithmetic used to increment the sequence numbers. Intuitively,
the comparison divides the circle of all sequence numbers into two halves. Usually, the sequence number of the
received LSP is equal to the sequence number of the stored LSP incremented by one, but sometimes the sequence
numbers of two successive LSPs may differ, e.g. if one router has been disconnected from the network for some
time. The comparison above worked well until October 27, 1980. On this day, the ARPANET crashed completely.
The crash was complex and involved several routers. At one point, LSP 40 and LSP 44 from one of the routers
were stored in the LSDB of some routers in the ARPANET. As LSP 44 was the newest, it should have replaced
LSP 40 on all routers. Unfortunately, one of the ARPANET routers suffered from a memory problem and sequence
number 40 (101000 in binary) was replaced by 8 (001000 in binary) in the buggy router and flooded. Three LSPs
were present in the network and 44 was newer than 40 which is newer than 8, but unfortunately 8 was considered
to be newer than 44... All routers started to exchange these three link state packets forever and the only solution
to recover from this problem was to shut down the entire network (RFC 789).
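The cycle between the three sequence numbers can be checked directly with the comparison quoted above. The sketch below applies it to plain integers instead of LSP objects, which is a simplification made for the example.

```python
def newer(seq1, seq2):
    """Comparison of two 6-bit sequence numbers, as quoted above."""
    return ((seq1 > seq2 and seq1 - seq2 <= 32) or
            (seq1 < seq2 and seq2 - seq1 > 32))

# The three sequence numbers that were present in the network during the incident.
print(newer(44, 40), newer(40, 8), newer(8, 44))
```

All three comparisons hold at the same time, so no LSP is ever the newest one and each of them keeps replacing another.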
Current link state routing protocols usually use 32 bits sequence numbers and include a special mechanism for the
unlikely case that a sequence number reaches the maximum value (exhausting a 32 bits sequence number space takes
about 136 years if a link state packet is generated every second).
To deal with the memory corruption problem, link state packets contain a checksum. This checksum is computed
by the router that generates the LSP. Each router must verify the checksum when it receives or floods an LSP.
Furthermore, each router must periodically verify the checksums of the LSPs stored in its LSDB.
Flooding is illustrated in the figure below. By exchanging HELLO messages, each router learns its direct neighbours. For example, router E learns that it is directly connected to routers D, B and C. Its first LSP has sequence
number 0 and contains the directed links E->D, E->B and E->C. Router E sends its LSP on all its links and routers
D, B and C insert the LSP in their LSDB and forward it over their other links.
Flooding allows LSPs to be distributed to all routers inside the network without relying on routing tables. In the
example above, the LSP sent by router E is likely to be sent twice on some links in the network. For example,
routers B and C receive E's LSP at almost the same time and forward it over the B-C link. To avoid sending the
same LSP twice on each link, a possible solution is to slightly change the pseudo-code above so that a router waits
for some random time before forwarding an LSP on each link. The drawback of this solution is that the delay to
flood an LSP to all routers in the network increases. In practice, routers immediately flood the LSPs that contain
new information (e.g. addition or removal of a link) and delay the flooding of refresh LSPs (i.e. LSPs that contain
exactly the same information as the previous LSP originating from this router) [FFEB2005].
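The duplicate transmissions discussed above can be observed with a small simulation of flooding. This is a sketch under simplifying assumptions: the topology is a guess that is consistent with the neighbours named in the text, LSPs are delivered in first-in first-out order, and sequence numbers are compared as plain integers without wrap-around.

```python
from collections import Counter, deque

# Assumed topology, consistent with the neighbours named in the text.
NEIGHBOURS = {
    "A": ["B", "D"], "B": ["A", "C", "E"], "C": ["B", "E"],
    "D": ["A", "E"], "E": ["B", "C", "D"],
}

def flood(origin, seq):
    """Flood the LSP (origin, seq); return (routers storing it, copies sent per link)."""
    stored = {origin: seq}   # router -> newest sequence number stored for origin
    copies = Counter()
    queue = deque((origin, n) for n in NEIGHBOURS[origin])
    while queue:
        sender, receiver = queue.popleft()
        copies[frozenset((sender, receiver))] += 1
        if stored.get(receiver, -1) >= seq:
            continue         # LSP has already been flooded
        stored[receiver] = seq
        # Forward the new LSP on all links except the one it arrived on.
        queue.extend((receiver, n) for n in NEIGHBOURS[receiver] if n != sender)
    return stored, copies

stored, copies = flood("E", 0)
print(sorted(stored))
print({"-".join(sorted(link)): n for link, n in copies.items()})
```

In this run the LSP crosses the B-C link twice, once in each direction, which is the redundancy that the random delay is meant to reduce.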
To ensure that all routers receive all LSPs, even when there are transmission errors, link state routing protocols
use reliable flooding. With reliable flooding, routers use acknowledgements and if necessary retransmissions
to ensure that all link state packets are successfully transferred to all neighbouring routers. Thanks to reliable
flooding, all routers store in their LSDB the most recent LSP sent by each router in the network. By combining
the received LSPs with its own LSP, each router can compute the entire network topology.
Note: Static or dynamic link metrics ?
As link state packets are flooded regularly, routers are able to measure the quality (e.g. delay or load) of their
links and adjust the metric of each link according to its current quality. Such dynamic adjustments were included
in the ARPANET routing protocol [MRR1979] . However, experience showed that it was difficult to tune the
dynamic adjustments and ensure that no forwarding loops occur in the network [KZ1989]. Today's link state
routing protocols use metrics that are manually configured on the routers and are only changed by the network
operators or network management tools [FRT2002].
When a link fails, the two routers attached to the link detect the failure by the lack of HELLO messages received
in the last k × N seconds. Once a router has detected a local link failure, it generates and floods a new LSP that
no longer contains the failed link and the new LSP replaces the previous LSP in the network. As the two routers
attached to a link do not detect this failure exactly at the same time, some links may be announced in only one
direction. This is illustrated in the figure below. Router E has detected the failure of link E-B and flooded a new
LSP, but router B has not yet detected the failure.
It should be noted that link state routing assumes that all routers in the network have enough memory to store the entire LSDB. The
routers that do not have enough memory to store the entire LSDB cannot participate in link state routing. Some link state routing protocols
allow routers to report that they do not have enough memory and must be removed from the graph by the other routers in the network.
2.3 Applications
There are two important models used to organise a networked application. The first and oldest model is the client-server model. In this model, a server provides services to clients that exchange information with it. This model is
highly asymmetrical : clients send requests and servers perform actions and return responses. It is illustrated in
the figure below.
• Bob : 11:55
• Alice : Thank you
• Bob : You're welcome
Such a conversation succeeds if both Alice and Bob speak the same language. If Alice meets Tchang who only
speaks Chinese, she won't be able to ask him the current time. A conversation between humans can be more
complex. For example, assume that Bob is a security guard whose duty is to only allow trusted secret agents to
enter a meeting room. If all agents know a secret password, the conversation between Bob and Trudy could be as
follows :
• Bob : What is the secret password ?
• Trudy : 1234
• Bob : This is the correct password, you're welcome
If Alice wants to enter the meeting room but does not know the password, her conversation could be as follows :
• Bob : What is the secret password ?
• Alice : 3.1415
• Bob : This is not the correct password.
Human conversations can be very formal, e.g. when soldiers communicate with their hierarchy, or informal such
as when friends discuss. Computers that communicate are more akin to soldiers and require well-defined rules to
ensure a successful exchange of information. There are two types of rules that define how information can be
exchanged between computers :
• syntactical rules that precisely define the format of the messages that are exchanged. As computers only
process bits, the syntactical rules specify how information is encoded as bit strings
• organisation of the information flow. For many applications, the flow of information must be structured and
there are precedence relationships between the different types of information. In the time example above,
Alice must greet Bob before asking for the current time. Alice would not ask for the current time first and
greet Bob afterwards. Such precedence relationships exist in networked applications as well. For example,
a server must receive a username and a valid password before accepting more complex commands from its
clients.
Let us first discuss the syntactical rules. We will later explain how the information flow can be organised by
analysing real networked applications.
Application-layer protocols exchange two types of messages. Some protocols such as those used to support
electronic mail exchange messages expressed as strings or lines of characters. As the transport layer allows hosts
to exchange bytes, they need to agree on a common representation of the characters. The first and simplest method
to encode characters is to use the ASCII table. RFC 20 provides the ASCII table that is used by many protocols
on the Internet. For example, the table defines the following binary representations :
• A : 1000001b
• 0 : 0110000b
• z : 1111010b
• @ : 1000000b
• space : 0100000b
In addition, the ASCII table also defines several non-printable or control characters. These characters were designed to allow an application to control a printer or a terminal. These control characters include CR and LF, that
are used to terminate a line, and the Bell character which causes the terminal to emit a sound.
• carriage return (CR) : 0001101b
• line feed (LF) : 0001010b
• Bell: 0000111b
The ASCII characters are encoded as a seven-bit field, but transmitted as an eight-bit byte whose high order bit
is usually set to 0. Bytes are always transmitted starting from the high order or most significant bit.
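The binary representations listed above can be reproduced with Python's built-in `ord` and `format` functions. The helper name `ascii_bits` is introduced only for this illustration.

```python
def ascii_bits(ch, width=7):
    """Return the binary representation of an ASCII character on width bits."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not an ASCII character")
    return format(code, f"0{width}b")

for name, ch in [("0", "0"), ("z", "z"), ("@", "@"), ("CR", "\r"), ("LF", "\n")]:
    # seven-bit code, then the eight-bit byte with its high order bit set to 0
    print(name, ascii_bits(ch), ascii_bits(ch, width=8))
```

The eight-bit form is simply the seven-bit code preceded by a 0, which is the byte that is actually transmitted.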
Most applications exchange strings that are composed of fixed or variable numbers of characters. A common
solution to define the character strings that are acceptable is to define them as a grammar using a Backus-Naur
Form (BNF) such as the Augmented BNF defined in RFC 5234. A BNF is a set of production rules that generate
all valid character strings. For example, consider a networked application that uses two commands, where the
user can supply a username and a password. The BNF for this application could be defined as shown in the figure
below.
For example, the htonl(3) (resp. ntohl(3)) function of the standard C library converts a 32-bits unsigned integer from the byte order
used by the CPU to the network byte order (resp. from the network byte order to the CPU byte order). Similar functions exist in other
transfer SDUs. Connections usually provide one bidirectional stream supporting the exchange of SDUs between
the two users that are associated through the connection. This stream is used to transfer data during the second
phase of the connection called the data transfer phase. The third phase is the termination of the connection. Once
the users have finished exchanging SDUs, they ask the service provider to terminate the connection. As we
will see later, there are also some cases where the service provider may need to terminate a connection itself.
The establishment of a connection can be modelled by using four primitives : Connect.request, Connect.indication,
Connect.response and Connect.confirm. The Connect.request primitive is used to request the establishment of a
connection. The main parameter of this primitive is the address of the destination user. The service provider
delivers a Connect.indication primitive to inform the destination user of the connection attempt. If the destination
user agrees to establish the connection, it responds with a Connect.response primitive. At this point, the connection is considered to
be established and the destination user can start sending SDUs over the connection. The service provider processes
the Connect.response and will deliver a Connect.confirm to the user who initiated the connection. The delivery
of this primitive terminates the connection establishment phase. At this point, the connection is considered to be
open and both users can send SDUs. A successful connection establishment is illustrated below.
user. If each SDU contains a command, the receiving user can process each command as soon as it receives an
SDU.
[Figure: time-sequence diagram of a request-response exchange involving Host B, with the primitives DATA.ind(request), DATA.resp(response) and DATA.confirm(response).]
Figure 2.44: Interactions between the transport layer, its user, and its network layer provider
• an error detection mechanism that allows corrupted data to be detected
• a multiplexing technique that enables several applications running on one host to exchange information with
another host
To exchange data, the transport protocol encapsulates the SDU produced by its user inside a segment. The segment
is the unit of transfer of information in the transport layer. Transport layer entities always exchange segments.
When a transport layer entity creates a segment, this segment is encapsulated by the network layer into a packet
which contains the segment as its payload and a network header. The packet is then encapsulated in a frame to be
transmitted in the datalink layer.
A segment also contains control information, usually stored inside a header and the payload that comes from the
application. To detect transmission errors, transport protocols rely on checksums or CRCs like the datalink layer
protocols.
Compared to the connectionless network layer service, the transport layer service allows several applications
running on a host to exchange SDUs with several other applications running on remote hosts. Let us consider two
hosts, e.g. a client and a server. The network layer service allows the client to send information to the server,
but if an application running on the client wants to contact a particular application running on the server, then an
additional addressing mechanism is required other than the network layer address that identifies a host, in order to
differentiate the application running on a host. This additional addressing can be provided by using port numbers.
When a server application is launched on a host, it registers a port number. This port number will be used by the
clients to contact the server process.
The figure below shows a typical usage of port numbers. The client process uses port number 1234 while the
server process uses port number 5678. When the client sends a request, it is identified as originating from port
number 1234 on the client host and destined to port number 5678 on the server host. When the server process
replies to this request, the server's transport layer returns the reply as originating from port 5678 on the server host
and destined to port 1234 on the client host.
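The role of port numbers can be sketched with a simple demultiplexing table. This is a toy model and not the API of a real transport layer: the `listeners` table, the dictionary-based segments and the handler are assumptions made for the example.

```python
# Hypothetical table on the server host: port number -> application handler.
listeners = {}

def register(port, handler):
    """A server application registers its port number."""
    if port in listeners:
        raise OSError(f"port {port} already in use")
    listeners[port] = handler

def deliver(segment):
    """Demultiplex an incoming segment and build the reply segment, if any."""
    handler = listeners.get(segment["dst_port"])
    if handler is None:
        return None                       # no application registered on this port
    return {
        "src_port": segment["dst_port"],  # the reply originates from the server port
        "dst_port": segment["src_port"],  # and is destined to the client port
        "payload": handler(segment["payload"]),
    }

register(5678, lambda request: b"reply to " + request)
reply = deliver({"src_port": 1234, "dst_port": 5678, "payload": b"hello"})
print(reply)
```

The reply swaps the two port numbers of the request, as in the exchange between ports 1234 and 5678 described above.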
To support the connection-oriented service, the transport layer needs to include several mechanisms to enrich the
connectionless network-layer service. We discuss these mechanisms in the following sections.
Connection establishment
Like the connectionless service, the connection-oriented service allows several applications running on a given
host to exchange data with other hosts. The port numbers described above for the connectionless service are also
used by the connection-oriented service to multiplex several applications. Similarly, connection-oriented protocols
use checksums/CRCs to detect transmission errors and discard segments containing an invalid checksum/CRC.
An important difference between the connectionless service and the connection-oriented one is that the transport
Data transfer
Now that the transport connection has been established, it can be used to transfer data. To ensure a reliable delivery
of the data, the transport protocol will include sliding windows, retransmission timers and go-back-n or selective
repeat. However, we cannot simply reuse the techniques from the datalink layer because a transport protocol needs to
deal with more types of errors than a reliable protocol in the datalink layer. The first difference between the two layers
is that the transport layer must cope with more variable delays. In the datalink layer, when two hosts are connected
by a link, the transmission delay or the round-trip-time over the link is almost fixed. In a network that can span
the globe, the delays and the round-trip-times can vary significantly on a per packet basis. This variability can
be caused by two factors. First, packets sent through a network do not necessarily follow the same path to reach
their destination. Second, some packets may be queued in the buffers of routers when the load is high and these
queueing delays can lead to increased end-to-end delays. A second difference between the datalink layer and the
transport layer is that a network does not always deliver packets in sequence. This implies that packets may be
reordered by the network. Furthermore, the network may sometimes duplicate packets. The last issue that needs
to be dealt with in the transport layer is the transmission of large SDUs. In the datalink layer, reliable protocols
transmit small frames. Applications could generate SDUs that are much larger than the maximum size of a packet
in the network layer. The transport layer needs to include mechanisms to fragment and reassemble these large
SDUs.
To deal with all these characteristics of the network layer, we need to adapt the techniques that we have introduced
in the datalink layer.
The first point which is common between the two layers is that both use CRCs or checksums to detect transmission
errors. Each segment contains a CRC/checksum which is computed over the entire segment (header and payload)
by the sender and inserted in the header. The receiver recomputes the CRC/checksum for each received segment
and discards all segments with an invalid CRC/checksum.
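The sender and receiver sides of this check can be sketched with the CRC-32 function of Python's `zlib` module. The sketch is deliberately simplified: the CRC is appended after the payload instead of being stored in the header, and the header bytes are arbitrary.

```python
import zlib

def make_segment(header, payload):
    """Sender: compute a CRC-32 over header and payload and append it."""
    body = header + payload
    return body + zlib.crc32(body).to_bytes(4, "big")

def is_valid(segment):
    """Receiver: recompute the CRC-32 and compare it with the received one."""
    body, received_crc = segment[:-4], segment[-4:]
    return zlib.crc32(body).to_bytes(4, "big") == received_crc

segment = make_segment(b"\x04\xd2\x16\x2e", b"abcde")  # arbitrary header bytes
corrupted = bytes([segment[0] ^ 0x01]) + segment[1:]   # flip a single bit
print(is_valid(segment), is_valid(corrupted))
```

A receiver following the rule above would deliver the first segment and silently discard the corrupted one.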
Reliable transport protocols also use sequence numbers and acknowledgement numbers. While reliable protocols
in the datalink layer use one sequence number per frame, reliable transport protocols consider all the data transmitted as a stream of bytes. In these protocols, the sequence number placed in the segment header corresponds to
the position of the first byte of the payload in the bytestream. This sequence number allows the receiver to detect losses but
also enables the receiver to reorder the out-of-sequence segments. This is illustrated in the figure below.
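[Figure: Host A issues DATA.req(abcde), which is carried in segment 1:abcde and delivered on Host B as DATA.ind(abcde). Host A then issues DATA.req(fghijkl), which is carried in segment 6:fghijkl and delivered as DATA.ind(fghijkl).]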
Using sequence numbers to count bytes also has an advantage when the transport layer needs to fragment SDUs
into several segments. The figure below shows the fragmentation of a large SDU into two segments. Upon reception
of the segments, the receiver will use the sequence numbers to correctly reorder the data.
[Figure: Host A issues DATA.req(abcdefghijkl), which is sent as two segments, 1:abcde and 6:fghijkl; Host B reassembles them and delivers DATA.ind(abcdefghijkl).]
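The fragmentation and reordering described above can be sketched in a few lines of Python. This is only an illustration, not the algorithm of any particular protocol: the function names `segment` and `reassemble`, the `max_payload` parameter and the choice of 1 as the first sequence number are assumptions made for this example.

```python
def segment(data: bytes, first_seq: int, max_payload: int):
    """Split an SDU into (sequence number, payload) segments.

    The sequence number is the position of the first payload byte
    in the byte stream (hypothetical helper for this sketch).
    """
    return [(first_seq + i, data[i:i + max_payload])
            for i in range(0, len(data), max_payload)]


def reassemble(segments, first_seq: int) -> bytes:
    """Rebuild the byte stream from segments received in any order."""
    stream = bytearray()
    expected = first_seq
    # set() removes exact duplicates, sorted() restores the byte order
    for seq, payload in sorted(set(segments)):
        if seq != expected:
            # a gap in the sequence numbers reveals a missing segment
            raise ValueError(f"missing data: expected byte {expected}, got {seq}")
        stream += payload
        expected += len(payload)
    return bytes(stream)


segments = segment(b"abcdefghijkl", first_seq=1, max_payload=5)
# segments == [(1, b"abcde"), (6, b"fghij"), (11, b"kl")]
received = list(reversed(segments)) + [segments[0]]  # reordered, one duplicate
sdu = reassemble(received, first_seq=1)              # b"abcdefghijkl"
```

Because each segment carries the position of its first byte, the receiver needs no other information to put the payloads back in order or to notice that a part of the stream is missing.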
Compared to reliable protocols in the datalink layer, reliable transport protocols encode their sequence numbers
in more bits. 32-bit and 64-bit sequence numbers are frequent in the transport layer while some datalink layer
protocols encode their sequence numbers in an 8-bit field. This large sequence number space is motivated by two
reasons. First, since the sequence number is incremented for each transmitted byte, a single segment may consume
one or several thousands of sequence numbers. Second, a reliable transport protocol must be able to detect delayed
segments. This can only be done if the number of bytes transmitted during the MSL period is smaller than the
sequence number space. Otherwise, there is a risk of accepting duplicate segments.
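The second reason can be turned into a small calculation. The sketch below assumes byte-based sequence numbers and an MSL of 120 seconds (the value that RFC 793 assumes for the Internet); the function name `max_rate_bps` is introduced only for this example.

```python
def max_rate_bps(seq_bits: int, msl_seconds: float) -> float:
    """Highest rate (bits per second) at which a sender can transmit
    without reusing a byte sequence number within one MSL."""
    sequence_space = 2 ** seq_bits          # one sequence number per byte
    return sequence_space * 8 / msl_seconds


MSL = 120.0                                 # assumption: 2 minutes
rate_32 = max_rate_bps(32, MSL)             # about 286 Mbps
rate_64 = max_rate_bps(64, MSL)             # far beyond any current link
```

With these assumptions, a 32-bit sequence number space is exhausted in less than one MSL as soon as the sender exceeds roughly 286 Mbps, which is why larger sequence number spaces are attractive on fast networks.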
Go-back-n and selective repeat can be used in the transport layer as in the datalink layer. Since the network layer
does not guarantee an in-order delivery of the packets, a transport entity should always store the segments that
it receives out-of-sequence. For this reason, most transport protocols will opt for some form of selective repeat
mechanism.
In the datalink layer, the sliding window usually has a fixed size which depends on the amount of buffers allocated
to the datalink layer entity. Such a datalink layer entity usually serves one or a few network layer entities. In
the transport layer, the situation is different. A single transport layer entity serves a large and varying number of
application processes. Each transport layer entity manages a pool of buffers that needs to be shared between all
these processes. Transport entities are usually implemented inside the operating system kernel and share memory
with other parts of the system. Furthermore, a transport layer entity must support several (possibly hundreds or
thousands of) transport connections at the same time. This implies that the memory which can be used to support
the sending or the receiving buffer of a transport connection may change during the lifetime of the connection 5 .
Thus, a transport protocol must allow the sender and the receiver to adjust their window sizes.
To deal with this issue, transport protocols allow the receiver to advertise the current size of its receiving window
in all the acknowledgements that it sends. The receiving window advertised by the receiver bounds the size of
the sending buffer used by the sender. In practice, the sender maintains two state variables : swin, the size of its
sending window (that may be adjusted by the system) and rwin, the size of the receiving window advertised by
the receiver. At any time, the number of unacknowledged segments cannot be larger than min(swin,rwin) 6 . The
utilisation of dynamic windows is illustrated in the figure below.
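The rule on min(swin,rwin) can also be written as a short Python sketch. The function name `usable_window` and the idea of counting segments with a plain integer are simplifications chosen for this example.

```python
def usable_window(swin: int, rwin: int, unacked: int) -> int:
    """Number of additional segments the sender may transmit now.

    swin:    size of the sending window (may be adjusted by the system)
    rwin:    last receiving window advertised by the receiver
    unacked: segments sent but not yet acknowledged
    """
    return max(0, min(swin, rwin) - unacked)


# The receiver advertises a window of 4 while the sender could use 8.
before = usable_window(swin=8, rwin=4, unacked=3)   # 1 more segment allowed
# The receiver then shrinks its window to 2: the sender must stop.
after = usable_window(swin=8, rwin=2, unacked=3)    # 0
```

Each acknowledgement updates rwin, so the value returned by such a function changes during the lifetime of the connection.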
5 For a discussion on how the sending buffer can change, see e.g. [SMM1998].
6 Note that if the receive window shrinks, it might happen that the sender has already sent a segment that is no longer inside its window.
This segment will be discarded by the receiver and the sender will retransmit it later.
the network layer to enforce a Maximum Segment Lifetime (MSL). The network layer must ensure that no packet
remains in the network for more than MSL seconds. In the Internet, the MSL is assumed to be 2 minutes (RFC
793). Note that this limits the maximum bandwidth of a transport protocol. If it uses n bits to encode its sequence
numbers, then it cannot send more than $2^n$ segments every MSL seconds.
Connection release
When we discussed the connection-oriented service, we mentioned that there are two types of connection releases
: abrupt release and graceful release.
The first solution to release a transport connection is to define a new control segment (e.g. the DR segment) and
consider the connection to be released once this segment has been sent or received. This is illustrated in the figure
below.
the hosts.txt file. In 1984, the .gov, .edu, .com, .mil and .org generic top-level domain names were added and
RFC 1032 proposed the utilisation of the two letter ISO-3166 country codes as top-level domain names. Since
ISO-3166 defines a two letter code for each country recognised by the United Nations, this allowed all countries
to automatically have a top-level domain. These domains include .be for Belgium, .fr for France, .us for the USA,
.ie for Ireland, .tv for Tuvalu (a group of small islands in the Pacific) and .tm for Turkmenistan. Today, the set
of top-level domain names is managed by the Internet Corporation for Assigned Names and Numbers (ICANN).
Recently, ICANN added a dozen generic top-level domains that are not related to a country and the .cat top-level
domain has been registered for the Catalan language. There are ongoing discussions within ICANN to increase
the number of top-level domains.
Each top-level domain is managed by an organisation that decides how sub-domain names can be registered. Most
top-level domain names use a first-come, first-served system, and allow anyone to register domain names, but
there are some exceptions. For example, .gov is reserved for the US government, .int is reserved for international
organisations and names in the .ca domain are mainly reserved for companies or users who are present in Canada.
only on addresses. In this case, the server process would be installed on one host and the clients would connect to
this server to retrieve information. Such a deployment has several drawbacks :
• if the server process moves to another physical server, all clients must be informed about the new server
address
• if there are many concurrent clients, the load of the server will increase without any possibility of adding
another server without changing the server addresses used by the clients
Using names solves these problems and provides additional benefits. If clients are configured with the name of the
server, they will query the name service before connecting to the server. The name service will resolve the name
into the corresponding address. If a server process needs to move from one physical server to another, it suffices
to update the name to address mapping of the name service to allow all clients to connect to the new server. The
name service also enables the servers to better sustain the load. Assume a very popular server which is accessed
by millions of users. This service cannot be provided by a single physical server due to performance limitations.
Thanks to the utilisation of names, it is possible to scale this service by mapping a given name to a set of addresses.
When a client queries the name service for the server's name, the name service returns one of the addresses in the
set. Various strategies can be used to select one particular address inside the set of addresses. A first strategy is to
select a random address in the set. A second strategy is to maintain information about the load on the servers and
return the address of the least loaded server. Note that the list of server addresses does not need to remain fixed. It
is possible to add and remove addresses from the list to cope with load fluctuations. Another strategy is to infer
the location of the client from the name request and return the address of the closest server.
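The first two selection strategies can be illustrated with a toy name service. This sketch is not the interface of any real name service: the class `NameService`, its methods, the name `www.example.org` and the documentation addresses are all assumptions made for the example.

```python
import random


class NameService:
    """Toy name service mapping one name onto a set of addresses."""

    def __init__(self, rng=None):
        self.servers = {}                     # name -> {address: load}
        self.rng = rng or random.Random()

    def register(self, name, address, load=0.0):
        self.servers.setdefault(name, {})[address] = load

    def unregister(self, name, address):
        self.servers[name].pop(address, None)

    def resolve_random(self, name):
        # first strategy: any address of the set, chosen at random
        return self.rng.choice(list(self.servers[name]))

    def resolve_least_loaded(self, name):
        # second strategy: the address whose reported load is the lowest
        loads = self.servers[name]
        return min(loads, key=loads.get)


ns = NameService()
ns.register("www.example.org", "2001:db8::1", load=0.9)
ns.register("www.example.org", "2001:db8::2", load=0.2)
best = ns.resolve_least_loaded("www.example.org")    # "2001:db8::2"
```

Clients only ever use the name, so addresses can be registered or removed while the service is running without reconfiguring any client.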
Mapping a single name onto a set of addresses allows popular servers to scale dynamically. There are also benefits
in mapping multiple names, possibly a large number of them, onto a single address. Consider the case of information servers run by individuals or SMEs. Some of these servers attract only a few clients per day. Using a single
physical server for each of these services would be a waste of resources. A better approach is to use a single server
for a set of services that are all identified by different names. This enables service providers to host a large
number of servers, identified by different names, on a single physical server. If one of these servers becomes
very popular, it will be possible to map its name onto a set of addresses to be able to sustain the load. There are
some deployments where this mapping is done dynamically as a function of the load.
Names provide a lot of flexibility compared to addresses. For the network, they play a similar role as variables
in programming languages. No programmer using a high-level programming language would consider using
addresses instead of variables. For the same reasons, all networked applications should depend on names and
avoid dealing with addresses as much as possible.
attached to the ring. From a redundancy point of view, a single ring is not the best solution, as the signal only
travels in one direction on the ring; thus if one of the links composing the ring is cut, the entire network fails. In
practice, such rings have been used in local area networks, but are now often replaced by star-shaped networks.
In metropolitan networks, rings are often used to interconnect multiple locations. In this case, two parallel links,
composed of different cables, are often used for redundancy. With such a dual ring, when one ring fails all the
traffic can be quickly switched to the other ring.
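[Figure: example network with hosts A1, B1, C1, A2, B2 and C2 and routers R1, R2 and R3. Link R1-R2 has a bandwidth of 10 Mbps and link R2-R3 a bandwidth of 20 Mbps.]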
In large networks, fairness is always a compromise. The most widely used definition of fairness is the max-min
fairness. A bandwidth allocation in a network is said to be max-min fair if it is such that it is impossible to
allocate more bandwidth to one of the flows without reducing the bandwidth of a flow that already has a smaller
allocation than the flow that we want to increase. If the network is completely known, it is possible to derive a
max-min fair allocation as follows. Initially, all flows have zero bandwidth and are placed in the candidate
set. The bandwidth allocation of all flows in the candidate set is increased until one link becomes congested. At
this point, the flows that use the congested link have reached their maximum allocation. They are removed from
the candidate set and the process continues until the candidate set becomes empty.
In the above network, the allocation of all flows would grow until A1-A2 and B1-B2 reach 5 Mbps. At this point,
link R1-R2 becomes congested and these two flows have reached their maximum. The allocation for flow C1-C2
can increase until reaching 15 Mbps. At this point, link R2-R3 is congested. To increase the bandwidth allocated
to C1-C2, one would need to reduce the allocation to flow B1-B2. Similarly, the only way to increase the allocation
to flow B1-B2 would require a decrease of the allocation to A1-A2.
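The procedure described above can be sketched in Python. The function name `max_min_fair` and the data structures are choices made for this illustration, and the routes of the three flows are inferred from the discussion (A1-A2 uses link R1-R2, B1-B2 uses both links and C1-C2 uses link R2-R3). The sketch also assumes that every flow crosses at least one link.

```python
def max_min_fair(capacity, routes):
    """capacity: link -> bandwidth; routes: flow -> list of links used.
    Returns flow -> max-min fair allocation."""
    alloc = {flow: 0.0 for flow in routes}
    remaining = dict(capacity)
    candidates = set(routes)
    while candidates:
        # number of candidate flows crossing each link
        users = {link: sum(1 for f in candidates if link in routes[f])
                 for link in remaining}
        # largest equal increase before some link becomes congested
        step = min(remaining[l] / n for l, n in users.items() if n > 0)
        for f in candidates:
            alloc[f] += step
        for link, n in users.items():
            remaining[link] -= step * n
        congested = {l for l, n in users.items()
                     if n > 0 and remaining[l] <= 1e-9}
        # flows crossing a congested link have reached their maximum
        candidates -= {f for f in candidates
                       if any(l in congested for l in routes[f])}
    return alloc


capacity = {"R1-R2": 10, "R2-R3": 20}                 # Mbps
routes = {"A1-A2": ["R1-R2"],
          "B1-B2": ["R1-R2", "R2-R3"],
          "C1-C2": ["R2-R3"]}
allocation = max_min_fair(capacity, routes)
# {"A1-A2": 5.0, "B1-B2": 5.0, "C1-C2": 15.0}
```

The result matches the allocation derived in the text: 5 Mbps for A1-A2 and B1-B2, and 15 Mbps for C1-C2.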
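[Figure: hosts A and B are attached to node R1, which is connected through the R1-R2 link to node R2 and destination C.]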
In the network above, consider the case where host A is transmitting packets to destination C. A can send one
packet per second and its packets will be delivered to C. Now, let us explore what happens when host B also starts
to transmit a packet. Node R1 will receive two packets that must be forwarded to R2. Unfortunately, due to the
limited bandwidth on the R1-R2 link, only one of these two packets can be transmitted. The outcome of the second
packet will depend on the available buffers on R1. If R1 has one available buffer, it could store the packet that
has not been transmitted on the R1-R2 link until the link becomes available. If R1 does not have available buffers,
then the packet needs to be discarded.
Besides the link bandwidth, the buffers on the network nodes are the second type of resource that needs to be
shared inside the network. The node buffers play an important role in the operation of the network because they
can be used to absorb transient traffic peaks. Consider again the example above. Assume that, on average, host
A and host B each send a group of three packets every ten seconds. Their combined transmission rate (0.6 packets
per second) is, on average, lower than the network capacity (1 packet per second). However, if they both start
to transmit at the same time, node R1 will have to absorb a burst of packets. This burst of packets is a small
network congestion. We will say that a network is congested when the sum of the traffic demand from the hosts
is larger than the network capacity: $\sum \mathit{demand} > \mathit{capacity}$. This network congestion problem is one of the most
difficult resource sharing problems in computer networks. Congestion occurs in almost all networks. Minimizing
the amount of congestion is a key objective for many network operators. In most cases, they will have to accept
transient congestion, i.e. congestion lasting a few seconds or perhaps minutes, but will want to prevent congestion
that lasts days or months. For this, they can rely on a wide range of solutions. We briefly present some of these in
the paragraphs below.
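The effect of the buffers of R1 on such a burst can be explored with a very small simulation. The model is deliberately crude and rests on assumptions that are not in the text: time advances in one-second steps, packets arriving during a step are stored before one packet is transmitted, and the buffer size counts the packet about to be transmitted. The function name `simulate_r1` is hypothetical.

```python
def simulate_r1(arrivals, buffer_size):
    """arrivals[t] is the number of packets reaching R1 during second t.
    The R1-R2 link transmits one packet per second.
    Returns (delivered, dropped, max_queue)."""
    queue = delivered = dropped = max_queue = 0
    for arriving in arrivals:
        for _ in range(arriving):
            if queue < buffer_size:
                queue += 1           # the packet is stored in a buffer
            else:
                dropped += 1         # no buffer available: discard
        max_queue = max(max_queue, queue)
        if queue > 0:                # one transmission opportunity
            queue -= 1
            delivered += 1
    return delivered + queue, dropped, max_queue


# A and B both start a group of three packets at the same time,
# each sending one packet per second.
burst = [2, 2, 2, 0, 0, 0, 0, 0, 0, 0]
large = simulate_r1(burst, buffer_size=4)    # (6, 0, 4)
small = simulate_r1(burst, buffer_size=2)    # (4, 2, 2)
```

With four buffers the burst is absorbed and all six packets are delivered, at the cost of additional delay. With only two buffers, two packets are discarded although the average load (0.6 packets per second) is below the link capacity.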
If R1 has enough buffers, it will be able to absorb the load without having to discard packets. The packets sent by
hosts A and B will reach their final destination C, but will experience a longer delay than when they are transmitting
alone. The amount of buffering on the network node is the first parameter that a network operator can tune to control
congestion inside the network. Given the decreasing cost of memory, one could be tempted to put as many buffers
as possible on the network nodes 14 . Let us consider this case in the network above and assume that R1 has infinite
buffers. Assume now that hosts A and B try to transmit a file that corresponds to one thousand packets each.
Both are using a reliable protocol that relies on go-back-n to recover from transmission errors. The transmission
starts and packets start to accumulate in R1's buffers. The presence of these packets in the buffers increases the
delay between the transmission of a packet by A and the return of the corresponding acknowledgement. Given the
increasing delay, host A (and B as well) will consider that some of the packets that it sent have been lost. These
packets will be retransmitted and will enter the buffers of R1. The occupancy of the buffers of R1 will continue
to increase and the delays as well. This will cause new retransmissions, and so on. In the end, several copies of the same
packet will be transmitted over the R1-R2 link, but only one file will be delivered (very slowly) to the destination.
This is known as the congestion collapse problem (RFC 896). Congestion collapse is a nightmare for network
operators. When it happens, the network carries packets without delivering useful data to the end users.
14 There are still some vendors that try to put as many buffers as possible on their network nodes. A recent example is the buffer bloat
problem that plagues some low-end Internet routers [GN2011].
few packets inside the buffer will cause a small variation in the delay which may not necessarily be larger than the
natural fluctuations of the delay measurements.
If the buffer's occupancy continues to grow, the buffer will overflow and packets will need to be discarded. Discarding
packets during congestion is the second possible reaction of a network node to congestion. Before looking at
how a node can discard packets, it is interesting to discuss qualitatively the impact of the buffer occupancy on the
reliable delivery of data through a network. This is illustrated by the figure below, adapted from [Jain1990].
have two advantages. First, it already stayed a long time in the buffer. Second, hosts should be able to detect
the loss (and thus the congestion) earlier.
• probabilistic drop. Various random drop techniques have been proposed. A frequently cited technique is Random Early Detection (RED) [FJ1993]. RED measures the average
buffer occupancy and probabilistically discards packets when this average occupancy is too high. Compared to tail drop and drop from front, an advantage of RED is that, thanks to the probabilistic drops, packets
should be discarded from different flows in proportion to their bandwidth.
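The idea behind probabilistic drop can be sketched as follows. This is a simplified illustration in the spirit of RED and not the complete algorithm of [FJ1993]: it keeps a moving average of the buffer occupancy and uses a linear drop probability between two thresholds. The class name `SimpleRed` and all parameter values are assumptions chosen for the example.

```python
import random


class SimpleRed:
    """Simplified probabilistic drop based on the average occupancy."""

    def __init__(self, min_th, max_th, max_p=0.1, weight=0.002, rng=None):
        self.min_th = min_th      # below this average: never drop
        self.max_th = max_th      # above this average: always drop
        self.max_p = max_p        # drop probability just below max_th
        self.weight = weight      # weight of the newest sample
        self.avg = 0.0            # moving average of the occupancy
        self.rng = rng or random.Random()

    def drop_probability(self):
        if self.avg < self.min_th:
            return 0.0
        if self.avg >= self.max_th:
            return 1.0
        return self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)

    def on_arrival(self, queue_len):
        """Update the average and return True if the packet is dropped."""
        self.avg = (1 - self.weight) * self.avg + self.weight * queue_len
        return self.rng.random() < self.drop_probability()


red = SimpleRed(min_th=5, max_th=15, max_p=0.1, weight=1.0)  # weight=1: no smoothing
red.on_arrival(10)
p = red.drop_probability()        # 0.05 when the average occupancy is 10
```

A flow that sends more packets is examined more often by `on_arrival` and therefore loses more packets, which is the proportionality property mentioned above.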
Discarding packets is a frequent reaction to network congestion. Unfortunately, discarding packets is not optimal
since a packet which is discarded on a network node has already consumed resources on the upstream nodes.
There are other ways for the network to inform the end hosts of the current congestion level. A first solution is to
mark the packets when a node is congested. Several networking technologies have relied on this kind of packet
marking.
In datagram networks, Forward Explicit Congestion Notification (FECN) can be used. One field of the packet
header, typically one bit, is used to indicate congestion. When a host sends a packet, the congestion bit is reset.
If the packet passes through a congested node, the congestion bit is set. The destination can then determine the
current congestion level by measuring the fraction of the packets that it received with the congestion bit set. It may
then return this information to the sending host to allow it to adapt its transmission rate. Compared to packet
discarding, the main advantage of FECN is that hosts can detect congestion explicitly without having to rely on
packet losses.
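A minimal sketch of this marking scheme is shown below. The function names `deliver` and `marked_fraction` are hypothetical, and each path is modelled simply as a list of booleans telling whether each node on the path is congested.

```python
def deliver(congested_nodes):
    """Return the congestion bit seen by the destination for one packet."""
    bit = False                      # the sending host resets the bit
    for congested in congested_nodes:
        if congested:
            bit = True               # a congested node sets the bit
    return bit


def marked_fraction(bits):
    """Fraction of the received packets that carry the congestion bit."""
    bits = list(bits)
    return sum(bits) / len(bits) if bits else 0.0


paths = [[False, False], [False, True], [True, True], [False, False]]
received_bits = [deliver(path) for path in paths]
level = marked_fraction(received_bits)      # 0.5
```

The destination would report a value such as `level` to the sending host, which can then adapt its rate without waiting for a packet loss.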
In virtual circuit networks, packet marking can be improved if the return packets follow the reverse path of the
forward packets. In this case, a network node can detect congestion on the forward path (e.g. due to the size of its
buffer), but mark the packets on the return path. Marking the return packets (e.g. the acknowledgements used by
reliable protocols) provides a faster feedback to the sending hosts compared to FECN. This technique is usually
called Backward Explicit Congestion Notification (BECN).
If the packet header does not contain any bit to represent the current congestion level, an alternative
is to allow the network nodes to send a control packet to the source to indicate the current congestion level. Some
networking technologies use such control packets to explicitly regulate the transmission rate of sources. However,
their usage is mainly restricted to small networks. In large networks, network nodes usually avoid using such
control packets. These control packets are even considered to be dangerous in some networks. First, using
them increases the network load when the network is congested. Second, while network nodes are optimized to
forward packets, they are usually pretty slow at creating new packets.
Dropping and marking packets are not the only possible reactions of a router that becomes congested. A router
could also selectively delay packets belonging to some flows. There are different algorithms that can be used by a
router to delay packets. If the objective of the router is to fairly distribute the bandwidth of an output link among
competing flows, one possibility is to organize the buffers of the router as a set of queues. For simplicity, let us
assume that the router is capable of supporting a fixed number of concurrent flows, say N. One of the queues of the
router is associated with each flow and when a packet arrives, it is placed at the tail of the corresponding queue. All
the queues are controlled by a scheduler. A scheduler is an algorithm that is run each time there is an opportunity
to transmit a packet on the outgoing link. Various schedulers have been proposed in the scientific literature and
some are used in real routers.
A very simple scheduler is the round-robin scheduler. This scheduler serves all the queues in a round-robin
fashion. If all flows send packets of the same size, then the round-robin scheduler allocates the bandwidth fairly
among the different flows. Otherwise, it favors flows that are using larger packets. Extensions to the round-robin
scheduler have been proposed to provide a fair distribution of the bandwidth with variable-length packets
[SV1995] but these are outside the scope of this chapter.
# N queues
# state variable : next_queue
# isEmpty(), wait(), remove_packet() and send() are abstract helpers
next_queue = 0
while True:
    if isEmpty(buffer):
        # wait for next packet in buffer
        wait()
    if not isEmpty(queue[next_queue]):
        # Send packet at head of next_queue
        p = remove_packet(queue[next_queue])
        send(p)
    next_queue = (next_queue + 1) % N
# end while
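The bias of the round-robin scheduler towards flows that use larger packets can be checked with a small runnable variant. In this sketch, which is only an illustration, each queue is a `collections.deque` of packet sizes in bytes and the function `round_robin` (a hypothetical name) stops after a given number of transmissions instead of running forever.

```python
from collections import deque


def round_robin(queues, transmissions):
    """Serve the queues in round-robin order.
    Returns the number of bytes sent for each queue."""
    sent = [0] * len(queues)
    next_queue = 0
    done = 0
    while done < transmissions and any(queues):
        if queues[next_queue]:
            # send the packet at the head of next_queue
            sent[next_queue] += queues[next_queue].popleft()
            done += 1
        next_queue = (next_queue + 1) % len(queues)
    return sent


large_packets = deque([1500] * 100)     # flow using 1500-byte packets
small_packets = deque([500] * 100)      # flow using 500-byte packets
shares = round_robin([large_packets, small_packets], transmissions=100)
# shares == [75000, 25000]
```

Both flows get the same number of transmission opportunities, but the flow with 1500-byte packets obtains three times more bandwidth than the flow with 500-byte packets.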
A first approach is to store the file on servers whose name is known by the clients. Before retrieving the file, each
client will query the name service to obtain the address of the server. If the file is available from many servers,
the name service can provide different addresses to different clients. This will automatically spread the load since
different clients will download the file from different servers. Most large content providers use such a solution to
distribute large files or videos.
There is another solution that allows the load to be spread among many sources without relying on the name service.
The popular BitTorrent service is an example of this approach. With this solution, each file is divided in blocks of a
fixed size. To retrieve a file, a client needs to retrieve all the blocks that compose the file. However, nothing forces
the client to retrieve all the blocks in sequence and from the same server. Each file is associated with metadata
that indicates for each block a list of addresses of hosts that store this block. To retrieve a complete file, a client
first downloads the metadata. Then, it tries to retrieve each block from one of the hosts that store the block. In
practice, implementations often try to download several blocks in parallel. Once one block has been successfully
downloaded, the next block can be requested. If a host is slow to provide one block or becomes unavailable,
the client can contact another host listed in the metadata. Most deployments of BitTorrent allow the clients to
participate in the distribution of blocks. Once a client has downloaded one block, it contacts the server which
stores the metadata to indicate that it can also provide this block. With this scheme, when a file is popular, its
blocks are downloaded by many hosts that automatically participate in the distribution of the blocks. Thus, the
number of servers that are capable of providing blocks from a popular file automatically increases with the file's
popularity.
Now that we have provided a broad overview of the techniques that can be used to spread the load and allocate
resources in the network, let us analyze two techniques in more detail : Medium Access Control and Congestion
control.
given frequency. The radio spectrum corresponds to frequencies ranging between roughly 3 kHz and 300 GHz.
Frequency allocation plans negotiated among governments reserve most frequency ranges for specific applications
such as broadcast radio, broadcast television, mobile communications, aeronautical radio navigation, amateur radio, satellite, etc. Each frequency range is then subdivided into channels and each channel can be reserved for a
given application, e.g. a radio broadcaster in a given region.
Frequency Division Multiplexing (FDM) is a static allocation scheme in which a frequency is allocated to each
device attached to the shared medium. As each device uses a different transmission frequency, collisions cannot
occur. In optical networks, a variant of FDM called Wavelength Division Multiplexing (WDM) can be used. An
optical fiber can transport light at different wavelengths without interference. With WDM, a different wavelength
is allocated to each of the devices that share the same optical fiber.
Time Division Multiplexing (TDM) is a static bandwidth allocation method that was initially defined for the telephone network. In the fixed telephone network, a voice conversation is usually transmitted as a 64 kbps signal.
Thus, a telephone conversation generates 8000 bytes per second or one byte every 125 microseconds. Telephone
conversations often need to be multiplexed together on a single line. For example, in Europe, thirty 64 kbps voice
signals are multiplexed over a single 2 Mbps (E1) line. This is done by using TDM.
TDM divides the transmission opportunities into slots. In the telephone network, a slot corresponds to 125 microseconds. A position inside each slot is reserved for each voice signal. The figure below illustrates TDM on a
link that is used to carry four voice conversations. The vertical lines represent the slot boundaries and the letters
the different voice conversations. One byte from each voice conversation is sent during each 125 microseconds
slot. The byte corresponding to a given conversation is always sent at the same position in each slot.
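The arithmetic and the interleaving performed by TDM can be sketched in Python. The function names `tdm_multiplex` and `tdm_demultiplex` are hypothetical, and each conversation is represented as a byte string with one byte per 125 microsecond slot.

```python
VOICE_BIT_RATE = 64_000                               # bits per second
bytes_per_second = VOICE_BIT_RATE // 8                # 8000 bytes per second
byte_interval_us = 1_000_000 // bytes_per_second      # one byte every 125 us
thirty_voices_kbps = 30 * VOICE_BIT_RATE // 1000      # 1920 kbps, fits in 2 Mbps


def tdm_multiplex(conversations):
    """Build one slot per 125 us: byte i of every conversation, in order."""
    return [bytes(slot) for slot in zip(*conversations)]


def tdm_demultiplex(slots, position):
    """Recover one conversation from its fixed position in each slot."""
    return bytes(slot[position] for slot in slots)


slots = tdm_multiplex([b"AAA", b"BBB", b"CCC", b"DDD"])
# slots == [b"ABCD", b"ABCD", b"ABCD"]
second_conversation = tdm_demultiplex(slots, position=1)   # b"BBB"
```

Since the position of a conversation inside the slot never changes, the receiver needs no header to know which byte belongs to which conversation.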
The second channel was shared among all terminals to send frames to the mainframe. As all terminals share the
same transmission channel, there is a risk of collision. To deal with this problem as well as transmission errors,
the mainframe verified the parity bits of the received frame and sent an acknowledgement on its channel for each
correctly received frame. The terminals on the other hand had to retransmit the unacknowledged frames. As for
TCP, retransmitting these frames immediately upon expiration of a fixed timeout is not a good approach as several
terminals may retransmit their frames at the same time leading to a network collapse. A better approach, but still
far from perfect, is for each terminal to wait a random amount of time after the expiration of its retransmission
timeout. This avoids synchronisation among multiple retransmitting terminals.
The pseudo-code below shows the operation of an ALOHANet terminal. We use this Python-like syntax for all Medium
Access Control algorithms described in this chapter. The algorithm is applied to each new frame that needs to be
transmitted. It attempts to transmit a frame at most max times (while loop). Each transmission attempt is performed
as follows: First, the frame is sent. Each frame is protected by a timeout. Then, the terminal waits for either a
valid acknowledgement frame or the expiration of its timeout. If the terminal receives an acknowledgement, the
frame has been delivered correctly and the algorithm terminates. Otherwise, the terminal waits for a random time
and attempts to retransmit the frame.
# ALOHA
N=1
while N<= max :
    send(frame)
    wait(ack_on_return_channel or timeout)
    if (ack_on_return_channel):
        break # transmission was successful
    else:
        # timeout
        wait(random_time)
        N=N+1
else:
    # Too many transmission attempts
[Abramson1970] analysed the performance of ALOHANet under particular assumptions and found that ALOHANet worked well when the channel was lightly loaded. In this case, the frames are rarely retransmitted and the
channel traffic, i.e. the total number of (correct and retransmitted) frames transmitted per unit of time is close to
the channel utilization, i.e. the number of correctly transmitted frames per unit of time. Unfortunately, the analysis
also reveals that the channel utilization reaches its maximum at $\frac{1}{2 \times e} \approx 0.184$ times the channel bandwidth. At
higher utilization, ALOHANet becomes unstable and the network collapses due to collided retransmissions.
Note: Amateur packet radio
Packet radio technologies have evolved in various directions since the first experiments performed at the University
of Hawaii. The Amateur packet radio service developed by amateur radio operators is one of the descendants
of ALOHANet. Many amateur radio operators are very interested in new technologies and they often spend countless
hours developing new antennas or transceivers. When the first personal computers appeared, several amateur radio
operators designed radio modems and their own datalink layer protocols [KPD1985] [BNT1997]. This network
grew and it was possible to connect to servers in several European countries by only using packet radio relays.
Some amateur radio operators also developed TCP/IP protocol stacks that were used over the packet radio service.
Some parts of the amateur packet radio network are connected to the global Internet and use the 44.0.0.0/8 prefix.
Many improvements to ALOHANet have been proposed since the publication of [Abramson1970], and this technique, or some of its variants, is still found in wireless networks today. The slotted technique proposed in
[Roberts1975] is important because it shows that a simple modification can significantly improve channel utilization. Instead of allowing all terminals to transmit at any time, [Roberts1975] proposed to divide time into slots
and allow terminals to transmit only at the beginning of each slot. Each slot corresponds to the time required to
transmit one fixed size frame. In practice, these slots can be imposed by a single clock that is received by all
terminals. In ALOHANet, it could have been located on the central mainframe. The analysis in [Roberts1975]
reveals that this simple modification improves the channel utilization by a factor of two.
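The factor of two can be reproduced with the classical throughput formulas for ALOHA. These formulas are not derived in this section: they come from the usual analysis that assumes Poisson traffic, where G is the channel traffic (frames offered per frame time) and S the channel utilization. The function names are chosen for this example.

```python
import math


def pure_aloha(G: float) -> float:
    """Utilization of pure ALOHA: a frame is vulnerable during two frame times."""
    return G * math.exp(-2 * G)


def slotted_aloha(G: float) -> float:
    """Utilization of slotted ALOHA: a frame is vulnerable during one slot."""
    return G * math.exp(-G)


best_pure = pure_aloha(0.5)         # 1/(2e), about 0.184
best_slotted = slotted_aloha(1.0)   # 1/e, about 0.368
gain = best_slotted / best_pure     # 2.0
```

Under these assumptions the maximum utilization of pure ALOHA is reached at G = 0.5 and that of slotted ALOHA at G = 1, which gives the factor of two.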
The above pseudo-code is often called persistent CSMA [KT1975] as the terminal will continuously listen to the
channel and transmit its frame as soon as the channel becomes free. Another important variant of CSMA is the
non-persistent CSMA [KT1975]. The main difference between persistent and non-persistent CSMA described
in the pseudo-code below is that a non-persistent CSMA node does not continuously listen to the channel to
determine when it becomes free. When a non-persistent CSMA terminal senses the transmission channel to be
busy, it waits for a random time before sensing the channel again. This improves channel utilization compared to
persistent CSMA. With persistent CSMA, when two terminals sense the channel to be busy, they will both transmit
(and thus cause a collision) as soon as the channel becomes free. With non-persistent CSMA, this synchronisation
does not occur, as the terminals wait a random time after having sensed the transmission channel. However, the
higher channel utilization achieved by non-persistent CSMA comes at the expense of a slightly higher waiting
time in the terminals when the network is lightly loaded.
# Non persistent CSMA
N=1
while N<= max :
    listen(channel)
    if free(channel):
        send(frame)
        wait(ack or timeout)
        if received(ack) :
            break # transmission was successful
        else :
            # timeout
            N=N+1
    else:
        wait(random_time)
# end of while loop
    # Too many transmission attempts
[KT1975] analyzes in detail the performance of several CSMA variants. Under some assumptions about the transmission channel and the traffic, the analysis compares ALOHA, slotted ALOHA, persistent and non-persistent
CSMA. Under these assumptions, ALOHA achieves a channel utilization of only 18.4% of the channel capacity.
Slotted ALOHA is able to use 36.8% of this capacity. Persistent CSMA improves the utilization by reaching
52.9% of the capacity while non-persistent CSMA achieves 81.5% of the channel capacity.
The inter-frame delay used in this pseudo-code is a short delay corresponding to the time required by a network
adapter to switch from transmit to receive mode. It is also used to prevent a host from sending a continuous
stream of frames without leaving any transmission opportunities for other hosts on the network. This contributes
to the fairness of CSMA/CD. Despite this delay, there are still conditions where CSMA/CD is not completely fair
[RY1994]. Consider for example a network with two hosts : a server sending long frames and a client sending
acknowledgments. Measurements reported in [RY1994] have shown that there are situations where the client
could suffer from repeated collisions that lead it to wait for long periods of time due to the exponential back-off
algorithm.
Carrier Sense Multiple Access with Collision Avoidance
The Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA) Medium Access Control algorithm was
designed for the popular WiFi wireless network technology [802.11]. CSMA/CA also senses the transmission
channel before transmitting a frame. Furthermore, CSMA/CA tries to avoid collisions by carefully tuning the
timers used by CSMA/CA devices.
CSMA/CA uses acknowledgements like CSMA. Each frame contains a sequence number and a CRC. The CRC
is used to detect transmission errors while the sequence number is used to avoid frame duplication. When a
device receives a correct frame, it returns a special acknowledgement frame to the sender. CSMA/CA introduces
a small delay, named Short Inter Frame Spacing (SIFS), between the reception of a frame and the transmission of
the acknowledgement frame. This delay corresponds to the time that is required to switch the radio of a device
between the reception and transmission modes.
Compared to CSMA, CSMA/CA defines more precisely when a device is allowed to send a frame. First,
CSMA/CA defines two delays : DIFS and EIFS. To send a frame, a device must first wait until the channel
has been idle for at least the Distributed Coordination Function Inter Frame Space (DIFS) if the previous frame
was received correctly. However, if the previously received frame was corrupted, this indicates that there are
collisions and the device must sense the channel idle for at least the Extended Inter Frame Space (EIFS), with
SIFS < DIFS < EIFS. The exact values for SIFS, DIFS and EIFS depend on the underlying physical layer
[802.11].
The figure below shows the basic operation of CSMA/CA devices. Before transmitting, host A verifies that the
channel is empty for a long enough period. Then, it sends its data frame. After checking the validity of the
received frame, the recipient sends an acknowledgement frame after a short SIFS delay. Host C, which does not
participate in the frame exchange, senses the channel to be busy at the beginning of the data frame. Host C can
use this information to determine how long the channel will be busy for. Note that as SIFS < DIFS < EIFS,
even a device that would start to sense the channel immediately after the last bit of the data frame could not decide
to transmit its own frame during the transmission of the acknowledgement frame.
The pseudo-code below summarizes the operation of a CSMA/CA device. The values of the SIFS, DIFS, EIFS
and slotTime depend on the underlying physical layer technology [802.11].
# CSMA/CA simplified pseudo-code
N=1
while N<= max :
    waitUntil(free(channel))
    if correct(last_frame) :
        wait(channel_free_during_t >=DIFS)
    else:
        wait(channel_free_during_t >=EIFS)
    backoff_time = int(random[0,min(255,7*(2^(N-1)))])*slotTime
    wait(channel free during backoff_time)
    # backoff timer is frozen while channel is sensed to be busy
    send(frame)
    wait(ack or timeout)
    if received(ack) :
        # frame received correctly
        break
    else:
        # retransmission required
        N=N+1
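The back-off computation in the pseudo-code above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and the slot time must be taken from the physical layer in use (any value passed to `backoff_time` here is an assumption, not one given in the text).

```python
import random

MAX_CONTENTION_SLOTS = 255  # upper bound used by the pseudo-code above


def contention_bound(attempt):
    """Largest number of slots that may be drawn for a given attempt (1, 2, ...)."""
    return min(MAX_CONTENTION_SLOTS, 7 * 2 ** (attempt - 1))


def backoff_time(attempt, slot_time, rng=random):
    """Draw a random back-off, in seconds, for the given transmission attempt."""
    slots = rng.randint(0, contention_bound(attempt))  # both ends are inclusive
    return slots * slot_time


# The bound doubles after each failed attempt until it is capped at 255 slots.
for attempt in range(1, 8):
    print(attempt, contention_bound(attempt))
```

Doubling the bound after each failed attempt spreads retransmissions over a longer interval, which lowers the probability that the same devices collide again.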
Another problem faced by wireless networks is often called the hidden station problem. In a wireless network,
radio signals are not always propagated the same way in all directions. For example, two devices separated by a wall
may not be able to receive each other's signal while they could both be receiving the signal produced by a third
host. This is illustrated in the figure below, but it can happen in other environments. For example, two devices that
are on different sides of a hill may not be able to receive each other's signal while they are both able to receive the
signal sent by a station at the top of the hill. Furthermore, the radio propagation conditions may change with time.
For example, a truck may temporarily block the communication between two nearby devices.
payload, a CRC, the Ending Delimiter and a Frame Status field. The format of the Token Ring data frames is
illustrated below.
destination. This self-clocking is the first mechanism that allows a window-based reliable transport protocol to
adapt to heterogeneous networks [Jacobson1988]. It depends on the availability of buffers to store the segments
that have been sent by the sender but have not yet been transmitted to the destination.
However, transport protocols are not only used in this environment. In the global Internet, a large number of hosts
send segments to a large number of receivers. For example, let us consider the network depicted below which is
similar to the one discussed in [Jacobson1988] and RFC 896. In this network, we assume that the buffers of the
router are infinite to ensure that no packet is lost.
Let us first consider the simple problem of a set of $n$ hosts that share a single bottleneck link as shown in the
example above. In this network, the congestion control scheme must achieve the following objectives [CJ1989] :
1. The congestion control scheme must avoid congestion. In practice, this means that the bottleneck link cannot
be overloaded. If $r_i(t)$ is the transmission rate allocated to host $i$ at time $t$ and $R$ the bandwidth
of the bottleneck link, then the congestion control scheme should ensure that,
on average, $\forall t : \sum_i r_i(t) \le R$.
2. The congestion control scheme must be efficient. The bottleneck link is usually both a shared
and an expensive resource. Usually, bottleneck links are wide area links that are much more
expensive to upgrade than the local area networks. The congestion control scheme should ensure
that such links are efficiently used. Mathematically, the control scheme should ensure that
$\forall t : \sum_i r_i(t) \approx R$.
3. The congestion control scheme should be fair. Most congestion schemes aim at achieving max-min
fairness. An allocation of transmission rates to sources is said to be max-min fair if :
• no link in the network is congested
• the rate allocated to source $j$ cannot be increased without decreasing the rate allocated to a
source $i$ whose allocation is smaller than the rate allocated to source $j$ [Leboudec2008].
Depending on the network, a max-min fair allocation may not always exist. In practice, max-min fairness is an
ideal objective that cannot necessarily be achieved. When there is a single bottleneck link as in the example above,
max-min fairness implies that each source should be allocated the same transmission rate.
To visualise the different rate allocations, it is useful to consider the graph shown below. In this graph, we plot
on the x-axis (resp. y-axis) the rate allocated to host B (resp. A). A point in the graph $(r_B, r_A)$ corresponds to a
possible allocation of the transmission rates. Since there is a 2 Mbps bottleneck link in this network, the graph
can be divided into two regions. The lower left part of the graph contains all allocations $(r_B, r_A)$ such that the
bottleneck link is not congested ($r_A + r_B < 2$). The right border of this region is the efficiency line, i.e. the set
of allocations that completely utilise the bottleneck link ($r_A + r_B = 2$). Finally, the fairness line is the set of fair
allocations.
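The regions of this graph can be expressed as a small test on a pair of rates. The sketch below is illustrative only: the function name and the tolerance are assumptions, and the default capacity of 2 (in Mbps) is the bottleneck bandwidth of the example.

```python
def classify_allocation(rate_a, rate_b, capacity=2.0, tolerance=1e-9):
    """Locate an allocation (in Mbps) relative to the efficiency and fairness lines."""
    total = rate_a + rate_b
    return {
        # above the efficiency line: the bottleneck is overloaded
        "congested": total > capacity + tolerance,
        # on the efficiency line: the bottleneck is fully used
        "efficient": abs(total - capacity) <= tolerance,
        # on the fairness line: both hosts get the same rate
        "fair": abs(rate_a - rate_b) <= tolerance,
    }


print(classify_allocation(1.0, 1.0))  # the point where both lines intersect
print(classify_allocation(0.5, 0.5))  # fair, but the link is underused
print(classify_allocation(1.5, 1.0))  # congested and unfair
```

The only allocation that is both fair and efficient is the intersection of the two lines, here 1 Mbps for each host.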
listed above. Some congestion control schemes rely on a close cooperation between the endhosts and the routers,
while others are mainly implemented on the endhosts with limited support from the routers.
A congestion control scheme can be modelled as an algorithm that adapts the transmission rate ($r_i(t)$) of host $i$
based on the feedback received from the network. Different types of feedback are possible. The simplest scheme
is a binary feedback [CJ1989] [Jacobson1988] where the hosts simply learn whether the network is congested or
not. Some congestion control schemes allow the network to regularly send an allocated transmission rate in Mbps
to each host [BF1995].
Let us focus on the binary feedback scheme which is the most widely used today. Intuitively, the congestion
control scheme should decrease the transmission rate of a host when congestion has been detected in the network,
in order to avoid congestion collapse. Furthermore, the hosts should increase their transmission rate when the
network is not congested. Otherwise, the hosts would not be able to efficiently utilise the network. The rate
allocated to each host fluctuates with time, depending on the feedback received from the network. The figure
below illustrates the evolution of the transmission rates allocated to two hosts in our simple network. Initially, two
hosts have a low allocation, but this is not efficient. The allocations increase until the network becomes congested.
At this point, the hosts decrease their transmission rate to avoid congestion collapse. If the congestion control
scheme works well, after some time the allocations should become both fair and efficient.
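The behaviour described above can be reproduced with a short simulation. The sketch below uses an additive increase and a multiplicative decrease (the AIMD rule that this text refers to later) with binary feedback on a 2 Mbps bottleneck. The step of 0.05 Mbps, the halving factor, the starting rates and the function names are illustrative assumptions.

```python
def aimd_step(rates, capacity, increase=0.05, decrease=0.5):
    """Apply one round of binary feedback to all hosts (rates in Mbps)."""
    congested = sum(rates) > capacity
    if congested:
        # multiplicative decrease: every host backs off by the same factor
        return [rate * decrease for rate in rates]
    # additive increase: every host probes for more bandwidth
    return [rate + increase for rate in rates]


def simulate(rates, capacity=2.0, rounds=200):
    """Run several feedback rounds and return the final allocation."""
    for _ in range(rounds):
        rates = aimd_step(rates, capacity)
    return rates


# Two hosts that start with very different allocations.
print(simulate([0.1, 1.0]))
```

The additive step leaves the difference between the two rates unchanged while each multiplicative decrease halves it, so the allocations drift towards the fairness line while oscillating around the efficiency line.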
Two types of binary feedback are possible in computer networks. A first solution is to rely on implicit feedback.
This is the solution chosen for TCP. TCP's congestion control scheme [Jacobson1988] does not require any cooperation
from the routers. It only assumes that they use buffers and that they discard packets when there is congestion.
TCP uses the segment losses as an indication of congestion. When there are no losses, the network is assumed to
be not congested. This implies that congestion is the main cause of packet losses. This is true in wired networks,
but unfortunately not always true in wireless networks. Another solution is to rely on explicit feedback. This
is the solution proposed in the DECBit congestion control scheme [RJ1995] and used in Frame Relay and ATM
networks. This explicit feedback can be implemented in two ways. A first solution would be to define a special
message that could be sent by routers to hosts when they are congested. Unfortunately, generating such messages
may increase the amount of congestion in the network. Such a congestion indication packet is thus discouraged
by RFC 1812. A better approach is to allow the intermediate routers to indicate, in the packets that they forward,
their current congestion status. Binary feedback can be encoded by using one bit in the packet header. With such a
scheme, congested routers set a special bit in the packets that they forward while non-congested routers leave this
bit unmodified. The destination host returns the congestion status of the network in the acknowledgements that it
sends. Details about such a solution in IP networks may be found in RFC 3168. Unfortunately, as of this writing,
this solution is still not deployed despite its potential benefits.
[Figure: example network in which routers R1 and R2 are connected by a 500 kbps link]
The links between the hosts and the routers have a bandwidth of 1 Mbps while the link between the two routers
has a bandwidth of 500 Kbps. There is no significant propagation delay in this network. For simplicity, assume
that hosts A and B send 1000-bit packets. The transmission of such a packet on a host-router (resp. router-router)
link requires 1 msec (resp. 2 msec). If there is no traffic in the network, the round-trip-time measured by host A
is slightly larger than 4 msec. Let us observe the flow of packets with different window sizes to understand the
relationship between sending window and transmission rate.
Consider first a window of one segment. This segment takes 4 msec to reach host D. The destination replies with
an acknowledgement and the next segment can be transmitted. With such a sending window, the transmission rate
is roughly 250 segments per second or 250 Kbps.
+-------+----------+----------+----------+
| Time  | A-R1     | R1-R2    | R2-D     |
+=======+==========+==========+==========+
| t0    | data(0)  |          |          |
+-------+----------+----------+----------+
| t0+1  |          | data(0)  |          |
+-------+----------+----------+----------+
| t0+2  |          | data(0)  |          |
+-------+----------+----------+----------+
| t0+3  |          |          | data(0)  |
+-------+----------+----------+----------+
| t0+4  | data(1)  |          |          |
+-------+----------+----------+----------+
| t0+5  |          | data(1)  |          |
+-------+----------+----------+----------+
| t0+6  |          | data(1)  |          |
+-------+----------+----------+----------+
| t0+7  |          |          | data(1)  |
+-------+----------+----------+----------+
| t0+8  | data(2)  |          |          |
+-------+----------+----------+----------+
Consider now a window of two segments. Host A can send two segments within 2 msec on its 1 Mbps link. If the
first segment is sent at time t0, it reaches host D at t0 + 4. Host D replies with an acknowledgement that opens the
sending window on host A and enables it to transmit a new segment. In the meantime, the second segment was
buffered by router R1. It reaches host D at t0 + 6 and an acknowledgement is returned. With a window of two
segments, host A transmits at roughly 500 Kbps, i.e. the transmission rate of the bottleneck link.
+-------+----------+----------+----------+
| Time  | A-R1     | R1-R2    | R2-D     |
+=======+==========+==========+==========+
| t0    | data(0)  |          |          |
+-------+----------+----------+----------+
| t0+1  | data(1)  | data(0)  |          |
+-------+----------+----------+----------+
| t0+2  |          | data(0)  |          |
+-------+----------+----------+----------+
| t0+3  |          | data(1)  | data(0)  |
+-------+----------+----------+----------+
| t0+4  | data(2)  | data(1)  |          |
+-------+----------+----------+----------+
| t0+5  |          | data(2)  | data(1)  |
+-------+----------+----------+----------+
| t0+6  | data(3)  | data(2)  |          |
+-------+----------+----------+----------+
Our last example is a window of four segments. These segments are sent at t0, t0 + 1, t0 + 2 and t0 + 3. The first
segment reaches host D at t0 + 4. Host D replies to this segment by sending an acknowledgement that enables host
A to transmit its fifth segment. This segment reaches router R1 at t0 + 5. At that time, router R1 is transmitting
the third segment to router R2 and the fourth segment is still in its buffers. At time t0 + 6, host D receives the
second segment and returns the corresponding acknowledgement. This acknowledgement enables host A to send
its sixth segment. This segment reaches router R1 at roughly t0 + 7. At that time, the router starts to transmit the
fourth segment to router R2. Since link R1-R2 can only sustain 500 Kbps, packets will accumulate in the buffers
of R1. On average, there will be two packets waiting in the buffers of R1. The presence of these two packets
will induce an increase of the round-trip-time as measured by the transport protocol. While the first segment was
acknowledged within 4 msec, the fifth segment (data(4)) that was transmitted at time t0 + 4 is only acknowledged
at time t0 + 11. On average, the sender transmits at 500 Kbps, but the utilisation of a large window induces a
longer delay through the network.
+-------+----------+----------+----------+
| Time  | A-R1     | R1-R2    | R2-D     |
+=======+==========+==========+==========+
| t0    | data(0)  |          |          |
+-------+----------+----------+----------+
| t0+1  | data(1)  | data(0)  |          |
+-------+----------+----------+----------+
| t0+2  | data(2)  | data(0)  |          |
+-------+----------+----------+----------+
| t0+3  | data(3)  | data(1)  | data(0)  |
+-------+----------+----------+----------+
| t0+4  | data(4)  | data(1)  |          |
+-------+----------+----------+----------+
| t0+5  |          | data(2)  | data(1)  |
+-------+----------+----------+----------+
| t0+6  | data(5)  | data(2)  |          |
+-------+----------+----------+----------+
| t0+7  |          | data(3)  | data(2)  |
+-------+----------+----------+----------+
| t0+8  | data(6)  | data(3)  |          |
+-------+----------+----------+----------+
| t0+9  |          | data(4)  | data(3)  |
+-------+----------+----------+----------+
| t0+10 | data(7)  | data(4)  |          |
+-------+----------+----------+----------+
| t0+11 |          | data(5)  | data(4)  |
+-------+----------+----------+----------+
| t0+12 | data(8)  | data(5)  |          |
+-------+----------+----------+----------+
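The relationship between the sending window and the transmission rate in the three examples above can be checked with a few lines of Python. This is a rough sketch: the function name is hypothetical, the round-trip-time is the 4 msec measured without load, and the cap at the bottleneck bandwidth stands in for the queueing in R1 that lengthens the round-trip-time when the window is large.

```python
def transmission_rate_bps(window_segments, segment_bits, rtt_seconds, bottleneck_bps):
    """Rate sustained by a window-based sender, capped by the bottleneck link."""
    # at most one window of data can be sent per round-trip-time
    window_limited_bps = window_segments * segment_bits / rtt_seconds
    return min(window_limited_bps, bottleneck_bps)


# Values of the example: 1000-bit segments, 4 msec RTT, 500 kbps bottleneck.
for window in (1, 2, 4):
    rate = transmission_rate_bps(window, 1000, 0.004, 500_000)
    print(window, "segment(s):", round(rate / 1000), "kbps")
```

A window larger than two segments does not increase the rate in this network. It only increases the number of packets waiting in the buffers of R1.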
From the above example, we can adjust the transmission rate by adjusting the sending window of a reliable
transport protocol. A reliable transport protocol cannot send data faster than $\frac{window}{rtt}$ segments per
second, where $window$ is the current
sending window. To control the transmission rate, we introduce a congestion window. This congestion window
limits the sending window. At any time, the sending window is restricted to $\min(swin, cwin)$, where swin is
the sending window and cwin the current congestion window. Of course, the window is further constrained by
the receive window advertised by the remote peer. With the utilization of a congestion window, a simple reliable
transport protocol that uses fixed size segments could implement AIMD as follows.
For the Additive Increase part our simple protocol would simply increase its congestion window by one segment
every round-trip-time. The Multiplicative Decrease part of AIMD could be implemented by halving the congestion
window when congestion is detected. For simplicity, we assume that congestion is detected thanks to a binary
feedback and that no segments are lost. We will discuss in more details how losses affect a real transport protocol
like TCP.
A congestion control scheme for our simple transport protocol could be implemented as follows.
# Initialisation
cwin = 1 # congestion window measured in segments

# Ack arrival
if newack : # new ack, no congestion
    # increase cwin by one every rtt
    cwin = cwin + (1/cwin)
else:
    # no increase
    pass

# Congestion detected
cwin = cwin/2 # only once per rtt
In the above pseudocode, cwin contains the congestion window stored as a real in segments. This congestion
window is updated upon the arrival of each acknowledgment and when congestion is detected. For simplicity, we
assume that cwin is stored as a floating point number but only full segments can be transmitted.
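A runnable version of this pseudocode could look like the sketch below. The class and method names are hypothetical, and the floor of one segment applied after a decrease is an added assumption that the pseudocode does not state.

```python
class AimdWindow:
    """Congestion window, in segments, updated with the AIMD rule."""

    def __init__(self):
        self.cwin = 1.0  # stored as a float, as in the pseudocode

    def on_new_ack(self):
        # additive increase: roughly one segment per round-trip-time,
        # since about cwin acknowledgements arrive per round-trip-time
        self.cwin += 1.0 / self.cwin

    def on_congestion(self):
        # multiplicative decrease, applied at most once per round-trip-time
        # (assumption: the window never drops below one segment)
        self.cwin = max(1.0, self.cwin / 2)

    def sendable_segments(self, swin):
        # only full segments can be sent, and never more than min(swin, cwin)
        return min(swin, int(self.cwin))


window = AimdWindow()
for _ in range(3):
    window.on_new_ack()
print(window.cwin, window.sendable_segments(swin=10))
```

Keeping the window as a floating point number lets the fractional increases accumulate until they allow one more full segment to be sent.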
As an illustration, let us consider the network scenario above and assume that the router implements the DECBit
binary feedback scheme [RJ1995]. This scheme uses a form of Forward Explicit Congestion Notification and a
router marks the congestion bit in arriving packets when its buffer contains one or more packets. In the figure
below, we use a * to indicate a marked packet.
+-------+----------+----------+----------+
| Time  | A-R1     | R1-R2    | R2-D     |
+=======+==========+==========+==========+
| t0    | data(0)  |          |          |
+-------+----------+----------+----------+
| t0+1  |          | data(0)  |          |
+-------+----------+----------+----------+
| t0+2  |          | data(0)  |          |
+-------+----------+----------+----------+
| t0+3  |          |          | data(0)  |
+-------+----------+----------+----------+
| t0+4  | data(1)  |          |          |
+-------+----------+----------+----------+
| t0+5  | data(2)  | data(1)  |          |
+-------+----------+----------+----------+
| t0+6  |          | data(1)  |          |
+-------+----------+----------+----------+
| t0+7  |          | data(2)  | data(1)  |
+-------+----------+----------+----------+
| t0+8  | data(3)  | data(2)  |          |
+-------+----------+----------+----------+
| t0+9  |          | data(3)  | data(2)  |
+-------+----------+----------+----------+
| t0+10 | data(4)  | data(3)  |          |
+-------+----------+----------+----------+
| t0+11 | data(5)  | data(4)  | data(3)  |
+-------+----------+----------+----------+
| t0+12 | data(6)  | data(4)  |          |
+-------+----------+----------+----------+
| t0+13 |          | data(5)  | data(4)  |
+-------+----------+----------+----------+
| t0+14 | data(7)  | data(5)  |          |
+-------+----------+----------+----------+
| t0+15 |          | data*(6) | data(5)  |
+-------+----------+----------+----------+
| t0+16 | data(8)  | data*(6) |          |
+-------+----------+----------+----------+
| t0+17 | data(9)  | data*(7) | data*(6) |
+-------+----------+----------+----------+
| t0+18 |          | data*(7) |          |
+-------+----------+----------+----------+
| t0+19 |          | data*(8) | data*(7) |
+-------+----------+----------+----------+
| t0+20 |          | data*(8) |          |
+-------+----------+----------+----------+
| t0+21 |          | data*(9) | data*(8) |
+-------+----------+----------+----------+
| t0+22 | data(10) | data*(9) |          |
+-------+----------+----------+----------+
When the connection starts, its congestion window is set to one segment. Segment data(0) is sent at t0 and
acknowledged at roughly t0 + 4. The congestion window is increased by one segment and data(1) and data(2) are
transmitted at time t0 + 4 and t0 + 5. The corresponding acknowledgements are received at times t0 + 8 and
t0 + 10. Upon reception of this last acknowledgement, the congestion window reaches 3 and further segments can be
sent (data(4) and data(5)). When segment data(6) reaches router R1, its buffers already contain data(5). The packet
containing data(6) is thus marked to inform the sender of the congestion. Note that the sender will only notice
the congestion once it receives the corresponding acknowledgement at t0 + 18. In the meantime, the congestion
window continues to increase. At t0 + 16, upon reception of the acknowledgement for data(5), it reaches 4. When
congestion is detected, the congestion window is decreased down to 2. This explains the idle time between the
reception of the acknowledgement for data*(6) and the transmission of data(10).
• the Presentation layer was designed to cope with the different ways of representing information on computers.
There are many differences in the way computers store information. Some computers store integers
as 32-bit fields, others use 64-bit fields, and the same problem arises with floating point numbers. For textual
information, this is even more complex with the many different character codes that have been used 20 . The
situation is even more complex when considering the exchange of structured information such as database
records. To solve this problem, the Presentation layer provides a common representation of the
data transferred. The ASN.1 notation was designed for the Presentation layer and is still used today by some
protocols.
• the Application layer that contains the mechanisms that fit in neither the Presentation nor the Session
layer. The OSI Application layer was itself further divided in several generic service elements.
20 There is now a rough consensus for the greater use of the Unicode character format. Unicode can represent more than 100,000 different
characters from the known written languages on Earth. Maybe one day, all computers will only use Unicode to represent all their stored
characters and Unicode could become the standard format to exchange characters, but we are not yet at this stage today.
CHAPTER 3
Part 2: Protocols
The RD (recursion desired) bit is set by a client when it sends a query to a resolver. Such a query is said to be
recursive because the resolver will recurse through the DNS hierarchy to retrieve the answer on behalf of the client.
In the past, all resolvers were configured to perform recursive queries on behalf of any Internet host. However,
this exposes the resolvers to several security risks. The simplest one is that the resolver could become overloaded
by having too many recursive queries to process. As of this writing, most resolvers 1 only allow recursive queries
from clients belonging to their company or network and discard all other recursive queries. The RA bit indicates
whether the server supports recursion. The RCODE is used to distinguish between different types of errors. See
RFC 1035 for additional details. The last four fields indicate the size of the Question, Answer, Authority and
Additional sections of the DNS message.
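To make the role of these header fields concrete, the sketch below packs and unpacks the 12-byte DNS header with Python's `struct` module. The bit masks follow the header layout of RFC 1035. The function names and the identifier value are illustrative assumptions.

```python
import struct

# Positions of some flags inside the 16-bit flags word (RFC 1035 layout)
QR_FLAG = 0x8000     # 0 for a query, 1 for a response
RD_FLAG = 0x0100     # recursion desired, set by the client
RA_FLAG = 0x0080     # recursion available, set by the server
RCODE_MASK = 0x000F  # error code returned by the server


def build_query_header(identifier, recursion_desired=True, questions=1):
    """Return the 12-byte header of a DNS query (illustrative helper)."""
    flags = RD_FLAG if recursion_desired else 0
    # identifier, flags, then the sizes of the four sections
    return struct.pack("!HHHHHH", identifier, flags, questions, 0, 0, 0)


def parse_header(header):
    """Decode the fields discussed in the text from a DNS header."""
    identifier, flags, qd, an, ns, ar = struct.unpack("!HHHHHH", header[:12])
    return {
        "id": identifier,
        "is_response": bool(flags & QR_FLAG),
        "rd": bool(flags & RD_FLAG),
        "ra": bool(flags & RA_FLAG),
        "rcode": flags & RCODE_MASK,
        "section_sizes": (qd, an, ns, ar),
    }


print(parse_header(build_query_header(0x1234)))
```

The `!` prefix of the format string selects network byte order, which is the order used for all the 16-bit fields of the header.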
The last four sections of the DNS message contain Resource Records (RR). All RRs have the same top level format
shown in the figure below.
NS record contains the name of the DNS server that is responsible for a given domain. For example, a query for
the AAAA record associated to the www.ietf.org name returns the following answer.
Records that it understands. This extensibility allowed the Domain Name System to evolve over the years while
still preserving the backward compatibility with already deployed DNS implementations.
• the cc: header line is used by the sender to provide a list of email addresses that must receive a carbon copy
of the message. Several addresses can be listed in this header line, separated by commas. All recipients of
the email message receive the To: and cc: header lines.
• the bcc: header line is used by the sender to provide a list of comma separated email addresses that must
receive a blind carbon copy of the message. The bcc: header line is not delivered to the recipients of the
email message.
A simple email message containing the From:, To:, Subject: and Date: header lines and two lines of body is shown
below.
From: Bob Smith <Bob@machine.example>
To: Alice Doe <alice@example.net>, Alice Smith <Alice@machine.example>
Subject: Hello
Date: Mon, 8 Mar 2010 19:55:06 -0600

This is the "Hello world" of email messages.
This is the second line of the body
Note the empty line after the Date: header line; this empty line contains only the CR and LF characters, and marks
the boundary between the header and the body of the message.
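The structure of this message can be explored with the `email` package of the Python standard library. The sketch below is only an illustration: it rebuilds the message above as a string, with each line ended by CR and LF, and then extracts the header lines and the body.

```python
from email import message_from_string
from email.utils import getaddresses

# The example message, with the empty line that ends the header
RAW_MESSAGE = (
    "From: Bob Smith <Bob@machine.example>\r\n"
    "To: Alice Doe <alice@example.net>, Alice Smith <Alice@machine.example>\r\n"
    "Subject: Hello\r\n"
    "Date: Mon, 8 Mar 2010 19:55:06 -0600\r\n"
    "\r\n"
    "This is the \"Hello world\" of email messages.\r\n"
    "This is the second line of the body\r\n"
)

# Splitting on the first empty line separates the header from the body.
header_text, _, body_text = RAW_MESSAGE.partition("\r\n\r\n")

# The standard library parser applies the same rule and decodes the header lines.
message = message_from_string(RAW_MESSAGE)
recipients = [address for _name, address in getaddresses(message.get_all("To", []))]

print(message["Subject"])
print(recipients)
print(body_text)
```

Header names are matched case-insensitively by the parser, so `message["subject"]` returns the same value as `message["Subject"]`.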
Several other optional header lines are defined in RFC 5322 and elsewhere 3 . Furthermore, many email clients
and servers define their own header lines starting with X-. Several of the optional header lines defined in RFC
5322 are worth discussing here :
• the Message-Id: header line is used to associate a unique identifier to each email. Email identifiers are
usually structured like string@domain where string is a unique character string or sequence number chosen
by the sender of the email and domain the domain name of the sender. Since domain names are unique,
a host can generate globally unique message identifiers by concatenating a locally unique identifier with its
domain name.
• the In-reply-to: header line is used when a message was created in reply to a previous message. In this case,
the end of the In-reply-to: line contains the identifier of the original message.
• the Received: header line is used when an email message is processed by several servers before reaching its
destination. Each intermediate email server adds a Received: header line. These header lines are useful to
debug problems in delivering email messages.
2 It could be surprising that the To: is not mandatory inside an email message. While most email messages will contain this header line, an
email that does not contain a To: header line and that relies on the bcc: to specify the recipient is valid as well.
3 The list of all standard email header lines may be found at http://www.iana.org/assignments/message-headers/message-header-index.html
The figure below shows the header lines of one email message. The message originated at a host named
wira.firstpr.com.au and was received by smtp3.sgsi.ucl.ac.be. The Received: lines have been wrapped for readability.
Received: from smtp3.sgsi.ucl.ac.be (Unknown [10.1.5.3])
by mmp.sipr-dc.ucl.ac.be
(Sun Java(tm) System Messaging Server 7u3-15.01 64bit (built Feb 12 2010))
with ESMTP id <0KYY00L85LI5JLE0@mmp.sipr-dc.ucl.ac.be>; Mon,
08 Mar 2010 11:37:17 +0100 (CET)
Received: from mail.ietf.org (mail.ietf.org [64.170.98.32])
by smtp3.sgsi.ucl.ac.be (Postfix) with ESMTP id B92351C60D7; Mon,
08 Mar 2010 11:36:51 +0100 (CET)
Received: from [127.0.0.1] (localhost [127.0.0.1])
by core3.amsl.com (Postfix)
with ESMTP id F066A3A68B9; Mon, 08 Mar 2010 02:36:38 -0800 (PST)
Received: from localhost (localhost [127.0.0.1])
by core3.amsl.com (Postfix)
with ESMTP id A1E6C3A681B for <rrg@core3.amsl.com>; Mon,
08 Mar 2010 02:36:37 -0800 (PST)
Received: from mail.ietf.org ([64.170.98.32])
by localhost (core3.amsl.com [127.0.0.1]) (amavisd-new, port 10024)
with ESMTP id erw8ih2v8VQa for <rrg@core3.amsl.com>; Mon,
08 Mar 2010 02:36:36 -0800 (PST)
Received: from gair.firstpr.com.au (gair.firstpr.com.au [150.101.162.123])
by core3.amsl.com (Postfix) with ESMTP id 03E893A67ED
for <rrg@irtf.org>; Mon,
08 Mar 2010 02:36:35 -0800 (PST)
Received: from [10.0.0.6] (wira.firstpr.com.au [10.0.0.6])
by gair.firstpr.com.au (Postfix) with ESMTP id D0A49175B63; Mon,
08 Mar 2010 21:36:37 +1100 (EST)
Date: Mon, 08 Mar 2010 21:36:38 +1100
From: Robin Whittle <rw@firstpr.com.au>
Subject: Re: [rrg] Recommendation and what happens next
In-reply-to: <C7B9C21A.4FAB%tony.li@tony.li>
To: RRG <rrg@irtf.org>
Message-id: <4B94D336.7030504@firstpr.com.au>
Message content removed
Initially, email was used to exchange small messages of ASCII text between computer scientists. However, with
the growth of the Internet, supporting only ASCII text became a severe limitation for two reasons. First of all,
non-English speakers wanted to write emails in their native language that often required more characters than
those of the ASCII character table. Second, many users wanted to send other content than just ASCII text by
email such as binary files, images or sound.
To solve this problem, the IETF developed the Multipurpose Internet Mail Extensions (MIME). These extensions
were carefully designed to allow Internet email to carry non-ASCII characters and binary files without breaking
the email servers that were deployed at that time. This requirement for backward compatibility forced the MIME
designers to develop extensions to the existing email message format RFC 822 instead of defining a completely
new format that would have been better suited to support the new types of emails.
RFC 2045 defines three new types of header lines to support MIME :
• The MIME-Version: header indicates the version of the MIME specification that was used to encode the
email message. The current version of MIME is 1.0. Other versions of MIME may be defined in the future.
Thanks to this header line, the software that processes email messages will be able to adapt to the MIME
version used to encode the message. Messages that do not contain this header are supposed to be formatted
according to the original RFC 822 specification.
• The Content-Type: header line indicates the type of data that is carried inside the message (see below).
• The Content-Transfer-Encoding: header line is used to specify how the message has been encoded. When
MIME was designed, some email servers were only able to process messages containing characters encoded
using the 7-bit ASCII character set. MIME allows the utilisation of other character encodings.
Inside the email header, the Content-Type: header line indicates how the MIME email message is structured. RFC
2046 defines the utilisation of this header line. The two most common structures for MIME messages are :
• Content-Type: multipart/mixed. This header line indicates that the MIME message contains several
independent parts. For example, such a message may contain a part in plain text and a binary file.
• Content-Type: multipart/alternative. This header line indicates that the MIME message contains several
representations of the same information. For example, a multipart/alternative message may contain both a
plain text and an HTML version of the same text.
To support these two types of MIME messages, the recipient of a message must be able to extract the different
parts from the message. In RFC 822, an empty line was used to separate the header lines from the body. Using an
empty line to separate the different parts of an email body would be difficult as the body of email messages often
contains one or more empty lines. Another possible option would be to define a special line, e.g. *-LAST_LINE-*
to mark the boundary between two parts of a MIME message. Unfortunately, this is not possible as some emails
may contain this string in their body (e.g. emails sent to students to explain the format of MIME messages). To
solve this problem, the Content-Type: header line contains a second parameter that specifies the string that has
been used by the sender of the MIME message to delineate the different parts. In practice, this string is often
chosen randomly by the mail client.
The email message below, copied from RFC 2046 shows a MIME message containing two parts that are both in
plain text and encoded using the ASCII character set. The string simple boundary is defined in the Content-Type:
header as the marker for the boundary between two successive parts. Another example of MIME messages may
be found in RFC 2046.
Date: Mon, 20 Sep 1999 16:33:16 +0200
From: Nathaniel Borenstein <nsb@bellcore.com>
To: Ned Freed <ned@innosoft.com>
Subject: Test
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="simple boundary"

preamble, to be ignored
--simple boundary
Content-Type: text/plain; charset=us-ascii

First part
--simple boundary
Content-Type: text/plain; charset=us-ascii

Second part
--simple boundary--
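As an illustration, the message above can be split into its parts with the `email` package of the Python standard library. In this sketch the message is rebuilt as a string with the empty lines that separate headers from content, and the last delimiter is written as --simple boundary--, the closing form defined in RFC 2046.

```python
from email import message_from_string

RAW_MIME = "\n".join([
    "Date: Mon, 20 Sep 1999 16:33:16 +0200",
    "From: Nathaniel Borenstein <nsb@bellcore.com>",
    "To: Ned Freed <ned@innosoft.com>",
    "Subject: Test",
    "MIME-Version: 1.0",
    'Content-Type: multipart/mixed; boundary="simple boundary"',
    "",
    "preamble, to be ignored",
    "--simple boundary",
    "Content-Type: text/plain; charset=us-ascii",
    "",
    "First part",
    "--simple boundary",
    "Content-Type: text/plain; charset=us-ascii",
    "",
    "Second part",
    "--simple boundary--",
    "",
])

message = message_from_string(RAW_MIME)

# For a multipart message, get_payload() returns the list of parts.
parts = [part.get_payload().strip() for part in message.get_payload()]

print(message.get_content_type(), message.get_boundary())
print(parts)
```

Each part is itself a small message with its own header lines, which is why a part can carry its own Content-Type: header line.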
The Content-Type: header can also be used inside a MIME part. In this case, it indicates the type of data placed
in this part. Each data type is specified as a type followed by a subtype. A detailed description may be found in
RFC 2046. Some of the most popular Content-Type: header lines are :
• text. The message part contains information in textual format. There are several subtypes : text/plain for
regular ASCII text, text/html defined in RFC 2854 for documents in HTML format or the text/enriched
format defined in RFC 1896. The Content-Type: header line may contain a second parameter that specifies
the character set used to encode the text. charset=us-ascii is the standard ASCII character table. Other
frequent character sets include charset=utf-8 or charset=iso-8859-1. The list of standard character sets is
maintained by IANA.
• image. The message part contains a binary representation of an image. The subtype indicates the format of
the image such as gif, jpeg or png.
• audio. The message part contains an audio clip. The subtype indicates the format of the audio clip like wav
or mp3.
• video. The message part contains a video clip. The subtype indicates the format of the video clip like avi or
mp4.
• application. The message part contains binary information that was produced by the particular application
listed as the subtype. Email clients use the subtype to launch the application that is able to decode the
Value  Encoding    Value  Encoding    Value  Encoding    Value  Encoding
0      A           17     R           34     i           51     z
1      B           18     S           35     j           52     0
2      C           19     T           36     k           53     1
3      D           20     U           37     l           54     2
4      E           21     V           38     m           55     3
5      F           22     W           39     n           56     4
6      G           23     X           40     o           57     5
7      H           24     Y           41     p           58     6
8      I           25     Z           42     q           59     7
9      J           26     a           43     r           60     8
10     K           27     b           44     s           61     9
11     L           28     c           45     t           62     +
12     M           29     d           46     u           63     /
13     N           30     e           47     v
14     O           31     f           48     w
15     P           32     g           49     x
16     Q           33     h           50     y
The example below, from RFC 4648, illustrates the Base64 encoding.
Input data   0x14fb9c03d97e
8-bit        00010100 11111011 10011100 00000011 11011001 01111110
6-bit        000101 001111 101110 011100 000000 111101 100101 111110
Decimal      5 15 46 28 0 61 37 62
Encoding     FPucA9l+
The last point to be discussed about base64 is what happens when the length of the sequence of bytes to be
encoded is not a multiple of three. In this case, the last group of bytes may contain one or two bytes instead of
three. Base64 reserves the = character as a padding character. This character is used once when the last group
contains two bytes and twice when it contains one byte as illustrated by the two examples below.
Input data   0x14
8-bit        00010100
6-bit        000101 000000
Decimal      5 0
Encoding     FA==

Input data   0x14fb
8-bit        00010100 11111011
6-bit        000101 001111 101100
Decimal      5 15 44
Encoding     FPs=
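The grouping into 6-bit values and the padding rule can be checked with a small Python sketch. The function b64_encode below is our own illustrative implementation, not code taken from RFC 4648; it is compared with the base64 module of the Python standard library.

```python
import base64

# The 64 characters of the Base64 alphabet, indexed by their 6-bit value.
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def b64_encode(data: bytes) -> str:
    """Encode bytes in Base64, three input bytes (24 bits) at a time."""
    output = []
    for i in range(0, len(data), 3):
        group = data[i:i + 3]
        # Pack the group in a 24-bit integer; missing bytes are zero bits.
        number = int.from_bytes(group.ljust(3, b"\x00"), "big")
        # Split the 24 bits into four 6-bit values and map them to characters.
        chars = [ALPHABET[(number >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # A group of k bytes yields k + 1 meaningful characters; the
        # remaining positions are filled with the = padding character.
        keep = len(group) + 1
        output.append("".join(chars[:keep]) + "=" * (4 - keep))
    return "".join(output)


if __name__ == "__main__":
    for hexa in ("14fb9c03d97e", "14", "14fb"):
        raw = bytes.fromhex(hexa)
        print(hexa, b64_encode(raw), base64.b64encode(raw).decode("ascii"))
```

For the three inputs used in the tables above, both implementations are expected to produce the same strings.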
Now that we have explained the format of the email messages, we can discuss how these messages can be exchanged through the Internet. The figure below illustrates the protocols that are used when Alice sends an email
message to Bob. Alice prepares her email with an email client or on a webmail interface. To send her email to
Bob, Alice's client will use the Simple Mail Transfer Protocol (SMTP) to deliver her message to her SMTP server.
Alice's email client is configured with the name of the default SMTP server for her domain. There is usually at
least one SMTP server per domain. To deliver the message, Alice's SMTP server must find the SMTP server that
contains Bob's mailbox. This can be done by using the Mail eXchange (MX) records of the DNS. A set of MX
records can be associated with each domain. Each MX record contains a numerical preference and the fully qualified
domain name of an SMTP server that is able to deliver email messages destined to all valid email addresses of this
domain. The DNS can return several MX records for a given domain. In this case, the server with the lowest
preference value is used first. If this server is not reachable, the server with the next lowest preference value is used, and so on. Bob's SMTP
server will store the message sent by Alice until Bob retrieves it using a webmail interface or protocols such as the
Post Office Protocol (POP) or the Internet Message Access Protocol (IMAP).
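The selection among several MX records can be sketched in a few lines of Python. The records and host names below are hypothetical, and the set of reachable servers stands in for real connection attempts; an actual mail server would query the DNS and try to open a TCP connection to each candidate.

```python
def order_mx(records):
    """Return the host names sorted by increasing MX preference value."""
    return [host for _, host in sorted(records, key=lambda record: record[0])]


def pick_server(records, reachable):
    """Return the most preferred host that is reachable, or None."""
    for host in order_mx(records):
        if host in reachable:  # stands in for a real connection attempt
            return host
    return None


# Hypothetical MX records for a domain, as (preference, host name) pairs.
MX_RECORDS = [
    (20, "backup.example.com"),
    (10, "smtp.example.com"),
    (30, "last.example.com"),
]

if __name__ == "__main__":
    print(order_mx(MX_RECORDS))
    print(pick_server(MX_RECORDS, {"backup.example.com", "last.example.com"}))
```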
The first four reply codes correspond to errors in the commands sent by the client. The fourth reply code would
be sent by the server when the client sends commands in an incorrect order (e.g. the client tries to send an email
before providing the destination address of the message). Reply code 220 is used by the server as the first message
when it agrees to interact with the client. Reply code 221 is sent by the server before closing the underlying
transport connection. Reply code 421 is returned when there is a problem (e.g. lack of memory/disk resources)
that prevents the server from accepting the transport connection. Reply code 250 is the standard positive reply that
indicates the success of the previous command. Reply codes 450 and 452 indicate that the destination mailbox
is temporarily unavailable, for various reasons, while reply code 550 indicates that the mailbox does not exist or
cannot be used for policy reasons. Reply code 354 indicates that the client can start transmitting its email message.
5 The first versions of SMTP used HELO as the first command sent by a client to an SMTP server. When SMTP was extended to support
newer features such as 8-bit characters, it was necessary to allow a server to recognise whether it was interacting with a client that supported
the extensions or not. EHLO became mandatory with the publication of RFC 2821.
The transfer of an email message is performed in three phases. During the first phase, the client opens a transport
connection with the server. Once the connection has been established, the client and the server exchange greeting
messages (EHLO command). Most servers insist on receiving valid greeting messages and some of them drop the
underlying transport connection if they do not receive a valid greeting. Once the greetings have been exchanged,
the email transfer phase can start. During this phase, the client transfers one or more email messages by indicating
the email address of the sender (MAIL FROM: command), the email address of the recipient (RCPT TO: command)
followed by the headers and the body of the email message (DATA command). Once the client has finished sending
all its queued email messages to the SMTP server, it terminates the SMTP association (QUIT command).
A successful transfer of an email message is shown below
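S: 220 (greeting of the server on smtp.example.com)
C: EHLO mta.example.org
S: 250 (greeting of the server)
C: MAIL FROM: (address of the sender)
S: 250
C: RCPT TO: (address of the recipient)
S: 250
C: DATA
S: 354
C: (headers of the email message)
C:
C: (body of the email message)
C: .
S: 250
C: QUIT
S: 221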
In the example above, the MTA running on mta.example.org opens a TCP connection to the SMTP server on host
smtp.example.com. The lines prefixed with S: (resp. C:) are the responses sent by the server (resp. the commands
sent by the client). The server sends its greetings as soon as the TCP connection has been established. The client
then sends the EHLO command with its fully qualified domain name. The server replies with reply-code 250 and
sends its greetings. The SMTP association can now be used to exchange an email.
To send an email, the client must first provide the address of the sender with MAIL FROM:. Then it uses RCPT
TO: with the address of the recipient. Both the sender and the recipient are accepted by the server. The client
can now issue the DATA command to start the transfer of the email message. After having received the 354 reply
code, the client sends the headers and the body of its email message. The client indicates the end of the message
by sending a line containing only the . (dot) character 6 . The server confirms that the email message has been
queued for delivery or transmission with a reply code of 250. The client issues the QUIT command to close the
session and the server confirms with reply-code 221, before closing the TCP connection.
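The client side of this exchange can be summarised with a minimal Python sketch that only builds the lines sent by the client, in order. It does not open a connection and ignores the server replies, which a real client must read and check after each command. The function name and the addresses are hypothetical. The sketch assumes the transparency rule of RFC 5321: a message line that starts with a dot is sent with an additional leading dot, so that it cannot be mistaken for the end-of-message marker.

```python
def smtp_client_lines(client_name, sender, recipient, message_lines):
    """Return, in order, the lines that an SMTP client sends for one message."""
    lines = [
        f"EHLO {client_name}",      # greeting with the client's domain name
        f"MAIL FROM:<{sender}>",    # address of the sender comes first
        f"RCPT TO:<{recipient}>",   # then the address of the recipient
        "DATA",                     # server is expected to answer with 354
    ]
    for line in message_lines:
        # Dot-stuffing (RFC 5321): protect lines that start with a dot.
        lines.append("." + line if line.startswith(".") else line)
    lines.append(".")               # a single dot marks the end of the message
    lines.append("QUIT")
    return lines


if __name__ == "__main__":
    message = ["Subject: Hello", "", "Hello Bob", ".", "Alice"]
    for line in smtp_client_lines("mta.example.org", "alice@example.org",
                                  "bob@example.com", message):
        print("C:", line)
```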
Note: Open SMTP relays and spam
Since its creation in 1971, email has been a very useful tool that is used by many users to exchange lots of
information. In the early days, all SMTP servers were open and anyone could use them to forward emails towards
their final destination. Unfortunately, over the years, some unscrupulous users have found ways to use email for
marketing purposes or to send malware. The first documented abuse of email for marketing purposes occurred in
1978 when a marketer who worked for a computer vendor sent a marketing email to many ARPANET users. At
that time, the ARPANET could only be used for research purposes and this was an abuse of the acceptable use
policy. Unfortunately, given the extremely low cost of sending emails, the problem of unsolicited emails has not
stopped. Unsolicited emails are now called spam and a study carried out by ENISA in 2009 revealed that 95% of
email was spam and this number seems to continue to grow. This places a burden on the email infrastructure of
Internet Service Providers and large companies that need to process many useless messages.
6 This implies that a line containing only a dot followed by CR and LF cannot be sent unchanged inside an email message. If a user types such a line in an email,
the email client alters it when sending the message over SMTP: RFC 5321 requires the client to insert an additional dot at the beginning of every line that starts with a dot, and the server removes it.
Given the amount of spam messages, SMTP servers are no longer open RFC 5068. Several extensions to SMTP
have been developed in recent years to deal with this problem. For example, the SMTP authentication scheme
defined in RFC 4954 can be used by an SMTP server to authenticate a client. Several techniques have also been
proposed to allow SMTP servers to authenticate the messages sent by their users RFC 4870 RFC 4871 .
S: +OK 2 620
C: LIST
S: +OK 2 messages (620 octets)
S: 1 120
S: 2 500
S: .
C: RETR 1
S: +OK 120 octets
S: <the POP3 server sends message 1>
S: .
C: DELE 1
S: +OK message 1 deleted
C: QUIT
S: +OK POP3 server signing off (1 message left)
In this example, a POP client contacts a POP server on behalf of the user named alice. Note that in this example,
Alice's password is sent in clear by the client. This implies that if someone is able to capture the packets sent by
Alice, they will know Alice's password 7 . Then Alice's client issues the STAT command to know the number of
messages that are stored in her mailbox. It then retrieves and deletes the first message of the mailbox.
7 RFC 1939 defines the APOP authentication scheme that is not vulnerable to such attacks.
The first component of a URI is its scheme. A scheme can be seen as a selector, indicating the meaning of the
fields after it. In practice, the scheme often identifies the application-layer protocol that must be used by the client
to retrieve the document, but it is not always the case. Some schemes do not imply a protocol at all and some do
not indicate a retrievable document 8 . The most frequent scheme is http that will be described later. A URI scheme
can be defined for almost any application layer protocol. The character : follows the scheme of any URI, and the
characters // precede the authority when the URI contains one.
8 An example of a non-retrievable URI is urn:isbn:0-380-81593-1 which is a unique identifier for a book, through the urn scheme
(see RFC 3187). Of course, any URI can be made retrievable via a dedicated server or a new protocol but this one has no explicit protocol. Same thing for the scheme tag (see RFC 4151), often used in Web syndication (see RFC 4287 about the Atom syndication format).
Even when the scheme is retrievable (for instance with http), it is often used only as an identifier, not as a way to get a resource. See
http://norman.walsh.name/2006/07/25/namesAndAddresses for a good explanation.
The second part of the URI is the authority. With a retrievable URI, this includes the DNS name or the IP address
of the server where the document can be retrieved using the protocol specified via the scheme. This name can
be preceded by some information about the user (e.g. a user name) who is requesting the information. Earlier
definitions of the URI allowed the specification of a user name and a password before the @ character (RFC
1738), but this is now deprecated as placing a password inside a URI is insecure. The host name can be followed
by the colon character and a port number. A default port number is defined for some protocols and the port
number should only be included in the URI if a non-default port number is used (for other protocols, techniques
like service DNS records are used).
The third part of the URI is the path to the document. This path is structured as filenames on a Unix host (but
it does not imply that the files are indeed stored this way on the server). If the path is not specified, the server
will return a default document. The last two optional parts of the URI are used to provide a query and indicate a
specific part (e.g. a section in an article) of the requested document. Sample URIs are shown below.
http://tools.ietf.org/html/rfc3986.html
mailto:infobot@example.com?subject=current-issue
http://docs.python.org/library/basehttpserver.html?highlight=http#BaseHTTPServer.BaseHTTPRequestHa
telnet://[2001:db8:3080:3::2]:80/
ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm
The first URI corresponds to a document named rfc3986.html that is stored on the server named tools.ietf.org and
can be accessed by using the http protocol on its default port. The second URI corresponds to an email message,
with subject current-issue, that will be sent to user infobot in domain example.com. The mailto: URI scheme is
defined in RFC 6068. The third URI references the portion BaseHTTPServer.BaseHTTPRequestHandler of the
document basehttpserver.html that is stored in the library directory on server docs.python.org. This document can
be retrieved by using the http protocol. The query highlight=http is associated with this URI. The fourth example is a
server that operates the telnet protocol, uses IPv6 address 2001:db8:3080:3::2 and is reachable on port 80. The last
URI is somewhat special. Most users will assume that it corresponds to a document stored on the cnn.example.com
server. However, to parse this URI, it is important to remember that the @ character is used to separate the user
name from the host name in the authority part of a URI. This implies that the URI points to a document named
top_story.htm on host having IPv4 address 10.0.0.1. The document will be retrieved by using the ftp protocol with
the user name set to cnn.example.com&story=breaking_news.
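The decomposition of these sample URIs can be verified with the urlsplit function of the Python standard library. The sketch below is only an illustration; the helper name describe is ours.

```python
from urllib.parse import urlsplit


def describe(uri):
    """Split a URI into the components discussed above."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "user": parts.username,    # information placed before the @ character
        "host": parts.hostname,
        "port": parts.port,        # None when the default port is used
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


if __name__ == "__main__":
    samples = [
        "http://tools.ietf.org/html/rfc3986.html",
        "telnet://[2001:db8:3080:3::2]:80/",
        "ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm",
    ]
    for uri in samples:
        print(uri)
        print("   ", describe(uri))
```

For the last URI, the host reported by the parser is 10.0.0.1 and cnn.example.com&story=breaking_news is only the user name.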
The second component of the world wide web is the HyperText Markup Language (HTML). HTML defines the
format of the documents that are exchanged on the web. The first version of HTML was derived from the Standard
Generalized Markup Language (SGML) that was standardised in 1986 by ISO. SGML was designed to allow
large project documents in industries such as government, law or aerospace to be shared efficiently in a machine-readable manner. These industries require documents to remain readable and editable for tens of years and insisted
on a standardised format supported by multiple vendors. Today, SGML is no longer widely used beyond specific
applications, but its descendants including HTML and XML are now widespread.
A markup language is a structured way of adding annotations about the formatting of the document within the
document itself. Example markup languages include troff, which is used to write the Unix man pages, or LaTeX.
HTML uses markers to annotate text and a document is composed of HTML elements. Each element is usually
composed of three items: a start tag that potentially includes some specific attributes, some text (often including
other elements), and an end tag. A HTML tag is a keyword enclosed in angle brackets. The generic form of a
HTML element is
<tag>Some text to be displayed</tag>
More complex HTML elements can also include optional attributes in the start tag
<tag attribute1="value1" attribute2="value2">some text to be displayed</tag>
The HTML document shown below is composed of two parts : a header, delineated by the <head> and </head>
markers, and a body (between the <body> and </body> markers). In the example below, the header only contains
a title, but other types of information can be included in the header. The body contains an image, some text and a
list with three hyperlinks. The image is included in the web page by indicating its URI between brackets inside the
<img src=...> marker. The image can, of course, reside on any server and the client will automatically download
it when rendering the web page. The <h1>...</h1> marker is used to specify the first level of headings. The <ul>
marker indicates an unnumbered list while the <li> marker indicates a list item. The <a href=URI>text</a>
indicates a hyperlink. The text will be underlined in the rendered web page and the client will fetch the specified
URI when the user clicks on the link.
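To see how a client can locate the URIs embedded in such a document, the sketch below feeds a small page to the HTMLParser class of the Python standard library. The page is a hypothetical one, written only to match the description above (a title, an image, a heading and a list of three hyperlinks), and the class name LinkCollector is ours.

```python
from html.parser import HTMLParser

# Hypothetical page matching the structure described in the text.
PAGE = """<html>
<head><title>HTML test page</title></head>
<body>
<img src="http://www.example.com/logo.png">
<h1>Some web servers</h1>
<ul>
<li><a href="http://www.example.org/">first</a></li>
<li><a href="http://www.example.net/">second</a></li>
<li><a href="http://www.example.com/">third</a></li>
</ul>
</body>
</html>"""


class LinkCollector(HTMLParser):
    """Collect the URIs found in <a href=...> and <img src=...> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)  # attrs is a list of (name, value) pairs
        if tag == "a" and "href" in attributes:
            self.links.append(attributes["href"])
        elif tag == "img" and "src" in attributes:
            self.images.append(attributes["src"])


def collect(page):
    """Return (hyperlink URIs, image URIs) found in an HTML page."""
    collector = LinkCollector()
    collector.feed(page)
    return collector.links, collector.images


if __name__ == "__main__":
    print(collect(PAGE))
```

A browser would download the image URIs immediately to render the page, while the hyperlink URIs are only fetched when the user clicks on them.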
 the GET method is the most popular one. It is used to retrieve a document from a server. The
GET method is encoded as GET followed by the path of the URI of the requested document and
the version of HTTP used by the client. For example, to retrieve the http://www.w3.org/MarkUp/
URI, a client must open a TCP connection on port 80 with host www.w3.org and send an HTTP request
containing the following line :
GET /MarkUp/ HTTP/1.0
 the HEAD method is a variant of the GET method that allows the retrieval of the header lines
for a given URI without retrieving the entire document. It can be used by a client to verify if a
document exists, for instance.
 the POST method can be used by a client to send a document to a server. The sent document is
attached to the HTTP request as a MIME document.
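As an illustration of the GET method, the following Python sketch derives from a URI the host and port to contact and the text of the request. The function name is ours. The sketch also adds the Host: header, which is described later in this section, and it only builds the request; it does not open the TCP connection.

```python
from urllib.parse import urlsplit


def build_get_request(uri, version="HTTP/1.0"):
    """Return (host, port, request text) for a GET request on an http URI."""
    parts = urlsplit(uri)
    path = parts.path or "/"          # an empty path designates the default document
    if parts.query:
        path += "?" + parts.query
    port = parts.port or 80           # 80 is the default port for http
    request = (
        f"GET {path} {version}\r\n"
        f"Host: {parts.hostname}\r\n"
        "\r\n"                        # a blank line ends the request
    )
    return parts.hostname, port, request


if __name__ == "__main__":
    host, port, request = build_get_request("http://www.w3.org/MarkUp/")
    print(host, port)
    print(request)
```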
HTTP clients and servers can include many different HTTP headers in HTTP requests and responses. Each HTTP
header is encoded as a single ASCII-line terminated by CR and LF. Several of these headers are briefly described
below. A detailed discussion of all standard headers may be found in RFC 1945. The MIME headers can appear
in both HTTP requests and HTTP responses.
 the Content-Length: header is the MIME header that indicates the length of the MIME document in bytes.
 the Content-Type: header is the MIME header that indicates the type of the attached MIME document.
HTML pages use the text/html type.
 the Content-Encoding: header indicates how the MIME document has been encoded. For example, this
header would be set to x-gzip for a document compressed using the gzip software.
RFC 1945 and RFC 2616 define headers that are specific to HTTP responses. These server headers include :
 the Server: header indicates the version of the web server that has generated the HTTP response. Some
servers provide information about their software release and optional modules that they use. For security
reasons, some system administrators disable these headers to avoid revealing too much information about
their server to potential attackers.
 the Date: header indicates when the HTTP response has been produced by the server.
 the Last-Modified: header indicates the date and time of the last modification of the document attached to
the HTTP response.
Similarly, the following header lines can only appear inside HTTP requests sent by a client :
 the User-Agent: header provides information about the client that has generated the HTTP request. Some
servers analyse this header line and return different headers and sometimes different documents for different
user agents.
 the If-Modified-Since: header is followed by a date. It enables clients to cache in memory or on disk the
recent or most frequently used documents. When a client needs to request a URI from a server, it first checks
whether the document is already in its cache. If it is, the client sends a HTTP request with the If-Modified-Since:
header indicating the date of the cached document. The server will only return the document attached
to the HTTP response if it is newer than the version stored in the client's cache.
 the Referer: header is followed by a URI. It indicates the URI of the document that the client visited before
sending this HTTP request. Thanks to this header, the server can know the URI of the document containing
the hyperlink followed by the client, if any. This information is very useful to measure the impact of
advertisements containing hyperlinks placed on websites.
 the Host: header contains the fully qualified domain name of the URI being requested.
Note: The importance of the Host: header line
The first version of HTTP did not include the Host: header line. This was a severe limitation for web hosting companies. For example consider a web hosting company that wants to serve both web.example.com and
www.example.net on the same physical server. Both web sites contain a /index.html document. When a client
sends a request for either http://web.example.com/index.html or http://www.example.net/index.html, the HTTP 1.0
request contains the following line :
GET /index.html HTTP/1.0
By parsing this line, a server cannot determine which index.html file is requested. Thanks to the
Host: header line, the server knows whether the request is for http://web.example.com/index.html or
http://www.example.net/index.html. Without the Host: header, this is impossible. The Host: header line allowed
web hosting companies to develop their business by supporting a large number of independent web servers on the
same physical server.
The status line of the HTTP response begins with the version of HTTP used by the server (usually HTTP/1.0
defined in RFC 1945 or HTTP/1.1 defined in RFC 2616) followed by a three digit status code and additional
information in English. HTTP status codes have a similar structure as the reply codes used by SMTP.
 All status codes starting with digit 2 indicate a valid response. 200 OK indicates that the HTTP request was
successfully processed by the server and that the response is valid.
 All status codes starting with digit 3 indicate that the client must take further action to obtain the requested
document, or that it can reuse a copy that it already holds. 301 Moved Permanently indicates that the requested document is no longer available on this server.
A Location: header containing the new URI of the requested document is inserted in the HTTP response.
304 Not Modified is used in response to an HTTP request containing the If-Modified-Since: header. This
status line is used by the server if the document stored on the server is not more recent than the date indicated
in the If-Modified-Since: header.
 All status codes starting with digit 4 indicate that the server has detected an error in the HTTP request sent
by the client. 400 Bad Request indicates a syntax error in the HTTP request. 404 Not Found indicates that
the requested document does not exist on the server.
 All status codes starting with digit 5 indicate an error on the server. 500 Internal Server Error indicates that
the server could not process the request due to an error on the server itself.
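The choice between the 200 and 304 status codes for a request carrying the If-Modified-Since: header can be sketched as follows. This is a simplified model of the server's decision under our own assumptions: the function name is hypothetical, both dates use the HTTP date format with the GMT time zone, and error cases such as malformed dates are ignored.

```python
from email.utils import parsedate_to_datetime


def conditional_status(last_modified, if_modified_since=None):
    """Return 200 if the document must be sent, 304 if the cached copy is valid."""
    if if_modified_since is None:
        return 200                       # ordinary GET, no cached copy announced
    modified = parsedate_to_datetime(last_modified)
    cached = parsedate_to_datetime(if_modified_since)
    # The document is returned only if it is newer than the cached version.
    return 200 if modified > cached else 304


if __name__ == "__main__":
    last_modified = "Tue, 09 Mar 2010 21:26:53 GMT"
    print(conditional_status(last_modified))
    print(conditional_status(last_modified, "Mon, 15 Mar 2010 13:40:38 GMT"))
    print(conditional_status(last_modified, "Mon, 01 Mar 2010 08:00:00 GMT"))
```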
In both the HTTP request and the HTTP response, the MIME document refers to a representation of the document
with the MIME headers indicating the type of document and its size.
As an illustration of HTTP/1.0, the transcript below shows a HTTP request for http://www.ietf.org and the corresponding HTTP response. The HTTP request was sent using the curl command line tool. The User-Agent: header
line contains more information about this client software. There is no MIME document attached to this HTTP
request, and it ends with a blank line.
GET / HTTP/1.0
User-Agent: curl/7.19.4 (universal-apple-darwin10.0) libcurl/7.19.4 OpenSSL/0.9.8l zlib/1.2.3
Host: www.ietf.org
The HTTP response indicates the version of the server software used with the modules included. The Last-Modified: header indicates that the requested document was modified about one week before the request. A
HTML document (not shown) is attached to the response. Note the blank line between the header of the HTTP
response and the attached MIME document. The Server: header line has been truncated in this output.
HTTP/1.1 200 OK
Date: Mon, 15 Mar 2010 13:40:38 GMT
Server: Apache/2.2.4 (Linux/SUSE) mod_ssl/2.2.4 OpenSSL/0.9.8e (truncated)
Last-Modified: Tue, 09 Mar 2010 21:26:53 GMT
Content-Length: 17019
Content-Type: text/html

<!DOCTYPE HTML PUBLIC .../HTML>
HTTP was initially designed to share self-contained text documents. For this reason, and to ease the implementation of clients and servers, the designers of HTTP chose to open a TCP connection for each HTTP request.
This implies that a client must open one TCP connection for each URI that it wants to retrieve from a server as
illustrated on the figure below. For a web page containing only text documents this was a reasonable design choice
as the client usually remains idle while the (human) user is reading the retrieved document.
However, as the web evolved to support richer documents containing images, opening a TCP connection for each
URI became a performance problem [Mogul1995]. Indeed, besides its HTML part, a web page may include
GET / HTTP/1.1
Host: www.kame.net
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_2; en-us)
Connection: keep-alive
The server replies with the Connection: Keep-Alive header and indicates that it accepts a maximum of 100 HTTP
requests over this connection and that it will close the connection if it remains idle for 15 seconds.
HTTP/1.1 200 OK
Date: Fri, 19 Mar 2010 09:23:37 GMT
Server: Apache/2.0.63 (FreeBSD) PHP/5.2.12 with Suhosin-Patch
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Content-Length: 3462
Content-Type: text/html
<html>...
</html>
The client sends a second request for the style sheet of the retrieved web page.
GET /style.css HTTP/1.1
Host: www.kame.net
Referer: http://www.kame.net/
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_2; en-us)
Connection: keep-alive
The server replies with the requested style sheet and maintains the persistent connection. Note that the server only
accepts 99 remaining HTTP requests over this persistent connection.
HTTP/1.1 200 OK
Date: Fri, 19 Mar 2010 09:23:37 GMT
Server: Apache/2.0.63 (FreeBSD) PHP/5.2.12 with Suhosin-Patch
Last-Modified: Mon, 10 Apr 2006 05:06:39 GMT
Content-Length: 2235
Keep-Alive: timeout=15, max=99
Connection: Keep-Alive
Content-Type: text/css
...
Then the client automatically requests the web server's icon 9 , which could be displayed by the browser. This server
does not contain such a URI and thus replies with a 404 HTTP status. However, the underlying TCP connection is
not closed immediately.
GET /favicon.ico HTTP/1.1
Host: www.kame.net
Referer: http://www.kame.net/
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_2; en-us)
Connection: keep-alive
HTTP/1.1 404 Not Found
Date: Fri, 19 Mar 2010 09:23:40 GMT
Server: Apache/2.0.63 (FreeBSD) PHP/5.2.12 with Suhosin-Patch
Content-Length: 318
Keep-Alive: timeout=15, max=98
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> ...
9 Favorite icons are small icons that are used to represent web servers in the toolbar of Internet browsers. Microsoft added this feature
in their browsers without taking into account the W3C standards. See http://www.w3.org/2005/10/howto-favicon for a discussion on how to
cleanly support such favorite icons.
As illustrated above, a client can send several HTTP requests over the same persistent TCP connection. However,
it is important to note that all of these HTTP requests are considered to be independent by the server. Each HTTP
request must be self-contained. This implies that each request must include all the header lines that are required
by the server to understand the request. The independence of these requests is one of the important design choices
of HTTP. As a consequence of this design choice, when a server processes a HTTP request, it does not use any
other information than what is contained in the request itself. This explains why the client adds its User-Agent:
header in all of the HTTP requests it sends over the persistent TCP connection.
However, in practice, some servers want to provide content tuned for each user. For example, some servers
can provide information in several languages or other servers want to provide advertisements that are targeted to
different types of users. To do this, servers need to maintain some information about the preferences of each user
and use this information to produce content matching the user's preferences. HTTP contains several mechanisms
that address this problem. We discuss three of them below.
A first solution is to force the users to be authenticated. This was the solution used by FTP to control the files that
each user could access. Initially, user names and passwords could be included inside URIs RFC 1738. However,
placing passwords in the clear in a potentially publicly visible URI is completely insecure and this usage has now
been deprecated RFC 3986. HTTP supports several extension headers RFC 2617 that can be used by a server
to request the authentication of the client by providing his/her credentials. However, user names and passwords
have not been popular on web servers as they force human users to remember one user name and one password
per server. Remembering a password is acceptable when a user needs to access protected content, but users will
not accept the need for a user name and password only to receive targeted advertisements from the web sites that
they visit.
A second solution to allow servers to tune the content to the needs and capabilities of the user is to rely on
the different types of Accept-* HTTP headers. For example, the Accept-Language: header can be used by the client to
indicate its preferred languages. Unfortunately, in practice this header is usually set based on the default language
of the browser and it is not possible for a user to indicate the language they prefer by selecting options on
each visited web server.
The third, and widely adopted, solution is HTTP cookies. HTTP cookies were initially developed as a private
extension by Netscape. They are now part of the standard RFC 6265. In a nutshell, a cookie is a short string that
is chosen by a server to represent a given client. Two HTTP headers are used : Cookie: and Set-Cookie:. When a
server receives an HTTP request from a new client (i.e. an HTTP request that does not contain the Cookie: header),
it generates a cookie for the client and includes it in the Set-Cookie: header of the returned HTTP response. The
Set-Cookie: header contains several additional parameters including the domain names for which the cookie is
valid. The client stores all received cookies on disk and every time it sends a HTTP request, it verifies whether
it already knows a cookie for this domain. If so, it attaches the Cookie: header to the HTTP request. This is
illustrated in the figure below with HTTP 1.1, but cookies also work with HTTP 1.0.
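The exchange of the Set-Cookie: and Cookie: headers can be simulated with the http.cookies module of the Python standard library. The sketch below is a simplified model under our own assumptions: headers are plain dictionaries, the cookie is named id, the domain example.com is hypothetical, and the client stores a single cookie.

```python
import secrets
from http.cookies import SimpleCookie


def server_handle(request_headers, domain="example.com"):
    """Return (client identifier, extra response headers) for one request."""
    jar = SimpleCookie(request_headers.get("Cookie", ""))
    if "id" in jar:
        return jar["id"].value, {}            # known client, nothing to set
    client_id = secrets.token_hex(8)          # new client: choose a short string
    jar = SimpleCookie()
    jar["id"] = client_id
    jar["id"]["domain"] = domain              # domain for which the cookie is valid
    return client_id, {"Set-Cookie": jar["id"].OutputString()}


def client_headers(stored_set_cookie):
    """Build the Cookie: header from a stored Set-Cookie: value."""
    jar = SimpleCookie(stored_set_cookie)
    pairs = "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())
    return {"Cookie": pairs}


if __name__ == "__main__":
    first_id, response = server_handle({})     # first request, no Cookie: header
    print(response)
    second = client_headers(response["Set-Cookie"])
    print(second)
    print(server_handle(second)[0] == first_id)
```

Only the name and the value of the cookie are sent back by the client; parameters such as the domain stay on the client side.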
The caller encodes each data item in the appropriate sequence and the callee decodes the received information. Here
are a few examples extracted from RFC 1832 to illustrate how this encoding/decoding can be performed.
For basic data types, RFC 1832 simply maps their representation into a sequence of bytes. For example a 32-bit
integer is transmitted as follows (with the most significant byte first, which corresponds to big-endian encoding).
XDR also supports 64-bit integers and booleans. The booleans are mapped onto integers (0 for false and 1 for
true). For the floating point numbers, the encoding defined in the IEEE standard is used.
In this representation, the first bit (S) is the sign (0 represents positive). The next 11 bits represent the exponent of
the number (E), in base 2, and the remaining 52 bits are the fractional part of the number (F). The floating point
number that corresponds to this representation is $(-1)^{S} \times 2^{E-1023} \times 1.F$. XDR also allows complex
data types to be encoded. A first example is the string of bytes. A string of bytes is composed of two parts : a length (encoded
as an integer) and a sequence of bytes. For performance reasons, the encoding of a string is aligned to 32-bit
boundaries. This implies that some padding bytes may be inserted during the encoding operation if the length of
the string is not a multiple of 4. The structure of the string is shown below (source RFC 1832).
In some situations, it is necessary to encode fixed or variable length arrays. XDR RFC 1832 supports such
arrays. For example, the encoding below corresponds to a variable length array containing n elements. The
encoded representation starts with an integer that contains the number of elements and follows with all elements
in sequence. It is also possible to encode a fixed-length array. In this case, the first integer is missing.
XDR also supports the definition of unions, structures, ... Additional details are provided in RFC 1832.
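The encodings described above can be reproduced with the struct module of the Python standard library. The sketch below is a minimal illustration written for this text, not a complete XDR implementation; the function names are ours and the padding bytes are assumed to be zero.

```python
import struct


def xdr_int(value):
    """32-bit signed integer, most significant byte first."""
    return struct.pack(">i", value)


def xdr_bool(flag):
    """Booleans are mapped onto integers: 0 for false, 1 for true."""
    return xdr_int(1 if flag else 0)


def xdr_double(value):
    """64-bit IEEE floating point number: sign, 11-bit exponent, 52-bit fraction."""
    return struct.pack(">d", value)


def xdr_string(data):
    """Length as an integer, then the bytes, padded to a multiple of 4 bytes."""
    padding = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * padding


def xdr_int_array(values):
    """Variable-length array: number of elements, then each element in sequence."""
    return struct.pack(">I", len(values)) + b"".join(xdr_int(v) for v in values)


if __name__ == "__main__":
    print(xdr_int(1).hex())
    print(xdr_double(1.0).hex())
    print(xdr_string(b"hello").hex())
    print(xdr_int_array([2, 3, 5, 7]).hex())
```

For a fixed-length array, the leading count would simply be omitted since both sides already know the number of elements.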
A second popular method to encode data is the JavaScript Object Notation (JSON). This syntax was initially
defined to allow applications written in JavaScript to exchange data, but it has now wider usages. JSON RFC
4627 is a text-based representation. The simplest data type is the integer. It is represented as a sequence of
digits in ASCII. Strings can also be encoded by using JSON. A JSON string always starts and ends with a quote
character (") as in the C language. As in the C language, some characters (like " or \) must be escaped if they
appear in a string. RFC 4627 describes this in detail. Booleans are also supported by using the literals false and
true. Like XDR, JSON supports more complex data types. A structure or object is defined as a comma separated
list of elements enclosed in curly brackets. RFC 4627 provides the following example as an illustration.
{
  "Image": {
    "Width": 800,
    "Height": 600,
    "Title": "View from 15th Floor",
    "Thumbnail": {
      "Url": "http://www.example.com/image/481989943",
      "Height": 125,
      "Width": 100
    },
    "ID": 1234
  }
}
This object has one field named Image. It has five attributes. The first one, Width, is an integer set to 800. The
third one is a string. The fourth attribute, Thumbnail is also an object composed of three different attributes, one
string and two integers. JSON can also be used to encode arrays or lists. In this case, square brackets are used as
delimiters. The snippet below shows an array which contains the prime integers that are smaller than ten.
{
"Primes" : [ 2, 3, 5, 7 ]
}
Compared with XDR, the main advantage of JSON is that the transfer syntax is easily readable by a human.
However, this comes at the expense of a less compact encoding. Some data encoded in JSON will usually take
more space than when it is encoded with XDR. More compact encoding schemes have been defined, see e.g.
[BH2013] and the references therein.
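The difference in size can be illustrated with a small Python sketch. It encodes the same list of integers once as compact JSON text and once as an XDR-style variable length array. The sample values and the helper name are assumptions chosen for illustration, and the XDR encoding is limited to 32-bit signed integers.

```python
import json
import struct

def xdr_int_array(values):
    """XDR-style variable length array: element count, then 32-bit integers."""
    return struct.pack(">I", len(values)) + b"".join(struct.pack(">i", v) for v in values)

values = [100000, 200000, 300000, 400000]          # hypothetical sample data
as_json = json.dumps(values, separators=(",", ":")).encode("ascii")
as_xdr = xdr_int_array(values)

print(as_json)                    # human readable text
print(len(as_json), len(as_xdr))  # size in bytes of each encoding
```

The JSON text can be read directly by a human, while the XDR bytes need a decoder but occupy fewer bytes for these values.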
Upon reception of this JSON structure, the callee parses the object, locates the corresponding method and passes
the parameters. This method returns a response which is also encoded as a JSON structure. This response contains
the following information :
 jsonrpc: a string indicating the version of the protocol used to encode the response
 id: the same identifier as the identifier chosen by the caller
 result: if the request succeeded, this member contains the result of the request (in our example, value 4).
 error: if the method called does not exist or its execution causes an error, the result element will be replaced
by an error element which contains the following members :
   code: a number that indicates the type of error. Several error codes are defined in [JSON-RPC2]. For
  example, -32700 indicates an error in parsing the request, -32602 indicates invalid parameters and
  -32601 indicates that the method could not be found on the server. Other error codes are listed in
  [JSON-RPC2].
   message: a string (limited to one sentence) that provides a short description of the error.
   data: an optional field that provides additional information about the error.
Coming back to our example with the call for the sum procedure, it would return the following JSON structure.
{ "jsonrpc": "2.0", "result": 4, "id": 1}
If the sum method is not implemented on the server, it would reply with the following response.
{ "jsonrpc": "2.0", "error": {"code": -32601, "message": "Method not found"}, "id": 1}
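The construction of these two kinds of responses can be sketched as follows. This is a simplified illustration, not a complete JSON-RPC implementation: the METHODS table and the handle_request function are hypothetical names, the request is assumed to be a single JSON object with positional parameters, and only the three error codes mentioned above are produced.

```python
import json

# Hypothetical table of remote methods; "sum" mirrors the example in the text.
METHODS = {"sum": lambda *args: sum(args)}

def handle_request(raw):
    """Return the JSON-RPC 2.0 response (as a string) for one request."""
    try:
        request = json.loads(raw)
    except json.JSONDecodeError:
        # The request could not be parsed, so its id is unknown.
        return json.dumps({"jsonrpc": "2.0", "id": None,
                           "error": {"code": -32700, "message": "Parse error"}})
    response = {"jsonrpc": "2.0", "id": request.get("id")}
    method = METHODS.get(request.get("method"))
    if method is None:
        response["error"] = {"code": -32601, "message": "Method not found"}
    else:
        try:
            response["result"] = method(*request.get("params", []))
        except TypeError:
            response["error"] = {"code": -32602, "message": "Invalid params"}
    return json.dumps(response)

print(handle_request('{"jsonrpc": "2.0", "method": "sum", "params": [1, 3], "id": 1}'))
print(handle_request('{"jsonrpc": "2.0", "method": "mul", "params": [1, 3], "id": 2}'))
```

In both cases the id of the request is copied into the response so that the caller can match them.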
The id field, which is present in the request and the response, plays the same role as the identifier field in the
DNS message. It allows the caller to match the response with the request that it sent. This id is very important
when JSON-RPC is used over the connectionless service, which is unreliable. If a request is sent, it may need to
be retransmitted and it is possible that a callee will receive the same request twice (e.g. if the response for the
first request was lost). In the DNS, when a request is lost, it can be retransmitted without causing any difficulty.
However, with remote procedure calls in general, losses can cause some problems. Consider a method which is
used to deposit money into a bank account. If the request is lost, it will be retransmitted and the deposit will
eventually be performed. However, if the response is lost, the caller will also retransmit its request. This request will
be received by the callee, which will deposit the money again. To prevent this problem from affecting the application,
either the programmer must ensure that the remote procedures that it calls can be safely called multiple times or the
application must verify whether the request has been transmitted earlier. In most deployments, the programmers
use remote methods that can be safely called multiple times without breaking the application logic.
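The second option, verifying whether a request has already been processed, can be sketched as follows. The Account class and its completed table are hypothetical; the sketch assumes that a retransmitted request carries the same id as the original one and that ids are never reused by the caller.

```python
class Account:
    """Toy bank account that remembers the ids of the requests it has processed."""

    def __init__(self):
        self.balance = 0
        self.completed = {}   # request id -> result returned for that request

    def deposit(self, request_id, amount):
        # A retransmitted request has an id that is already known: return the
        # cached result instead of depositing the money a second time.
        if request_id in self.completed:
            return self.completed[request_id]
        self.balance += amount
        self.completed[request_id] = self.balance
        return self.balance

account = Account()
account.deposit(1, 100)   # original request
account.deposit(1, 100)   # retransmission after a lost response
print(account.balance)    # 100
```

A real server would also need to bound the size of the table, for example by expiring old entries.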
ONC-RPC uses a more complex method to allow a caller to reach the callee. On a host, server processes can run on
different ports and given the limited number of port values (2^16 per host on the Internet), it is impossible to reserve
one port number for each method. The solution used in ONC-RPC (RFC 1831) is to use a special method which is
called the portmapper (RFC 1833). The portmapper is a kind of directory that runs on a server that hosts methods.
The portmapper runs on a standard port (111 for ONC-RPC, RFC 1833). A server process that implements a
method registers its method on the local portmapper. When a caller needs to call a method on a remote server, it
first contacts the portmapper to obtain the port number of the server which implements the method. The response
from the portmapper allows it to directly contact the server process which implements the method.
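The directory role of the portmapper can be sketched with a simple in-memory table. This is only an illustration of the register and lookup steps, without any network exchange. The class name is hypothetical and the key (program number, version, protocol) is an assumption modelled on RFC 1833; the sample entry uses the program number and port commonly associated with NFS.

```python
PORTMAPPER_PORT = 111   # standard port on which the portmapper listens

class Portmapper:
    """Directory mapping (program, version, protocol) to a port number."""

    def __init__(self):
        self._ports = {}

    def register(self, program, version, protocol, port):
        # Called by a local server process when it starts.
        self._ports[(program, version, protocol)] = port

    def lookup(self, program, version, protocol):
        # Called on behalf of a remote caller; 0 means "not registered".
        return self._ports.get((program, version, protocol), 0)

pm = Portmapper()
pm.register(100003, 3, "tcp", 2049)
print(pm.lookup(100003, 3, "tcp"))   # 2049
```

Once the caller knows the port, it contacts the server process directly and no longer needs the portmapper.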
TCP provides a reliable bytestream, connection-oriented transport service on top of the unreliable connectionless
network service provided by IP. TCP is used by a large number of applications, including :
 Email (SMTP, POP, IMAP)
 World Wide Web (HTTP, ...)
 Most file transfer protocols (ftp, peer-to-peer file sharing applications, ...)
 remote computer access : telnet, ssh, X11, VNC, ...
 non-interactive multimedia applications : flash
On the global Internet, most of the applications used in the wide area rely on TCP. Many studies have reported
that TCP was responsible for more than 90% of the data exchanged in the global Internet.
To provide this service, TCP relies on a simple segment format that is shown in the figure below. Each TCP
segment contains a header described below and, optionally, a payload. The default length of the TCP header is
twenty bytes, but some TCP headers contain options.
 the flags field contains a set of bit flags that indicate how a segment should be interpreted by the TCP entity
receiving it :
 the SYN flag is used during connection establishment
 the FIN flag is used during connection release
 the RST flag is used in case of problems or when an invalid segment has been received
 when the ACK flag is set, it indicates that the acknowledgment field contains a valid number. Otherwise, the
content of the acknowledgment field must be ignored by the receiver
 the URG flag is used together with the Urgent pointer
 the PSH flag is used as a notification from the sender to indicate to the receiver that it should pass all
the data it has received to the receiving process. However, in practice TCP implementations do not
allow TCP users to indicate when the PSH flag should be set and thus there are few real utilizations of
this flag.
 the checksum field contains the value of the Internet checksum computed over the entire TCP segment and
a pseudo-header as with UDP
 the Reserved field was initially reserved for future utilization. It is now used by RFC 3168.
 the TCP Header Length (THL) or Data Offset field is a four-bit field that indicates the size of the TCP
header in 32-bit words. As the largest value of this field is 15, the maximum size of the TCP header is 60 bytes.
 the Optional header extension is used to add optional information to the TCP header. Thanks to this header
extension, it is possible to add new fields to the TCP header that were not planned in the original
specification. This has allowed TCP to evolve since the early eighties. The details of the TCP header extension are
explained in sections TCP connection establishment and TCP reliable data transfer.
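The fields described above can be extracted from the first twenty bytes of a segment with a short Python sketch. It assumes the standard layout of the TCP header (ports, sequence number, acknowledgment number, data offset and flags, window, checksum, urgent pointer); the function name and the dictionary keys are hypothetical, and the options are returned as raw bytes without being decoded.

```python
import struct

FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG"]   # bit 0 up to bit 5

def parse_tcp_header(segment):
    """Decode the fixed 20-byte part of a TCP header (network byte order)."""
    (src, dst, seq, ack, offset_flags,
     window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    header_len = (offset_flags >> 12) * 4      # Data Offset counts 32-bit words
    flags = {name for bit, name in enumerate(FLAG_NAMES) if offset_flags & (1 << bit)}
    return {"src_port": src, "dst_port": dst, "seq": seq, "ack": ack,
            "header_len": header_len, "flags": flags, "window": window,
            "checksum": checksum, "urgent": urgent,
            "options": segment[20:header_len]}

# A SYN segment without options: Data Offset is 5 words, that is 20 bytes.
syn = struct.pack("!HHIIHHHH", 1234, 80, 1000, 0, (5 << 12) | 0x02, 65535, 0, 0)
print(parse_tcp_header(syn)["flags"], parse_tcp_header(syn)["header_len"])
```

With the largest Data Offset value (15), header_len is 60 bytes, which leaves 40 bytes for options.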
is chosen by the server, he can send a fake SYN segment and shortly after the fake ACK segment confirming the
reception of the SYN+ACK segment sent by the server. Once the TCP connection is open, he can use it to send
any command to the server. To counter this attack, current TCP implementations add randomness to the ISN. One
of the solutions, proposed in RFC 1948 is to compute the ISN as
ISN = M + H(localhost, localport, remotehost, remoteport, secret).
where M is the current value of the TCP clock and H is a cryptographic hash function. localhost and remotehost
(resp. localport and remoteport) are the IP addresses (resp. port numbers) of the local and remote host and secret is a
random number only known by the server. This method allows the server to use different ISNs for different clients
at the same time. Measurements performed with the first implementations of this technique showed that it was
difficult to implement it correctly, but today's TCP implementations generate good ISNs.
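The formula above can be sketched in Python as follows. Several choices here are assumptions made for illustration: SHA-256 is used as the hash function H, only its first four bytes are kept, and M is modelled as a clock that is incremented every 4 microseconds. The function name and its now parameter are hypothetical.

```python
import hashlib
import time

def initial_sequence_number(localhost, localport, remotehost, remoteport, secret, now=None):
    """ISN = M + H(localhost, localport, remotehost, remoteport, secret), modulo 2**32."""
    if now is None:
        now = time.time()
    m = int(now * 250_000)            # M: assumed clock with one tick every 4 microseconds
    material = f"{localhost}|{localport}|{remotehost}|{remoteport}|".encode() + secret
    # H: SHA-256 is an illustrative choice; the first 4 bytes give a 32-bit value.
    h = int.from_bytes(hashlib.sha256(material).digest()[:4], "big")
    return (m + h) % 2**32

isn = initial_sequence_number("192.0.2.1", 80, "198.51.100.7", 40000, b"server-secret")
print(isn)
```

Since the secret is only known by the server, an attacker who observes the ISN of one of his own connections cannot easily derive the ISN that the server would choose for another client.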
A server could, of course, refuse to open a TCP connection upon reception of a SYN segment. This refusal may be
due to various reasons. There may be no server process that is listening on the destination port of the SYN segment.
The server could always refuse connection establishments from this particular client (e.g. due to security reasons)
or the server may not have enough resources to accept a new TCP connection at that time. In this case, the server
would reply with a TCP segment having its RST flag set and containing the sequence number of the received SYN
segment as its acknowledgment number. This is illustrated in the figure below. We discuss the other utilizations
of the TCP RST flag later (see TCP connection release).
RCVD state. It remains in this state until it receives an ACK segment that acknowledges its SYN+ACK segment,
at which point it enters the Established state.
Apart from these two paths in the TCP connection establishment FSM, there is a third path that corresponds to the
case when both the client and the server send a SYN segment to open a TCP connection 16 . In this case, the client
and the server send a SYN segment and enter the SYN Sent state. Upon reception of the SYN segment sent by the
other host, they reply by sending a SYN+ACK segment and enter the SYN RCVD state. The SYN+ACK that arrives
from the other host allows each of them to transition to the Established state. The figure below illustrates such a
simultaneous establishment of a TCP connection.
16 Of course, such a simultaneous TCP establishment can only occur if the source port chosen by the client is equal to the destination
port chosen by the server. This may happen when a host can serve both as a client and as a server or in peer-to-peer applications when the
communicating hosts do not use ephemeral port numbers.
Sending a packet with a different source IP address than the address allocated to the host is called sending a spoofed packet.
 window : a TCP receiver uses this 16-bit field to indicate the current size of its receive window expressed
in bytes.
Note: The Transmission Control Block
For each established TCP connection, a TCP implementation must maintain a Transmission Control Block (TCB).
A TCB contains all the information required to send and receive segments on this connection RFC 793. This
includes 19 :
 the local IP address
 the remote IP address
 the local TCP port number
 the remote TCP port number
 the current state of the TCP FSM
 the maximum segment size (MSS)
 snd.nxt : the sequence number of the next byte in the byte stream (the first byte of a new data segment that
you send uses this sequence number)
 snd.una : the earliest sequence number that has been sent but has not yet been acknowledged
 snd.wnd : the current size of the sending window (in bytes)
 rcv.nxt : the sequence number of the next byte that is expected to be received from the remote host
 rcv.wnd : the current size of the receive window advertised by the remote host
 sending buffer : a buffer used to store all unacknowledged data
 receiving buffer : a buffer to store all data received from the remote host that has not yet been delivered
to the user. Data may be stored in the receiving buffer because either it was not received in sequence or
because the user is too slow to process it
The original TCP specification can be categorised as a transport protocol that provides a byte stream service and
uses go-back-n.
To send new data on an established connection, a TCP entity performs the following operations on the
corresponding TCB. It first checks that the sending buffer does not contain more data than the receive window advertised by
the remote host (rcv.wnd). If the window is not full, up to MSS bytes of data are placed in the payload of a TCP
segment. The sequence number of this segment is the sequence number of the first byte of the payload. It is set to
the first available sequence number : snd.nxt and snd.nxt is incremented by the length of the payload of the TCP
segment. The acknowledgement number of this segment is set to the current value of rcv.nxt and the window field
of the TCP segment is computed based on the current occupancy of the receiving buffer. The data is kept in the
sending buffer in case it needs to be retransmitted later.
When a TCP segment with the ACK flag set is received, the following operations are performed. rcv.wnd is set
to the value of the window field of the received segment. The acknowledgement number is compared to snd.una.
The newly acknowledged data is removed from the sending buffer and snd.una is updated. If the TCP segment
contained data, the sequence number is compared to rcv.nxt. If they are equal, the segment was received in
sequence and the data can be delivered to the user and rcv.nxt is updated. The contents of the receiving buffer are
then checked to see whether other data already present in this buffer can be delivered in sequence to the user. If so,
rcv.nxt is updated again. If the sequence number differs from rcv.nxt, the segment's payload is placed in the
receiving buffer.
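The two procedures above can be sketched with a small Python class. This is a simplified model and not a TCP implementation: sequence numbers are plain integers that never wrap, segments are represented as tuples, retransmissions are omitted, and the names TCB, send, receive and delivered are chosen for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class TCB:
    mss: int = 1440
    snd_nxt: int = 0        # snd.nxt
    snd_una: int = 0        # snd.una
    rcv_nxt: int = 0        # rcv.nxt
    rcv_wnd: int = 65535    # rcv.wnd, window advertised by the remote host
    sending_buffer: dict = field(default_factory=dict)     # seq -> unacknowledged payload
    receiving_buffer: dict = field(default_factory=dict)   # seq -> out-of-sequence payload
    delivered: bytearray = field(default_factory=bytearray)

    def send(self, data):
        """Return one segment (seq, ack, payload), or None if the window is full."""
        room = min(self.mss, self.rcv_wnd - (self.snd_nxt - self.snd_una))
        if room <= 0 or not data:
            return None
        payload = data[:room]
        segment = (self.snd_nxt, self.rcv_nxt, payload)
        self.sending_buffer[self.snd_nxt] = payload   # kept for retransmission
        self.snd_nxt += len(payload)
        return segment

    def receive(self, seq, ack, window, payload=b""):
        """Process a received segment whose ACK flag is set."""
        self.rcv_wnd = window
        if ack > self.snd_una:
            # remove the newly acknowledged data from the sending buffer
            for s in [s for s, p in self.sending_buffer.items() if s + len(p) <= ack]:
                del self.sending_buffer[s]
            self.snd_una = ack
        if payload and seq >= self.rcv_nxt:
            self.receiving_buffer[seq] = payload
            # deliver to the user everything that is now in sequence
            while self.rcv_nxt in self.receiving_buffer:
                chunk = self.receiving_buffer.pop(self.rcv_nxt)
                self.delivered += chunk
                self.rcv_nxt += len(chunk)
```

For example, after `tcb = TCB(mss=4)`, the call `tcb.send(b"abcdefgh")` returns a segment carrying the first four bytes and advances snd_nxt to 4.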
Segment transmission strategies
In a transport protocol such as TCP that offers a bytestream, a practical issue that was left as an implementation
choice in RFC 793 is to decide when a new TCP segment containing data must be sent. There are two simple and
extreme implementation choices. The first implementation choice is to send a TCP segment as soon as the user
has requested the transmission of some data. This allows TCP to provide a low delay service. However, if the
user is sending data one byte at a time, TCP would place each user byte in a segment containing 20 bytes of TCP
header 20 . This is a huge overhead that is not acceptable in wide area networks. A second simple solution would
be to only transmit a new TCP segment once the user has produced MSS bytes of data. This solution reduces the
overhead, but at the cost of a potentially very high delay.
19 A complete TCP implementation contains additional information in its TCB, notably to support the urgent pointer. However, this part of
TCP is not discussed in this book. Refer to RFC 793 and RFC 2140 for more details about the TCB.
An elegant solution to this problem was proposed by John Nagle in RFC 896. John Nagle observed that the
overhead caused by the TCP header was a problem in wide area connections, but less in local area connections
where the available bandwidth is usually higher. He proposed the following rules to decide whether to send a new data
segment when new data has been produced by the user or a new ack segment has been received :
if rcv.wnd >= MSS and len(data) >= MSS :
    send one MSS-sized segment
else:
    if there are unacknowledged data:
        place data in buffer until acknowledgement has been received
    else:
        send one TCP segment containing all buffered data
The first rule ensures that a TCP connection used for bulk data transfer always sends full TCP segments. The
second rule sends one partially filled TCP segment every round-trip-time.
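These rules can be turned into a small runnable sketch. The function below is a simplified model: it is called each time the user produces data or an acknowledgement arrives, it takes the current state as explicit arguments, and it returns the segment to send (or None) together with the data that stays in the buffer. The function and parameter names are hypothetical.

```python
def nagle_decide(buffered, rcv_wnd, mss, unacked):
    """Apply Nagle's rules; return (payload_to_send_or_None, remaining_buffer)."""
    if rcv_wnd >= mss and len(buffered) >= mss:
        # first rule: enough data and window for a full segment
        return buffered[:mss], buffered[mss:]
    if unacked > 0 or not buffered:
        # some data is still unacknowledged (or nothing to send): keep buffering
        return None, buffered
    # no data in flight: send one segment containing all buffered data
    return buffered, b""

print(nagle_decide(b"a", rcv_wnd=65535, mss=1440, unacked=0))    # (b'a', b'')
print(nagle_decide(b"b", rcv_wnd=65535, mss=1440, unacked=1))    # (None, b'b')
```

An interactive application that writes one byte at a time therefore sends its first byte immediately and then at most one small segment per round-trip-time.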
This algorithm, called the Nagle algorithm, takes a few lines of code in all TCP implementations. These lines of
code have a huge impact on the packets that are exchanged in TCP/IP networks. Researchers have analysed the
distribution of the packet sizes by capturing and analysing all the packets passing through a given link. These
studies have shown several important results :
 in TCP/IP networks, a large fraction of the packets are TCP segments that contain only an acknowledgement.
These packets usually account for 40-50% of the packets passing through the studied link
 in TCP/IP networks, most of the bytes are exchanged in long packets, usually packets containing about 1440
bytes of payload which is the default MSS for hosts attached to an Ethernet network, the most popular type
of LAN
Recent measurements indicate that these packet size distributions are still valid in today's Internet, although the
packet distribution tends to become bimodal with small packets corresponding to TCP pure acks and large
1440-byte packets carrying most of the user data [SMASU2012].
RTT          Maximum Throughput
1 msec       524 Mbps
10 msec      52.4 Mbps
100 msec     5.24 Mbps
500 msec     1.05 Mbps
To solve this problem, a backward compatible extension that allows TCP to use larger receive windows was
proposed in RFC 1323. Today, most TCP implementations support this option. The basic idea is that instead of
storing snd.wnd and rcv.wnd as 16-bit integers in the TCB, they should be stored as 32-bit integers. As the TCP
segment header only contains 16 bits to place the window field, it is impossible to copy the value of snd.wnd in
each sent TCP segment. Instead the header contains snd.wnd >> S where S is the scaling factor ($0 \le S \le 14$)
negotiated during connection establishment. The client adds its proposed scaling factor as a TCP option in the
SYN segment. If the server supports RFC 1323, it places in the SYN+ACK segment the scaling factor that it uses
when advertising its own receive window. The local and remote scaling factors are included in the TCB. If the
server does not support RFC 1323, it ignores the received option and no scaling is applied.
20 This TCP segment is then placed in an IP packet. We describe IPv6 in the next chapter. The minimum size of the IPv6 (resp. IPv4)
header is 40 bytes (resp. 20 bytes).
21 A precise estimation of the maximum bandwidth that can be achieved by a TCP connection should take into account the overhead of the
TCP and IP headers as well.
By using the window scaling extensions defined in RFC 1323, TCP implementations can use a receive buffer
of up to 1 GByte. With such a receive buffer, the maximum throughput that can be achieved by a single TCP
connection becomes :
RTT          Maximum Throughput
1 msec       8590 Gbps
10 msec      859 Gbps
100 msec     86 Gbps
500 msec     17 Gbps
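These figures follow from the fact that a sender can have at most one window of data in flight per round-trip-time, so the maximum throughput is the window size divided by the round-trip-time. The sketch below recomputes them for the largest unscaled window (65535 bytes) and for the largest scaled window (65535 shifted by S = 14 bits). Header overhead is ignored and the names are chosen for illustration.

```python
def max_throughput_bps(window_bytes, rtt_seconds):
    """At most one window of data can be sent per round-trip-time."""
    return window_bytes * 8 / rtt_seconds

UNSCALED = 65535          # largest value of the 16-bit window field
SCALED = 65535 << 14      # largest window with the maximum scaling factor S = 14

for rtt_ms in (1, 10, 100, 500):
    rtt = rtt_ms / 1000
    print(f"{rtt_ms:>3} msec"
          f"  {max_throughput_bps(UNSCALED, rtt) / 1e6:8.2f} Mbps"
          f"  {max_throughput_bps(SCALED, rtt) / 1e9:8.1f} Gbps")
```

The scaled window is about 1 GByte, which multiplies each throughput by 2^14 compared to the unscaled case.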
These throughputs are acceptable in today's networks. However, there are already servers having 10 Gbps
interfaces... Early TCP implementations had fixed receiving and sending buffers. Today's high performance
implementations are able to automatically adjust the size of the sending and receiving buffer to better support high
bandwidth flows [SMM1998].
Figure 3.27: Disambiguating round-trip-time measurements with the RFC 1323 timestamp option
Once the round-trip-time measurements have been collected for a given TCP connection, the TCP entity must
compute the retransmission timeout. As the round-trip-time measurements may change during the lifetime of a
connection, the retransmission timeout may also change. At the beginning of a connection 25 , the TCP entity that
sends a SYN segment does not know the round-trip-time to reach the remote host and the initial retransmission
timeout is usually set to 3 seconds RFC 2988.
The original TCP specification proposed in RFC 793 to include two additional variables in the TCB :
 srtt : the smoothed round-trip-time computed as $\mathit{srtt} = (\alpha \times \mathit{srtt}) + ((1 - \alpha) \times \mathit{rtt})$ where rtt is the
round-trip-time measured according to the above procedure and $\alpha$ a smoothing factor (e.g. 0.8 or 0.9)
 rto : the retransmission timeout is computed as $\mathit{rto} = \min(60, \max(1, \beta \times \mathit{srtt}))$ where $\beta$ is used to take
into account the delay variance (value : 1.3 to 2.0). The 60 and 1 constants are used to ensure that the rto is
not larger than one minute nor smaller than 1 second.
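The two formulas above translate directly into code. In this sketch, times are expressed in seconds, the default values of alpha and beta are picked from the ranges given above, and the function name is hypothetical.

```python
def rfc793_update(srtt, rtt, alpha=0.9, beta=2.0):
    """Return the new (srtt, rto) after one round-trip-time measurement."""
    srtt = alpha * srtt + (1 - alpha) * rtt
    rto = min(60.0, max(1.0, beta * srtt))   # keep the rto between 1 and 60 seconds
    return srtt, rto

srtt, rto = rfc793_update(srtt=0.100, rtt=0.300)
print(srtt, rto)
```

With a smoothed round-trip-time of a few hundred milliseconds, beta * srtt is below one second and the rto is therefore clamped to its lower bound.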
24 Some security experts have raised concerns that using the real-time clock to set the TSval in the timestamp option can leak information
such as the system's up-time. Solutions proposed to solve this problem may be found in [CNPI09]
25 As a TCP client often establishes several parallel or successive connections with the same server, RFC 2140 has proposed to reuse for
a new connection some information that was collected in the TCB of a previous connection, such as the measured rtt. However, this solution
has not been widely implemented.
However, in practice, this computation for the retransmission timeout did not work well. The main problem was
that the computed rto did not correctly take into account the variations in the measured round-trip-time. Van Jacobson proposed in his seminal paper [Jacobson1988] an improved algorithm to compute the rto and implemented
it in the BSD Unix distribution. This algorithm is now part of the TCP standard RFC 2988.
Jacobson's algorithm uses two state variables, srtt the smoothed rtt and rttvar the estimation of the variance of
the rtt and two parameters : $\alpha$ and $\beta$. When a TCP connection starts, the first rto is set to 3 seconds. When a first
estimation of the rtt is available, the srtt, rttvar and rto are computed as follows :
srtt=rtt
rttvar=rtt/2
rto=srtt+4*rttvar
Then, when other rtt measurements are collected, srtt and rttvar are updated as follows :
$\mathit{rttvar} = (1 - \beta) \times \mathit{rttvar} + \beta \times |\mathit{srtt} - \mathit{rtt}|$
$\mathit{srtt} = (1 - \alpha) \times \mathit{srtt} + \alpha \times \mathit{rtt}$
$\mathit{rto} = \mathit{srtt} + 4 \times \mathit{rttvar}$
The proposed values for the parameters are $\alpha = \frac{1}{8}$ and $\beta = \frac{1}{4}$. This allows a TCP implementation, implemented
in the kernel, to perform the rtt computation by using shift operations instead of the more costly floating point
operations [Jacobson1988]. The figure below illustrates the computation of the rto upon rtt changes.
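The computation can be sketched with a small Python class. For readability, this sketch uses floating point arithmetic instead of the shift operations used by kernel implementations, expresses all times in seconds and does not apply any lower or upper bound to the rto. The class name is hypothetical.

```python
class RtoEstimator:
    ALPHA = 1 / 8
    BETA = 1 / 4

    def __init__(self):
        self.srtt = None
        self.rttvar = None
        self.rto = 3.0          # initial timeout, before any measurement

    def update(self, rtt):
        """Take one rtt measurement into account and return the new rto."""
        if self.srtt is None:   # first measurement
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # rttvar is updated first, using the previous value of srtt
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        self.rto = self.srtt + 4 * self.rttvar
        return self.rto

estimator = RtoEstimator()
for sample in (1.0, 1.0, 2.0):
    print(estimator.update(sample))   # 3.0, then 2.5, then 3.25
```

When the measurements are stable, rttvar shrinks and the rto converges towards srtt. A sudden change in the rtt increases rttvar and thus the rto.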
returns empty TCP segments whose only useful information is their acknowledgement number. This may cause
a large overhead in wide area networks if a pure ACK segment is sent in response to each received data segment.
Most TCP implementations use a delayed acknowledgement strategy. This strategy ensures that piggybacking is
used whenever possible. Otherwise, a pure ACK segment is sent for every second received data segment when
there are no losses. When there are losses or reordering, ACK segments are more important for the sender and
they are sent immediately (RFC 813, RFC 1122). This strategy relies on a new timer with a short delay (e.g. 50
milliseconds) and one additional flag in the TCB. It can be implemented as follows :
reception of a data segment:
    if pkt.seq==rcv.nxt:
        # segment received in sequence
        if delayedack :
            send pure ack segment
            cancel acktimer
            delayedack=False
        else:
            delayedack=True
            start acktimer
    else:
        # out of sequence segment
        send pure ack segment
        if delayedack:
            delayedack=False
            cancel acktimer
transmission of a data segment:
    if delayedack:
        delayedack=False
        cancel acktimer
    # piggyback ack
acktimer expiration:
    send pure ack segment
    delayedack=False
Due to this delayed acknowledgement strategy, during a bulk transfer, a TCP implementation usually acknowledges every second TCP segment received.
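The behaviour of this strategy can be observed with a small runnable model. In this sketch the timer is replaced by a flag and an explicit on_timer_expired call, pure acknowledgements are recorded in a list instead of being sent, and the class and method names are chosen for illustration.

```python
class DelayedAckReceiver:
    def __init__(self, rcv_nxt=0):
        self.rcv_nxt = rcv_nxt
        self.delayedack = False
        self.timer_running = False
        self.sent_acks = []     # acknowledgement numbers of the pure ACKs sent

    def _send_pure_ack(self):
        self.sent_acks.append(self.rcv_nxt)
        self.delayedack = False
        self.timer_running = False            # cancel acktimer

    def on_data_segment(self, seq, length):
        if seq == self.rcv_nxt:               # segment received in sequence
            self.rcv_nxt += length
            if self.delayedack:
                self._send_pure_ack()         # second segment: acknowledge both
            else:
                self.delayedack = True        # wait for more data or for the timer
                self.timer_running = True
        else:                                 # out of sequence segment
            self._send_pure_ack()

    def on_data_sent(self):
        self.delayedack = False               # the ack was piggybacked
        self.timer_running = False

    def on_timer_expired(self):
        if self.timer_running:
            self._send_pure_ack()

receiver = DelayedAckReceiver()
for seq in (0, 100, 200, 300):
    receiver.on_data_segment(seq, 100)
print(receiver.sent_acks)   # [200, 400]
```

Four in-sequence segments trigger only two pure acknowledgements, one for every second segment.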
The default go-back-n retransmission strategy used by TCP has the advantage of being simple to implement, in
particular on the receiver side, but when there are losses, a go-back-n strategy provides a lower performance than
a selective repeat strategy. The TCP developers have designed several extensions to TCP to allow it to use a
selective repeat strategy while maintaining backward compatibility with older TCP implementations. These TCP
extensions assume that the receiver is able to buffer the segments that it receives out-of-sequence.
The first extension that was proposed is the fast retransmit heuristic. This extension can be implemented on TCP
senders and thus does not require any change to the protocol. It only assumes that the TCP receiver is able to
buffer out-of-sequence segments.
From a performance point of view, one issue with TCP's retransmission timeout is that when there are isolated
segment losses, the TCP sender often remains idle waiting for the expiration of its retransmission timeouts. Such
isolated losses are frequent in the global Internet [Paxson99]. A heuristic to deal with isolated losses without
waiting for the expiration of the retransmission timeout has been included in many TCP implementations since
the early 1990s. To understand this heuristic, let us consider the figure below that shows the segments exchanged
over a TCP connection when an isolated segment is lost.
As shown above, when an isolated segment is lost the sender receives several duplicate acknowledgements since
the TCP receiver immediately sends a pure acknowledgement when it receives an out-of-sequence segment. A
duplicate acknowledgement is an acknowledgement that contains the same acknowledgement number as a previous
segment. A single duplicate acknowledgement does not necessarily imply that a segment was lost, as a simple
reordering of the segments may cause duplicate acknowledgements as well. Measurements [Paxson99] have
shown that segment reordering is frequent in the Internet. Based on these observations, the fast retransmit heuristic
has been included in most TCP implementations. It can be implemented as follows :
ack arrival:
    if tcp.ack==snd.una:
        # duplicate acknowledgement
        dupacks++
        if dupacks==3:
            retransmit segment(snd.una)
    else:
        dupacks=0
        # process acknowledgement
This heuristic requires an additional variable in the TCB (dupacks). Most implementations set the default number
of duplicate acknowledgements that trigger a retransmission to 3. It is now part of the standard TCP specification
RFC 2581. The fast retransmit heuristic improves the TCP performance provided that isolated segments are lost
and the current window is large enough to allow the sender to receive three duplicate acknowledgements.
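A runnable version of this heuristic is sketched below. The retransmission itself is not performed: the sequence numbers that would be retransmitted are simply recorded in a list. The class name and the retransmitted attribute are hypothetical, and sequence number wrap-around is ignored.

```python
class FastRetransmitSender:
    DUPACK_THRESHOLD = 3

    def __init__(self, snd_una=0):
        self.snd_una = snd_una
        self.dupacks = 0
        self.retransmitted = []   # sequence numbers retransmitted by the heuristic

    def on_ack(self, ack):
        if ack == self.snd_una:
            # duplicate acknowledgement
            self.dupacks += 1
            if self.dupacks == self.DUPACK_THRESHOLD:
                self.retransmitted.append(self.snd_una)
        else:
            self.dupacks = 0
            if ack > self.snd_una:
                self.snd_una = ack   # process acknowledgement

sender = FastRetransmitSender(snd_una=1000)
for ack in (2000, 2000, 2000, 2000):
    sender.on_ack(ack)
print(sender.retransmitted)   # [2000]
```

The first acknowledgement advances snd_una to 2000 and the next three are duplicates, so the segment starting at sequence number 2000 is retransmitted once.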
The figure below illustrates the operation of the fast retransmit heuristic.
 a non-SYN segment was received for a non-existing TCP connection RFC 793
 by extension, some implementations respond with an RST segment to a segment that is received on an
existing connection but with an invalid header RFC 3360. This causes the corresponding connection to be
closed and has caused security attacks RFC 4953
 by extension, some implementations send an RST segment when they need to close an existing TCP connection (e.g. because there are not enough resources to support this connection or because the remote host
is considered to be unreachable). Measurements have shown that this usage of TCP RST is widespread
[AW05]
When an RST segment is sent by a TCP entity, it should contain the current value of the sequence number for the
connection (or 0 if it does not belong to any existing connection) and the acknowledgement number should be set
to the next expected in-sequence sequence number on this connection.
Note: TCP RST wars
TCP implementers should ensure that two TCP entities never enter a TCP RST war where host A is sending an RST
segment in response to a previous RST segment that was sent by host B in response to a TCP RST segment sent by
host A ... To avoid such an infinite exchange of RST segments that do not carry data, a TCP entity is never allowed
to send an RST segment in response to another RST segment.
The normal way of terminating a TCP connection is by using the graceful TCP connection release. This mechanism uses the FIN flag of the TCP header and allows each host to release its own direction of data transfer. As for
the SYN flag, the utilisation of the FIN flag in the TCP header consumes one sequence number. The figure FSM
for TCP connection release shows the part of the TCP FSM used when a TCP connection is released.
an acknowledgement of its FIN segment (i.e. sequence number $(x + 1) \bmod 2^{32}$, where x is the sequence number of the FIN segment), but may receive a FIN segment
sent by the remote host. In the first case, the TCP connection enters the FIN_WAIT2 state. In this state, new data
segments from the remote host are still accepted until the reception of the FIN segment. The acknowledgement
for this FIN segment is sent once all data received before the FIN segment have been delivered to the user and
the connection enters the TIME_WAIT state. In the second case, a FIN segment is received and the connection
enters the Closing state once all data received from the remote host have been delivered to the user. In this state,
no new data segments can be sent and the host waits for an acknowledgement of its FIN segment before entering
the TIME_WAIT state.
The TIME_WAIT state is different from the other states of the TCP FSM. A TCP entity enters this state after
having sent the last ACK segment on a TCP connection. This segment indicates to the remote host that all the
data that it has sent have been correctly received and that it can safely release the TCP connection and discard
the corresponding TCB. After having sent the last ACK segment, a TCP connection enters the TIME_WAIT state and
remains in this state for 2 * MSL seconds, where MSL is the Maximum Segment Lifetime. During this period, the
TCB of the connection is maintained. This
ensures that the TCP entity that sent the last ACK maintains enough state to be able to retransmit this segment
if this ACK segment is lost and the remote host retransmits its last FIN segment or another one. The delay of
2 * MSL seconds ensures that any duplicate segments on the connection would be handled correctly without
causing the transmission of an RST segment. Without the TIME_WAIT state and the 2 * MSL seconds delay, the
connection release would not be graceful when the last ACK segment is lost.
Note: TIME_WAIT on busy TCP servers
The 2 * MSL seconds delay in the TIME_WAIT state is an important operational problem on servers having
thousands of simultaneously opened TCP connections [FTY99]. Consider for example a busy web server that
processes 10,000 TCP connections every second. If each of these connections remains in the TIME_WAIT state
for 4 minutes, this implies that the server would have to maintain more than 2 million TCBs at any time. For this
reason, some TCP implementations prefer to perform an abrupt connection release by sending an RST segment to
close the connection [AW05] and immediately discard the corresponding TCB. However, if the RST segment is
lost, the remote host continues to maintain a TCB for a connection that no longer exists. This optimisation reduces the
number of TCBs maintained by the host sending the RST segment but at the potential cost of increased processing
on the remote host when the RST segment is lost.
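The estimate of more than 2 million TCBs follows from multiplying the arrival rate of the connections by the time each TCB is kept. The short sketch below redoes this computation with the figures of the example; the function name is hypothetical and the model assumes a steady arrival rate.

```python
def tcbs_in_time_wait(connections_per_second, time_wait_seconds):
    """Steady-state number of TCBs kept only because of the TIME_WAIT state."""
    return connections_per_second * time_wait_seconds

print(tcbs_in_time_wait(10_000, 4 * 60))   # 2400000
```

With a shorter TIME_WAIT duration, the same server would maintain proportionally fewer TCBs.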
a message-mode service will require a lot of effort. It seems unlikely as of this writing to expect old applications
to be rewritten to fully support SCTP and use it. However, some new applications are considering using SCTP
instead of TCP. Voice over IP signaling protocols are a frequently cited example. The Real-Time Communication in Web-browsers working group is also considering the utilization of SCTP for some specific data channels
[JLT2013]. From a service viewpoint, a second advantage of SCTP compared to TCP is its ability to support
several simultaneous streams. Consider a web application that needs to retrieve five objects from a remote server.
With TCP, one possibility is to open one TCP connection for each object, send a request over each connection and
retrieve one object per connection. This is the solution used by HTTP/1.0 as explained earlier. The drawback of
this approach is that the application needs to maintain several concurrent TCP connections. Another solution is
possible with HTTP/1.1 [NGB+1997]. With HTTP/1.1, the client can use pipelining to send several HTTP Requests
without waiting for the answer of each request. The server replies to these requests in sequence, one after
the other, and this may lead to head-of-line blocking problems.
Consider that the objects have different sizes. The first object is a large 10 MBytes image while the other objects are
small javascript files. In this case, delivering the objects in sequence will cause a very long delay for the javascript
files since they will only be transmitted once the large image has been sent.
With SCTP, head-of-line blocking can be mitigated. SCTP can open a single connection and divide it into five logical
streams so that the five objects are sent in parallel over the single connection. SCTP controls the transmission of
the segments over the connection and ensures that the data is delivered efficiently to the application. In the example
above, the small JavaScript files could be delivered as independent messages before the large image.
Another extension to SCTP, defined in RFC 3758, supports partially-reliable delivery. With this extension, an SCTP sender
can be instructed to expire data based on one of several events, such as a timeout. The sender can then signal the SCTP
receiver to move on without waiting for the expired data. This partially reliable service could be useful to provide
timed delivery, for example. With this service, there is an upper limit on the time required to deliver a message to
the receiver. If the transport layer cannot deliver the data within the specified delay, the data is discarded by the
sender without causing any stall in the stream.
the user is itself a chunk. The SCTP chunks are a good example of a protocol format that can be easily extended.
Each chunk is encoded as four fields shown in the figure below.
[Figure: beginning of the SCTP four-way handshake: INIT (ITag=1234), then INIT-ACK (cookie, ITag=5678)]
The first segment contains the INIT chunk. To establish an SCTP connection with a server, the client first creates
some local state for this connection. The most important parameter of the INIT chunk is the Initiation tag. This
value is a random number that is used to identify the connection on the client host for its entire lifetime. This
Initiation tag is placed as the Verification tag in all segments sent by the server. This is an important change
compared to TCP where only the source and destination ports are used to identify a given connection. The INIT
chunk may also contain the other addresses owned by the client. The server responds by sending an INIT-ACK
chunk. This chunk also contains an Initiation tag chosen by the server and a copy of the Initiation tag chosen by
the client. The INIT and INIT-ACK chunks also contain an initial sequence number. A key difference between
TCP's three-way handshake and SCTP's four-way handshake is that an SCTP server does not create any state
when receiving an INIT chunk. Instead, the server places inside the INIT-ACK reply a State cookie chunk.
This State cookie is an opaque block of data that contains information computed from the INIT and INIT-ACK
chunks that the server would otherwise have had to store locally, some lifetime information and a signature. The format of
the State cookie is flexible and the server could in theory place almost any information inside this chunk. The
only requirement is that the State cookie must be echoed back by the client to confirm the establishment of the
connection. Upon reception of the COOKIE-ECHO chunk, the server verifies the signature of the State cookie.
The client may provide some user data and an initial sequence number inside the COOKIE-ECHO chunk. The
server then responds with a COOKIE-ACK chunk that acknowledges the COOKIE-ECHO chunk. The SCTP
connection between the client and the server is now established. This four-way handshake is both more secure
and more flexible than the three-way handshake used by TCP. The detailed formats of the INIT, INIT-ACK,
COOKIE-ECHO and COOKIE-ACK chunks may be found in RFC 4960.
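The stateless behaviour of the server can be illustrated with a minimal Python sketch. It is not the cookie format of RFC 4960: the field layout, the `SECRET` key, the `COOKIE_LIFETIME` value and the use of HMAC-SHA256 as the signature are all assumptions chosen for illustration. The point is only that everything the server needs is carried in the cookie and checked when the cookie is echoed.

```python
import hashlib
import hmac
import struct

SECRET = b"server-local-secret"   # hypothetical key, known only to the server
COOKIE_LIFETIME = 60              # seconds, illustrative value


def make_cookie(client_tag: int, server_tag: int, now: float) -> bytes:
    """Pack the state the server would otherwise store, followed by a signature."""
    body = struct.pack("!IId", client_tag, server_tag, now)
    signature = hmac.new(SECRET, body, hashlib.sha256).digest()
    return body + signature


def check_cookie(cookie: bytes, now: float):
    """Return (client_tag, server_tag) if the cookie is authentic and fresh, else None."""
    body, signature = cookie[:-32], cookie[-32:]
    expected = hmac.new(SECRET, body, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None                       # forged or corrupted cookie
    client_tag, server_tag, created = struct.unpack("!IId", body)
    if now - created > COOKIE_LIFETIME:
        return None                       # stale cookie
    return client_tag, server_tag
```

With this design the server keeps nothing between the INIT and the COOKIE-ECHO: `make_cookie` would be called when building the INIT-ACK and `check_cookie` when the COOKIE-ECHO arrives.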
[Figure: SCTP connection release: SHUTDOWN(TSN=last), SHUTDOWN-ACK, SHUTDOWN-COMPLETE]
Note that in contrast with TCP's four-way handshake, the utilisation of a three-way handshake to close an SCTP
connection implies that the client (resp. server) may close the connection when the application at the other end
still has some data to transmit. Upon reception of the SHUTDOWN chunk, an SCTP entity must stop accepting new
data from the application, but it still needs to retransmit the unacknowledged data chunks (the SHUTDOWN chunk
may be placed in the same segment as a Sack chunk that indicates gaps in the received chunks).
SCTP also provides the equivalent of TCP's RST segment. The ABORT chunk can be used to refuse a connection,
react to the reception of an invalid segment or immediately close a connection (e.g. due to lack of resources).
by the receiver.
TCP's congestion control scheme is based on a congestion window. The current value of the congestion window
(cwnd) is stored in the TCB of each TCP connection and the window that can be used by the sender is constrained
by $\min(cwnd, rwin, swin)$ where $swin$ is the current sending window and $rwin$ the last received receive window. The Additive Increase part of the TCP congestion control increments the congestion window by MSS bytes
every round-trip-time. In the TCP literature, this phase is often called the congestion avoidance phase. The Multiplicative Decrease part of the TCP congestion control divides the current value of the congestion window once
congestion has been detected.
When a TCP connection begins, the sending host does not know whether the part of the network that it uses
to reach the destination is congested or not. To avoid causing too much congestion, it must start with a small
congestion window. [Jacobson1988] recommends an initial window of MSS bytes. As the additive increase part
of the TCP congestion control scheme increments the congestion window by MSS bytes every round-trip-time,
the TCP connection may have to wait many round-trip-times before being able to efficiently use the available
bandwidth. This is especially important in environments where the $bandwidth \times rtt$ product is high. To avoid
waiting too many round-trip-times before reaching a congestion window that is large enough to efficiently utilise
the network, the TCP congestion control scheme includes the slow-start algorithm. The objective of the TCP
slow-start phase is to quickly reach an acceptable value for the cwnd. During slow-start, the congestion window
is doubled every round-trip-time. The slow-start algorithm uses an additional variable in the TCB : ssthresh (slow-start threshold). The ssthresh is an estimation of the last value of the cwnd that did not cause congestion. It is
initialised at the sending window and is updated after each congestion event.
A key question that must be answered by any congestion control scheme is how congestion is detected. The
first implementations of the TCP congestion control scheme opted for a simple and pragmatic approach : packet
losses indicate congestion. If the network is congested, router buffers are full and packets are discarded. In
wired networks, packet losses are mainly caused by congestion. In wireless networks, packets can be lost due to
transmission errors and for other reasons that are independent of congestion. TCP already detects segment losses
to ensure a reliable delivery. The TCP congestion control scheme distinguishes between two types of congestion :
• mild congestion. TCP considers that the network is lightly congested if it receives three duplicate acknowledgements and performs a fast retransmit. If the fast retransmit is successful, this implies that only one
segment has been lost. In this case, TCP performs multiplicative decrease and the congestion window is
divided by 2. The slow-start threshold is set to the new value of the congestion window.
• severe congestion. TCP considers that the network is severely congested when its retransmission timer
expires. In this case, TCP retransmits the first segment and sets the slow-start threshold to 50% of the congestion
window. The congestion window is reset to its initial value and TCP performs a slow-start.
The figure below illustrates the evolution of the congestion window when there is severe congestion. At the
beginning of the connection, the sender performs slow-start until the first segments are lost and the retransmission
timer expires. At this time, the ssthresh is set to half of the current congestion window and the congestion window
is reset to one segment. The lost segments are retransmitted as the sender again performs slow-start until the
congestion window reaches the ssthresh. It then switches to congestion avoidance and the congestion window
increases linearly until segments are lost and the retransmission timer expires ...
Figure 3.37: Evolution of the TCP congestion window with severe congestion
The figure below illustrates the evolution of the congestion window when the network is lightly congested and
all lost segments can be retransmitted using fast retransmit. The sender begins with a slow-start. A segment is
lost but successfully retransmitted by a fast retransmit. The congestion window is divided by 2 and the sender
immediately enters congestion avoidance as this was a mild congestion.
Figure 3.38: Evolution of the TCP congestion window when the network is lightly congested
Most TCP implementations update the congestion window when they receive an acknowledgement. If we assume
that the receiver acknowledges each received segment and the sender only sends MSS sized segments, the TCP
congestion control scheme can be implemented using the simplified pseudo-code 27 below.
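# Initialization
cwnd = MSS  # congestion window in bytes
ssthresh = swin  # slow-start threshold in bytes
# Ack arrival
if tcp.ack > snd.una:  # new ack, no congestion
    if cwnd < ssthresh:
        # slow-start: cwnd doubles every round-trip-time
        cwnd = cwnd + MSS
    else:
        # congestion avoidance: cwnd grows by MSS every round-trip-time
        cwnd = cwnd + MSS * (MSS / cwnd)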
27 In this pseudo-code, we assume that TCP uses unlimited sequence and acknowledgement numbers. Furthermore, we do not detail how
the cwnd is adjusted after the retransmission of the lost segment by fast retransmit. Additional details may be found in RFC 5681.
Furthermore, when a TCP connection has been idle for more than its current retransmission timer, it should reset its
congestion window to the congestion window size that it uses when the connection begins, as it no longer knows
the current congestion state of the network.
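The rules described in this section (slow-start, congestion avoidance, reaction to three duplicate acknowledgements, reaction to the expiration of the retransmission timer and reset after an idle period) can be gathered in a small Python sketch. It is a simplification under the same assumptions as the pseudo-code: one acknowledgement per MSS-sized segment and unlimited sequence numbers. The class and method names are hypothetical, the value of `MSS` is illustrative, and the lower bound of two segments on ssthresh is taken from RFC 5681 rather than from the text above.

```python
MSS = 1460  # bytes, illustrative value


class CongestionState:
    """Congestion variables kept in the TCB of one connection."""

    def __init__(self, swin: int):
        self.cwnd = MSS          # congestion window in bytes
        self.ssthresh = swin     # slow-start threshold in bytes
        self.dupacks = 0         # consecutive duplicate acknowledgements

    def on_new_ack(self):
        self.dupacks = 0
        if self.cwnd < self.ssthresh:
            # slow-start: one MSS per ack, so cwnd doubles every round-trip-time
            self.cwnd += MSS
        else:
            # congestion avoidance: about one MSS per round-trip-time
            self.cwnd += MSS * MSS // self.cwnd

    def on_duplicate_ack(self) -> bool:
        """Return True when the third duplicate ack triggers a fast retransmit."""
        self.dupacks += 1
        if self.dupacks == 3:
            # mild congestion: multiplicative decrease
            self.ssthresh = max(self.cwnd // 2, 2 * MSS)
            self.cwnd = self.ssthresh
            return True
        return False

    def on_timeout(self):
        # severe congestion: restart from slow-start
        self.ssthresh = max(self.cwnd // 2, 2 * MSS)
        self.cwnd = MSS
        self.dupacks = 0

    def on_idle(self):
        # idle for longer than the retransmission timer: forget the old state
        self.cwnd = MSS
```

A caller would invoke `on_new_ack` or `on_duplicate_ack` for each received acknowledgement and retransmit the segment at snd.una whenever `on_duplicate_ack` returns True or the timer expires.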
Note: Initial congestion window
The original TCP congestion control mechanism proposed in [Jacobson1988] recommended that each TCP connection should begin by setting $cwnd = MSS$. However, in today's higher bandwidth networks, using such a
small initial congestion window severely affects the performance for short TCP connections, such as those used
by web servers. In 2002, RFC 3390 allowed an initial congestion window of about 4 KBytes, which corresponds
to 3 segments in many environments. Recently, researchers from Google proposed to further increase the initial
window up to 15 KBytes [DRC+2010]. The measurements that they collected show that this increase would not
significantly increase congestion but would significantly reduce the latency of short HTTP responses. Unsurprisingly, the chosen initial window corresponds to the average size of an HTTP response from a search engine.
This proposed modification has been adopted as an experimental modification in RFC 6928 and popular TCP
implementations support it.
bit was required to allow the routers to mark the packets they forward during congestion periods. In the IP network
layer, this bit is called the Congestion Experienced (CE) bit and is part of the packet header. However, using a
single bit to mark packets is not sufficient. Consider a simple scenario with two sources, one congested router
and one destination. Assume that the first sender and the destination support ECN, but not the second sender. If
the router is congested it will mark packets from both senders. The first sender will react to the packet markings
by reducing its transmission rate. However since the second sender does not support ECN, it will not react to the
markings. Furthermore, this sender could continue to increase its transmission rate, which would lead to more
packets being marked and the first source would again decrease its transmission rate. In the end, the sources
that implement ECN are penalized compared to the sources that do not implement it. This unfairness issue is a
major hurdle to the wide deployment of ECN on the public Internet 28 . The solution proposed in RFC 3168 to deal with
this problem is to use a second bit in the network packet header. This bit, called the ECN-capable transport (ECT)
bit, indicates whether the packet contains a segment produced by a transport protocol that supports ECN or not.
Transport protocols that support ECN set the ECT bit in all packets. When a router is congested, it first verifies
whether the ECT bit is set. In this case, the CE bit of the packet is set to indicate congestion. Otherwise, the packet
is discarded. This improves the deployability of ECN 29 .
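The decision taken by a congested router can be sketched in a few lines of Python. The `Packet` class and the `forward` function are hypothetical names, and the sketch assumes that the router already knows whether it is congested. How a router detects congestion, and whether it drops every non-ECT packet or only some of them, is outside this sketch.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    ect: bool          # ECN-capable transport bit, set by the sender
    ce: bool = False   # Congestion Experienced bit, set by routers


def forward(packet: Packet, congested: bool):
    """Return the packet to forward, or None if it is discarded."""
    if not congested:
        return packet
    if packet.ect:
        packet.ce = True    # the transport supports ECN: mark instead of dropping
        return packet
    return None             # the transport does not support ECN: discard
```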
The second difficulty is how to allow the receiver to inform the sender of the reception of network packets marked
with the CE bit. In reliable transport protocols like TCP and SCTP, the acknowledgements can be used to provide
this feedback. For TCP, two options were possible : change some bits in the TCP segment header or define a new
TCP option to carry this information. The designers of ECN opted for reusing spare bits in the TCP header. More
precisely, two TCP flags have been added in the TCP header to support ECN. The ECN-Echo (ECE) flag is set in the
acknowledgements when the CE bit was set in packets received on the forward path.
[Figure: time-sequence diagram between client, router and server. Segments shown, in order: data[seq=1,ECT=1,CE=0], data[seq=1,ECT=1,CE=1], ack=2,ECE=1, data[seq=x,ack=2,ECE=0,ECT=1,CE=0]]
To solve this problem, RFC 3168 uses an additional bit in the TCP header : the Congestion Window Reduced
(CWR) bit.
[Figure: time-sequence diagram between client, router and server. Segments shown, in order: data[seq=1,ECT=1,CE=0], data[seq=1,ECT=1,CE=1], ack=2,ECE=1, data[seq=x,ack=2,ECE=1,ECT=1,CE=0], data[seq=1,ECT=1,CE=0,CWR=1], data[seq=1,ECT=1,CE=1,CWR=1]]
The CWR bit of the TCP header provides some form of acknowledgement for the ECE bit. When a TCP receiver
detects a packet marked with the CE bit, it sets the ECE bit in all segments that it returns to the sender. Upon
reception of an acknowledgement with the ECE bit set, the sender reduces its congestion window to reflect a mild
congestion and sets the CWR bit. This bit remains set as long as the segments received contain the ECE bit set.
A sender should only react once per round-trip-time to marked packets.
SCTP uses a different approach to inform the sender once congestion has been detected. Instead of using one bit
to carry the congestion notification from the receiver to the sender, SCTP defines an entire ECN Echo chunk for
this. This chunk contains the lowest TSN that was received in a packet with the CE bit set and the number of
marked packets received. The SCTP CWR chunk is used to acknowledge the reception of an ECN Echo chunk. It
echoes the lowest TSN placed in the ECN Echo chunk.
The last point that needs to be discussed about Explicit Congestion Notification is the algorithm that is used by
routers to detect congestion. On a router, congestion manifests itself by the number of packets that are stored
inside the router buffers. As explained earlier, we need to distinguish between two types of routers :
• routers that have a single FIFO queue
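the cycle starts again. If the congestion window is measured in MSS-sized segments, a cycle lasts $\frac{W}{2}$ round-trip-times. The bandwidth of the TCP connection is the number of bytes that have been transmitted during a given
period of time. During a cycle, the number of segments that are sent on the TCP connection is equal to the area of
the yellow trapeze in the figure. Its area is thus :
$$area = \left(\frac{W}{2}\right)^2 + \frac{1}{2} \times \left(\frac{W}{2}\right)^2 = \frac{3 \times W^2}{8}$$
With one segment lost per cycle and a segment loss ratio $p$, the number of segments sent during a cycle is also equal to $\frac{1}{p}$, so that $W = \sqrt{\frac{8}{3 \times p}}$. The throughput (in bytes per second) of the TCP connection is the number of bytes sent during a cycle divided by the duration of the cycle :
$$Throughput = \frac{area \times MSS}{time} = \frac{\frac{3 \times W^2}{8} \times MSS}{\frac{W}{2} \times rtt}$$
or, after having eliminated W,
$$Throughput = \sqrt{\frac{3}{2}} \times \frac{MSS}{rtt \times \sqrt{p}}$$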
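More detailed models and the analysis of simulations have shown that a first order model of the TCP throughput is :
$$Throughput < \min\left(\frac{window}{rtt}, \frac{k \times MSS}{rtt \times \sqrt{p}}\right)$$
where $k$ is a constant, $window$ the maximum window size and $p$ the segment loss ratio.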
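As a numerical illustration, the bound $Throughput < \min(\frac{window}{rtt}, \frac{k \times MSS}{rtt \times \sqrt{p}})$ can be evaluated with a short Python function. The function name and its parameters are hypothetical, and the default $k = \sqrt{3/2}$ is the constant obtained from the simple cycle model above. More detailed models use other constants.

```python
from math import sqrt


def tcp_throughput_bound(window: int, mss: int, rtt: float, p: float,
                         k: float = sqrt(3 / 2)) -> float:
    """Upper bound on the throughput of a TCP connection, in bytes per second.

    window: maximum window size in bytes
    mss:    maximum segment size in bytes
    rtt:    round-trip-time in seconds
    p:      segment loss ratio (0 means no losses)
    """
    window_limit = window / rtt
    if p <= 0:
        return window_limit           # without losses only the window matters
    loss_limit = k * mss / (rtt * sqrt(p))
    return min(window_limit, loss_limit)
```

For example, with a 65535 bytes window, an MSS of 1460 bytes, a round-trip-time of 100 milliseconds and 1% of losses, the loss term is the smaller of the two and the connection cannot exceed roughly 179 KBytes per second.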
The main objective of the network layer is to allow endsystems, connected to different networks, to exchange
information through intermediate systems called routers. The unit of information in the network layer is called a
packet.
3.11.1 IP version 6
In the late 1980s and early 1990s the growth of the Internet was causing several operational problems on routers.
Many of these routers had a single CPU and up to 1 MByte of RAM to store their operating system, packet buffers
and routing tables. Given the rate of allocation of IPv4 prefixes to companies and universities willing to join the
Internet, the routing tables were growing very quickly and some feared that all IPv4 prefixes would quickly be
allocated. In 1987, a study cited in RFC 1752, estimated that there would be 100,000 networks in the near future.
In August 1990, estimates indicated that the class B space would be exhausted by March 1994. Two types of
solution were developed to solve this problem. The first short-term solution was the introduction of Classless Inter
Domain Routing (CIDR). A second short-term solution was the Network Address Translation (NAT) mechanism,
defined in RFC 1631. NAT allows multiple hosts to share a single public IP address. It is explained in section
Middleboxes.
However, in parallel with these short-term solutions, which have allowed the IPv4 Internet to continue to be usable
until now, the Internet Engineering Task Force started to work on developing a replacement for IPv4. This work
started with an open call for proposals, outlined in RFC 1550. Several groups responded to this call with proposals
for a next generation Internet Protocol (IPng) :
• TUBA proposed in RFC 1347 and RFC 1561
• PIP proposed in RFC 1621
• SIPP proposed in RFC 1710
The IETF decided to pursue the development of IPng based on the SIPP proposal. As IP version 5 was already
used by the experimental ST-2 protocol defined in RFC 1819, the successor of IP version 4 is IP version 6. The
initial IP version 6 defined in RFC 1752 was designed based on the following assumptions :
• IPv6 addresses are encoded as a 128 bits field
• The IPv6 header has a simple format that can easily be parsed by hardware devices
• A host should be able to configure its IPv6 address automatically
• Security must be part of IPv6
Note: The IPng address size
When the work on IPng started, it was clear that 32 bits was too small to encode an IPng address and all proposals
used longer addresses. However, there were many discussions about the most suitable address length. A first
approach, proposed by SIPP in RFC 1710, was to use 64 bit addresses. A 64 bit address space was 4 billion times
larger than the IPv4 address space and, furthermore, from an implementation perspective, 64 bit CPUs were being
considered and 64 bit addresses would naturally fit inside their registers. Another approach was to use an existing
address format. This was the TUBA proposal (RFC 1347) that reuses the ISO CLNP 20 byte addresses. The
20 byte addresses provided room for growth, but using ISO CLNP was not favored by the IETF, partially due to
political reasons, despite the fact that mature CLNP implementations were already available. 128 bits appeared to
be a reasonable compromise at that time.
The drawback of PA addresses is that when a company using a PA address block changes its provider, it needs to
change all the addresses that it uses. This can be a nightmare from an operational perspective and many companies
are lobbying to obtain PI address blocks even if they are small and connected to a single provider. The typical sizes
of the IPv6 address blocks are :
• /32 for an Internet Service Provider
• /48 for a single company
• /56 for small user sites
• /64 for a single user (e.g. a home user connected via ADSL)
• /128 in the rare case when it is known that no more than one endhost will be attached
There is one difficulty with the utilisation of these IPv6 prefixes. Consider Belnet, the Belgian research ISP
that has been allocated the 2001:6a8::/32 prefix. Universities are connected to Belnet. UCL uses prefix
2001:6a8:3080::/48 while the University of Liege (ULg) uses 2001:6a8:2d80::/48. A commercial ISP
uses prefix 2a02:2788::/32. Both Belnet and the commercial ISP are connected to the global Internet.
[Figure: example topology with Belnet (2001:6a8::/32), ULg (2001:6a8:2d80::/48), ISP1 (2a02:2788::/32), UCL (2001:6a8:3080::/48) and alpha.com]
The Belnet network advertises prefix 2001:6a8::/32 that includes the prefixes from both UCL and ULg.
These two subnetworks can be easily reached from any internet connected host. After a few years, UCL decides
to increase the redundancy of its Internet connectivity and buys transit service from ISP1. A direct link between
UCL and the commercial ISP appears on the network and UCL expects to receive packets from both Belnet and
the commercial ISP.
Now, consider how a router inside alpha.com would reach a host in the UCL network. This router has two
routes towards 2001:6a8:3080::1. The first one, for prefix 2001:6a8:3080::/48 is via the direct link
between the commercial ISP and UCL. The second one, for prefix 2001:6a8::/32 is via the Internet and
Belnet. Since RFC 1519, when a router knows several routes towards the same destination address, it must
forward packets along the route having the longest prefix length. In the case of 2001:6a8:3080::1, this is
the route 2001:6a8:3080::/48 that is used to forward the packet. This forwarding rule is called the longest
prefix match or the more specific match. All IP routers implement this forwarding rule.
To understand the longest prefix match forwarding, consider the IPv6 routing table below.
Destination                 Gateway
::/0                        fe80::dead:beef
::1                         ::1
2a02:2788:2c4:16f::/64      eth0
2001:6a8:3080::/48          fe80::bad:cafe
2001:6a8:2d80::/48          fe80::bad:bad
2001:6a8::/32               fe80::aaaa:bbbb
With the longest match rule, the route ::/0 plays a particular role. As this route has a prefix length of 0 bits, it
matches all destination addresses. This route is often called the default route.
• a packet with destination 2a02:2788:2c4:16f::1 received by router R is destined to a host on interface
eth0.
• a packet with destination 2001:6a8:3080::1234 matches three routes : ::/0, 2001:6a8::/32
and 2001:6a8:3080::/48. The packet is forwarded via gateway fe80::bad:cafe
• a packet with destination 2001:1890:123a::1:1e matches one route : ::/0. The packet is forwarded
via fe80::dead:beef
• a packet with destination 2001:6a8:3880:40::2 matches two routes : 2001:6a8::/32 and ::/0. The
packet is forwarded via fe80::aaaa:bbbb
The longest prefix match can be implemented by using different data structures. One possibility is to use a trie.
Details on how to implement efficient packet forwarding algorithms may be found in [Varghese2005].
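The longest prefix match rule itself can be illustrated with a short Python sketch that relies on the standard ipaddress module. It uses the routing table of the example and a simple linear scan instead of a trie, which is enough to show the rule but would be far too slow for a real router. The names `ROUTES`, `TABLE` and `lookup` are hypothetical.

```python
import ipaddress

# (destination prefix, gateway) pairs from the example routing table
ROUTES = [
    ("::/0", "fe80::dead:beef"),
    ("::1/128", "::1"),
    ("2a02:2788:2c4:16f::/64", "eth0"),
    ("2001:6a8:3080::/48", "fe80::bad:cafe"),
    ("2001:6a8:2d80::/48", "fe80::bad:bad"),
    ("2001:6a8::/32", "fe80::aaaa:bbbb"),
]
TABLE = [(ipaddress.ip_network(prefix), gateway) for prefix, gateway in ROUTES]


def lookup(destination: str) -> str:
    """Return the gateway of the matching route with the longest prefix."""
    address = ipaddress.ip_address(destination)
    matches = [(network, gateway) for network, gateway in TABLE
               if address in network]
    # ::/0 matches every address, so the list is never empty
    best_network, best_gateway = max(matches, key=lambda m: m[0].prefixlen)
    return best_gateway
```

Calling `lookup("2001:6a8:3080::1234")` selects the /48 route among the three matching routes, as in the second example above.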
For the companies that want to use IPv6 without being connected to the IPv6 Internet, RFC 4193 defines the
Unique Local Unicast (ULA) addresses (fc00::/7). These ULA addresses play a role similar to that of the private
IPv4 addresses defined in RFC 1918. However, the size of the fc00::/7 address block allows ULA to be much
more flexible than private IPv4 addresses.
Furthermore, the IETF has reserved some IPv6 addresses for a special usage. The two most important ones are :
• 0:0:0:0:0:0:0:1 (::1 in compact form) is the IPv6 loopback address. This is the address of a logical
interface that is always up and running on IPv6 enabled hosts.
• 0:0:0:0:0:0:0:0 (:: in compact form) is the unspecified IPv6 address. This is the IPv6 address
that a host can use as source address when trying to acquire an official address.
The last type of unicast IPv6 addresses are the Link Local Unicast addresses. These addresses are part of the
fe80::/10 address block and are defined in RFC 4291. Each host can compute its own link local address by
concatenating the fe80::/64 prefix with the 64 bits identifier of its interface. Link local addresses can be used
when hosts that are attached to the same link (or local area network) need to exchange packets. They are used
notably for address discovery and auto-configuration purposes. Their usage is restricted to each link and a router
cannot forward a packet whose source or destination address is a link local address. Link local addresses have also
been defined for IPv4 in RFC 3927. However, the IPv4 link local addresses are only used when a host cannot obtain
a regular IPv4 address, e.g. on an isolated LAN.
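The computation of a link local address is a simple concatenation, as the following Python sketch shows. The function name is hypothetical and the sketch assumes that the 64 bits interface identifier is already available as an integer. How this identifier is derived from the interface is not covered here.

```python
import ipaddress


def link_local_address(interface_id: int) -> ipaddress.IPv6Address:
    """Concatenate the fe80::/64 prefix with a 64 bits interface identifier."""
    if not 0 <= interface_id < 2**64:
        raise ValueError("the interface identifier must fit in 64 bits")
    prefix = ipaddress.ip_network("fe80::/64")
    # the low order 64 bits of the prefix are zero, so adding fills them in
    return prefix.network_address + interface_id
```

For instance, the identifier 0x021122fffe334455 gives the address fe80::211:22ff:fe33:4455.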
[Figure: example network with routers R1, R2, R3 and R4 and hosts B and C]
Assume that B and D are part of a multicast group. If A sends a multicast packet towards this group, then R1 will
replicate the packet to forward it to R2 and R3. R2 would forward the packet towards B. R3 would forward the
packet towards R4 that would deliver it to D.
Finally, RFC 4291 defines the structure of the IPv6 multicast addresses 31 . This structure is depicted in the figure
below.
• version : a 4 bits field set to 6 and intended to allow IP to evolve in the future if needed
• Traffic class : this 8 bits field indicates the type of service expected by this packet and contains the
CE and ECT flags that are used by Explicit Congestion Notification
• Flow label : this field was initially intended to be used to tag packets belonging to the same flow. A recent
document, RFC 6437, describes some possible usages of this field, but it is too early to tell whether it will
really be used.
• Payload length : this is the size of the packet payload in bytes. As the length is encoded as a 16 bits field,
an IPv6 packet can contain up to 65535 bytes of payload.
• Hop Limit : this 8 bits field indicates the number of routers that can forward the packet. It is decremented
by one by each router and prevents packets from looping forever inside the network.
• Next Header : this 8 bits field indicates the type 32 of header that follows the IPv6 header. It can be a
transport layer header (e.g. 6 for TCP or 17 for UDP) or an IPv6 option.
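The fields listed above can be extracted from a received packet with a few lines of Python. The sketch below assumes the standard 40 bytes IPv6 header layout: the first 32 bits word contains the version, the Traffic class and a 20 bits Flow label, and the 128 bits source and destination addresses follow the Hop Limit. The function name and the dictionary keys are hypothetical.

```python
import ipaddress
import struct


def parse_ipv6_header(data: bytes) -> dict:
    """Decode the fixed 40 bytes IPv6 header found at the start of data."""
    if len(data) < 40:
        raise ValueError("an IPv6 header is 40 bytes long")
    # network byte order: 32 bits word, 16 bits length, two 8 bits fields
    first_word, payload_length, next_header, hop_limit = struct.unpack(
        "!IHBB", data[:8])
    return {
        "version": first_word >> 28,                  # 4 bits
        "traffic_class": (first_word >> 20) & 0xFF,   # 8 bits
        "flow_label": first_word & 0xFFFFF,           # 20 bits
        "payload_length": payload_length,
        "next_header": next_header,
        "hop_limit": hop_limit,
        "source": ipaddress.IPv6Address(data[8:24]),
        "destination": ipaddress.IPv6Address(data[24:40]),
    }
```

A receiver would typically check that the version is 6 and then use the next_header value to decide how to handle the bytes that follow the header.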
It is interesting to note that there is no checksum inside the IPv6 header. This is mainly because all datalink layers
and transport protocols include a checksum or a CRC to protect their frames/segments against transmission errors.
Adding a checksum in the IPv6 header would have forced each router to recompute the checksum of all packets,
with limited benefit in detecting errors. In practice, an IP checksum allows for catching errors that occur inside
routers (e.g. due to memory corruption) before the packet reaches its destination. However, this benefit was found
to be too small given the reliability of current memories and the cost of computing the checksum on each router 33 .
When a host receives an IPv6 packet, it needs to determine which transport protocol (UDP, TCP, SCTP, ...) needs
to handle the payload of the packet. This is the first role of the Next header field. The IANA, which manages the
allocation of Internet resources and protocol parameters, maintains an official list of transport protocols 2 . The
following protocol numbers are reserved :
• TCP uses Next Header number 6
• UDP uses Next Header number 17
• SCTP uses Next Header number 132
For example, an IPv6 packet that contains an SCTP segment would appear as shown in the figure below. However,
the Next header has broader usages than simply indicating the transport protocol which is responsible for the
packet payload. An IPv6 packet can contain a chain of headers and the last one indicates the transport protocol
that is responsible for the packet payload. Supporting a chain of headers is a clever design from an extensibility
viewpoint. As we will see, this chain of headers has several usages.
32 The IANA maintains the list of all allocated Next Header types at http://www.iana.org/assignments/protocol-numbers/
33 When IPv4 was designed, the situation was different. The IPv4 header includes a checksum that only covers the network header. This
checksum is computed by the source and updated by all intermediate routers that decrement the TTL, which is the IPv4 equivalent of the
HopLimit used by IPv6.
RFC 2460 defines several types of IPv6 extension headers that could be added to an IPv6 packet :
• Hop-by-Hop Options header. This option is processed by routers and endhosts.
• Destination Options header. This option is processed only by endhosts.
• Routing header. This option is processed by some nodes.
• Fragment header. This option is processed only by endhosts.
• Authentication header. This option is processed only by endhosts.
• Encapsulating Security Payload. This option is processed only by endhosts.
The last two headers are used to add security above IPv6 and implement IPSec. They are described in RFC 2402
and RFC 2406 and are outside the scope of this document.
The Hop-by-Hop Options header was designed to allow IPv6 to be easily extended. In theory, this option could
be used to define new fields that were not foreseen when IPv6 was designed. It is intended to be processed by
both routers and endhosts. Deploying an extension to a network protocol can be difficult in practice since some
nodes already support the extensions while others still use the old version and do not understand the extension.
To deal with this issue, the IPv6 designers opted for a Type-Length-Value encoding of these IPv6 options. The
Hop-by-Hop Options header is encoded as shown below.
The Fragment Options header is more important. An important problem in the network layer is the ability to
handle heterogeneous datalink layers. Most datalink layer technologies can only transmit and receive frames
that are shorter than a given maximum frame size. Unfortunately, all datalink layer technologies use different
maximum frame sizes.
Each datalink layer has its own characteristics and as indicated earlier, each datalink layer is characterised by
a maximum frame size. From IP's point of view, a datalink layer interface is characterised by its Maximum
Transmission Unit (MTU). The MTU of an interface is the largest packet (including header) that it can send. The
table below provides some common MTU sizes.
Datalink layer    MTU
Ethernet          1500 bytes
WiFi              2272 bytes
ATM (AAL5)        9180 bytes
802.15.4          102 or 81 bytes
Token Ring        4464 bytes
FDDI              4352 bytes
Although IPv6 can send 64 KBytes long packets, few datalink layer technologies that are used today are able to
send a 64 KBytes packet inside a frame. Furthermore, as illustrated in the figure below, another problem is that a
host may send a packet that would be too large for one of the datalink layers used by the intermediate routers.
• the 32 bits Identification field indicates to which original packet a fragment belongs. When a host sends
fragmented packets, it should ensure that it does not reuse the same identification field for packets sent to
the same destination during a period of MSL seconds. This is easier with the 32 bits identification used in
the IPv6 fragmentation header, than with the 16 bits identification field of the IPv4 header.
Some IPv6 implementations send the fragments of a packet in increasing fragment offset order, starting from the
first fragment. Others send the fragments in reverse order, starting from the last fragment. The latter solution can
be advantageous for the host that needs to reassemble the fragments, as it can easily allocate the buffer required to
reassemble all fragments of the packet upon reception of the last fragment. When a host receives the first fragment
of an IPv6 packet, it cannot know a priori the length of the entire IPv6 packet.
The figure below provides an example of a fragmented IPv6 packet containing a UDP segment. The Next Header
type reserved for the IPv6 fragmentation option is 44.
In the above pseudocode, we maintain a single 32 bits counter that is incremented for each packet that needs
to be fragmented. Other implementations to compute the packet identification are possible. RFC 2460 only
requires that two fragmented packets that are sent within the MSL between the same pair of hosts have different
identifications.
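A minimal Python sketch of a sender that fragments a payload with such a single counter is given below. It is only an illustration: the function name and the tuple layout are hypothetical, the construction of the IPv6 and fragment headers is omitted, and the sketch assumes, as in RFC 2460, that the fragment offset is expressed in units of 8 bytes, so that every fragment except the last one carries a multiple of 8 bytes.

```python
import itertools

_identification = itertools.count()   # single counter shared by all packets


def fragment(payload: bytes, max_fragment_payload: int) -> list:
    """Split payload into (identification, offset, more_fragments, data) tuples.

    offset is expressed in units of 8 bytes. The caller is expected to use
    this function only for packets that are too large to be sent unfragmented.
    """
    if max_fragment_payload < 8:
        raise ValueError("a fragment must be able to carry at least 8 bytes")
    # round down to a multiple of 8 bytes
    size = max_fragment_payload - max_fragment_payload % 8
    # one new 32 bits identification per fragmented packet
    identification = next(_identification) & 0xFFFFFFFF
    fragments = []
    for start in range(0, len(payload), size):
        data = payload[start:start + size]
        more_fragments = start + size < len(payload)   # M flag
        fragments.append((identification, start // 8, more_fragments, data))
    return fragments
```

With a 3000 bytes payload and at most 1450 bytes per fragment, this produces three fragments of 1448, 1448 and 104 bytes that share the same identification.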
The fragments of an IPv6 packet may arrive at the destination in any order, as each fragment is forwarded independently in the network and may follow different paths. Furthermore, some fragments may be lost and never
reach the destination.
The reassembly algorithm used by the destination host is roughly as follows. First, the destination can verify
whether a received IPv6 packet is a fragment or not by checking whether it contains a fragment header. If so,
all fragments with the same identification must be reassembled together. The reassembly algorithm relies on
the Identification field of the received fragments to associate a fragment with the corresponding packet being
reassembled. Furthermore, the Fragment Offset field indicates the position of the fragment payload in the original
unfragmented packet. Finally, the fragment with the M flag reset allows the destination to determine the total length
of the original unfragmented packet.
Note that the reassembly algorithm must deal with the unreliability of the IP network. This implies that a fragment
may be duplicated or a fragment may never reach the destination. The destination can easily detect fragment
duplication thanks to the Fragment Offset. To deal with fragment losses, the reassembly algorithm must bound the
time during which the fragments of a packet are stored in its buffer while the packet is being reassembled. This
can be implemented by starting a timer when the first fragment of a packet is received. If the packet has not been
reassembled upon expiration of the timer, all fragments are discarded and the packet is considered to be lost.
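The reassembly steps described above can be sketched as follows. This is a simplified illustration, not the algorithm of a real stack: the class name, the 60-second default timeout and the use of byte offsets (instead of 8-byte units) are assumptions chosen for readability, and overlapping fragments are not handled.

```python
import time


class Reassembler:
    """Toy reassembly buffer keyed by (source, identification)."""

    def __init__(self, timeout=60.0):
        self.timeout = timeout   # illustrative value, in seconds
        self.pending = {}        # key -> {"start", "parts", "total"}

    def receive(self, src, ident, offset, more, data, now=None):
        """Store one fragment; return the payload once complete, else None."""
        now = time.monotonic() if now is None else now
        self.expire(now)
        entry = self.pending.setdefault(
            (src, ident), {"start": now, "parts": {}, "total": None})
        entry["parts"].setdefault(offset, data)   # duplicates are ignored
        if not more:                              # M flag reset: last fragment
            entry["total"] = offset + len(data)
        if entry["total"] is None:
            return None
        position = 0                              # look for holes
        for off in sorted(entry["parts"]):
            if off != position:
                return None
            position += len(entry["parts"][off])
        if position != entry["total"]:
            return None
        del self.pending[(src, ident)]
        return b"".join(entry["parts"][o] for o in sorted(entry["parts"]))

    def expire(self, now):
        """Discard packets whose first fragment arrived too long ago."""
        stale = [k for k, e in self.pending.items()
                 if now - e["start"] > self.timeout]
        for key in stale:
            del self.pending[key]
```

The timer starts when the first fragment of a packet is stored, and fragments may be passed to `receive` in any order.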
Note: Header compression on low bandwidth links
Given its size, the IPv6 header can cause significant overhead on low bandwidth links, especially when small
packets are exchanged, such as for Voice over IP applications. In such environments, several techniques can be
used to reduce the overhead. A first solution is to use data compression in the datalink layer to compress all the
information exchanged [Thomborson1992]. These techniques are similar to the data compression algorithms used
in tools such as compress(1) or gzip(1) RFC 1951. They compress streams of bits without taking advantage
of the fact that these streams contain IP packets with a known structure. A second solution is to compress the IP
and TCP headers. These header compression techniques, such as the one defined in RFC 5795, take advantage of
the redundancy found in successive packets from the same flow to significantly reduce the size of the protocol
headers. Another solution is to define a compressed encoding of the IPv6 header that matches the capabilities of
the underlying datalink layer RFC 4944.
The last type of IPv6 header extension is the Routing header. The type 0 routing header defined in RFC 2460
is an example of an IPv6 option that must be processed by some routers. This option is encoded as shown below.
The type 0 routing option was intended to allow a host to indicate a loose source route that should be followed by
a packet by specifying the addresses of some of the routers that must forward this packet. Unfortunately, further
work with this routing header, including an entertaining demonstration with scapy [BE2007], revealed severe
security problems. For this reason, loose source routing with the type 0 routing header
has been removed from the IPv6 specification RFC 5095.
 1 [Destination Unreachable. Such an ICMPv6 message is sent when the destination address of a packet
is unreachable. The code field of the ICMP header contains additional information about the type of
unreachability. The following codes are specified in RFC 4443]
 0 : No route to destination. This indicates that the router that sent the ICMPv6 message did not
have a route towards the packet's destination.
 1 : Communication with destination administratively prohibited. This indicates that a firewall has
refused to forward the packet towards its final destination.
 2 : Beyond scope of source address. This message can be sent if the source is using link-local
addresses to reach a global unicast address outside its subnet.
 3 : Address unreachable. This message indicates that the packet reached the subnet of the destination, but the host that owns this destination address cannot be reached.
 4 : Port unreachable. This message indicates that the IPv6 packet was received by the destination,
but there was no application listening to the specified port.
 2 : Packet Too Big. The router that was to send the ICMPv6 message received an IPv6 packet that is larger
than the MTU of the outgoing link. The ICMPv6 message contains the MTU of this link in bytes. This
allows the sending host to implement Path MTU Discovery RFC 1981.
 3 : Time Exceeded. This error message can be sent either by a router or by a host. A router would set code
to 0 to report the reception of a packet whose Hop Limit reached 0. A host would set code to 1 to report that
it was unable to reassemble received IPv6 fragments.
 4 : Parameter Problem. This ICMPv6 message is used to report either the reception of an IPv6 packet with
an erroneous header field (code 0) or an unknown Next Header or IPv6 option (codes 1 and 2). In this case, the
message body contains the erroneous IPv6 packet and the first 32 bits of the message body contain a pointer
to the error.
The Destination Unreachable ICMP error message is returned when a packet cannot be forwarded to its final
destination. The first four ICMPv6 error messages (type 1, codes 0-3) are generated by routers while endhosts
may return code 4 when there is no application bound to the corresponding port number.
The Packet Too Big ICMP messages enable the source endhost to discover the MTU size that it can safely use to
reach a given destination. To understand its operation, consider the (academic) scenario shown in the figure below.
In this figure, the labels on each link represent the maximum packet size supported by this link.
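Figure: A, R1, R2 and R3 connected in a chain. The link labels are, in order: 1500 (A to R1), 1400 (R1 to R2), 1300 (R2 to R3) and 1500 (link beyond R3).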
If A sends a 1500-byte packet, R1 returns an ICMPv6 error message indicating a maximum packet length of
1400 bytes. A then fragments the packet before retransmitting it. The small fragment goes through, but
the large fragment is refused by R2, which returns an ICMPv6 error message. A can then refragment the packet
and send it to the final destination as two fragments.
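The exchange above can be reproduced with a small simulation. The sketch below is only an illustration under simplifying assumptions: the path is a plain list of link MTUs, each router that cannot forward a packet is modelled by returning the MTU of its outgoing link, and the function name `discover_path_mtu` is invented for this example.

```python
def discover_path_mtu(link_mtus, first_guess):
    """Return (path_mtu, errors) for a path described by its link MTUs.

    `link_mtus[0]` is the MTU of the link attached to the source. Each
    entry of `errors` is the MTU reported by one simulated Packet Too Big.
    """
    size = min(first_guess, link_mtus[0])
    errors = []
    while True:
        for mtu in link_mtus[1:]:
            if size > mtu:           # router cannot forward: Packet Too Big
                errors.append(mtu)
                size = mtu           # source lowers its estimate and retries
                break
        else:
            return size, errors      # the packet crossed every link


# Link MTUs of the example path starting at A
print(discover_path_mtu([1500, 1400, 1300, 1500], 1500))
# prints (1300, [1400, 1300])
```

Two error messages are needed in this example because each router only knows the MTU of its own outgoing link.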
In practice, an IPv6 implementation does not store the transmitted packets to be able to retransmit them if needed.
However, since TCP (and SCTP) buffer the segments that they transmit, a similar approach can be used in transport
protocols to detect the maximum MTU on a path towards a given destination. This technique is called Path MTU
Discovery RFC 1981.
When a TCP segment is transported in an IP packet that is fragmented in the network, the loss of a single fragment
forces TCP to retransmit the entire segment (and thus all the fragments). If TCP was able to send only packets
that do not require fragmentation in the network, it could retransmit only the information that was lost in the
network. In addition, IP reassembly causes several challenges at high speed as discussed in RFC 4963. Using IP
fragmentation to allow UDP applications to exchange large messages raises several security issues [KPS2003].
ICMPv6 is used by TCP implementations to discover the largest MTU size that is allowed to reach a destination
host without causing network fragmentation. A TCP implementation parses the Packet Too Big ICMP messages that it receives. These ICMP messages contain the MTU of the router's outgoing link in their Data field.
Upon reception of such an ICMP message, the source TCP implementation adjusts its Maximum Segment Size
(MSS) so that the packets containing the segments that it sends can be forwarded by this router without requiring
fragmentation.
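As a worked example of this adjustment, the sketch below derives an MSS from a reported MTU. It assumes a 40-byte IPv6 header without extension headers and a 20-byte TCP header without options. The function name is hypothetical. The sketch also never goes below the 1280-byte minimum link MTU that IPv6 requires.

```python
IPV6_HEADER_LEN = 40   # fixed IPv6 header, in bytes
TCP_HEADER_LEN = 20    # TCP header without options, in bytes
IPV6_MIN_MTU = 1280    # every IPv6 link must support at least this MTU


def mss_for_mtu(reported_mtu):
    """Largest TCP payload that fits in one unfragmented IPv6 packet."""
    mtu = max(reported_mtu, IPV6_MIN_MTU)   # ignore values below the minimum
    return mtu - IPV6_HEADER_LEN - TCP_HEADER_LEN


print(mss_for_mtu(1400))   # prints 1340
```

TCP options or IPv6 extension headers would reduce the usable payload further.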
Two types of informational ICMPv6 messages are defined in RFC 4443 : echo request and echo reply, which are
used to test the reachability of a destination by using ping6(8). Each host is supposed 34 to reply with an ICMP
Echo reply message when it receives an ICMP Echo request message. A sample usage of ping6(8) is shown
below.
#ping6 www.ietf.org
PING6(56=40+8+8 bytes) 2001:6a8:3080:2:3403:bbf4:edae:afc3 --> 2001:1890:123a::1:1e
16 bytes from 2001:1890:123a::1:1e, icmp_seq=0 hlim=49 time=156.905 ms
16 bytes from 2001:1890:123a::1:1e, icmp_seq=1 hlim=49 time=155.618 ms
16 bytes from 2001:1890:123a::1:1e, icmp_seq=2 hlim=49 time=155.808 ms
16 bytes from 2001:1890:123a::1:1e, icmp_seq=3 hlim=49 time=155.325 ms
16 bytes from 2001:1890:123a::1:1e, icmp_seq=4 hlim=49 time=155.493 ms
16 bytes from 2001:1890:123a::1:1e, icmp_seq=5 hlim=49 time=155.801 ms
16 bytes from 2001:1890:123a::1:1e, icmp_seq=6 hlim=49 time=155.660 ms
16 bytes from 2001:1890:123a::1:1e, icmp_seq=7 hlim=49 time=155.869 ms
^C
--- www.ietf.org ping6 statistics ---
8 packets transmitted, 8 packets received, 0.0% packet loss
round-trip min/avg/max/std-dev = 155.325/155.810/156.905/0.447 ms
Another very useful debugging tool is traceroute6(8). The traceroute man page describes this tool as "print
the route packets take to network host". traceroute uses the Time exceeded ICMP messages to discover the intermediate routers on the path towards a destination. The principle behind traceroute is very simple. When a router
receives an IP packet whose Hop Limit is set to 1, it is forced to return to the sending host a Time exceeded ICMP
message containing the header and the first bytes of the discarded packet. To discover all routers on a network
path, a simple solution is to first send a packet whose Hop Limit is set to 1, then a packet whose Hop Limit is set
to 2, etc. A sample traceroute6 output is shown below.
#traceroute6 www.ietf.org
traceroute6 to www.ietf.org (2001:1890:1112:1::20) from 2001:6a8:3080:2:217:f2ff:fed6:65c0, 30 hop
1 2001:6a8:3080:2::1 13.821 ms 0.301 ms 0.324 ms
2 2001:6a8:3000:8000::1 0.651 ms 0.51 ms 0.495 ms
3 10ge.cr2.bruvil.belnet.net 3.402 ms 3.34 ms 3.33 ms
4 10ge.cr2.brueve.belnet.net 3.668 ms 10ge.cr2.brueve.belnet.net 3.988 ms 10ge.cr2.brueve.beln
5 belnet.rt1.ams.nl.geant2.net 10.598 ms 7.214 ms 10.082 ms
6 so-7-0-0.rt2.cop.dk.geant2.net 20.19 ms 20.002 ms 20.064 ms
7 kbn-ipv6-b1.ipv6.telia.net 21.078 ms 20.868 ms 20.864 ms
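The principle behind traceroute can also be illustrated without sending any packet. The following sketch simulates a path as a list of names (routers followed by the destination). The function names and the simulated path are assumptions made for this example, and the real tool of course sends probes and parses the returned ICMPv6 messages.

```python
def forward(path, hop_limit):
    """Simulate one probe sent with `hop_limit` along `path`.

    `path` lists the routers in order, followed by the destination.
    Returns (replier, reached_destination).
    """
    for node in path[:-1]:
        # A router that receives a packet whose Hop Limit is 1 cannot
        # forward it and answers with a Time exceeded message.
        if hop_limit == 1:
            return node, False
        hop_limit -= 1
    return path[-1], True


def traceroute(path, max_hops=30):
    """Return the nodes discovered by probes with increasing Hop Limits."""
    discovered = []
    for hop_limit in range(1, max_hops + 1):
        node, done = forward(path, hop_limit)
        discovered.append(node)
        if done:
            break
    return discovered


print(traceroute(["R1", "R2", "R3", "destination"]))
# prints ['R1', 'R2', 'R3', 'destination']
```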
34 Until a few years ago, all hosts replied to Echo request ICMP messages. However, due to the security problems that have affected TCP/IP
implementations, many of these implementations can now be configured to disable answering Echo request ICMP messages.
interface operates. In shared media LANs, all devices are attached to the same physical medium and all frames are
delivered to all devices. When such a frame is received by a datalink layer interface, it compares the destination
address with the MAC address of the device. If the two addresses match, or the destination address is the broadcast
address, the frame is destined to the device and its payload is delivered to the network layer protocol. The multicast
service exploits this principle. A multicast address is a logical address. To receive frames destined to a multicast
address in a shared media LAN, a device captures all frames having this multicast address as their destination. All
IPv6 nodes are capable of capturing datalink layer frames destined to different multicast addresses.
Figure: hosts A (MAC : 0023:4567:89ab) and B (MAC : 0034:5678:9abc) attached to the same LAN.
Hosts A and B are attached to the same datalink layer network. They can thus exchange frames by using the MAC
addresses shown in the figure above. To be able to use IPv6 to exchange packets, they need to have an IPv6 address.
One possibility would be to manually configure an IPv6 address on each host. However, IPv6 provides a better solution thanks to the link-local IPv6 addresses. A link-local IPv6 address is composed by concatenating the fe80::/64 prefix with a 64-bit identifier derived from the MAC address of the device. In the example above, host A would use IPv6
link-local address fe80::0223:45FF:FE67:89ab and host B fe80::0234:56FF:FE78:9abc.
With these two IPv6 addresses, the hosts can exchange IPv6 packets.
Note: Converting MAC addresses in host identifiers
Appendix A of RFC 4291 provides the algorithm used to convert a 48-bit MAC address into a 64-bit host
identifier. This algorithm builds upon the structure of the MAC addresses. A MAC address is represented as
shown in the figure below.
MAC addresses are allocated in blocks of 2^20. When a company registers for a block of MAC addresses, it receives
an identifier. This company identifier is then used to populate the c bits of the MAC addresses. The company can
allocate all addresses starting with this prefix and manages the m bits as it wishes.
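The conversion from a 48-bit MAC address to a link-local address can be sketched in a few lines of Python. The function names are chosen for this example. The sketch inverts the universal/local bit of the first byte and inserts the two bytes FF FE in the middle of the MAC address, which is our reading of the algorithm in Appendix A of RFC 4291.

```python
import ipaddress


def mac_to_interface_id(mac):
    """Convert a 48-bit MAC address (e.g. "0023:4567:89ab") to 8 bytes."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError("expected a 48-bit MAC address")
    first = raw[0] ^ 0x02                    # invert the universal/local bit
    return bytes([first]) + raw[1:3] + b"\xff\xfe" + raw[3:]


def link_local_address(mac):
    """Concatenate the fe80::/64 prefix with the derived identifier."""
    prefix = ipaddress.IPv6Network("fe80::/64").network_address.packed[:8]
    return ipaddress.IPv6Address(prefix + mac_to_interface_id(mac))


print(link_local_address("0023:4567:89ab"))   # prints fe80::223:45ff:fe67:89ab
print(link_local_address("0034:5678:9abc"))   # prints fe80::234:56ff:fe78:9abc
```

The `ipaddress` module prints addresses in their compressed lowercase form, which is why the leading zero of each group is omitted.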
36 For simplicity, we assume that each datalink layer interface is assigned a 64-bit MAC address. As we will see later, today's datalink
layer technologies mainly use 48-bit MAC addresses, but the smaller addresses can easily be converted into 64-bit addresses.
Figure: hosts A (MAC : 0023:4567:89ab) and B (MAC : 0034:5678:9abc) and a router (0045:6789:abcd) attached to the same LAN.
Assume that the LAN containing the two hosts and the router is assigned prefix 2001:db8:1234:5678::/64.
A first solution to configure the IPv6 addresses in this network is to assign them manually. A possible assignment
is :
 2001:db8:1234:5678::1 is assigned to router
 2001:db8:1234:5678::AA is assigned to hostA
 2001:db8:1234:5678::BB is assigned to hostB
To be able to exchange IPv6 packets with hostB, hostA needs to know the MAC address of the interface of
hostB on the LAN. This is the address resolution problem. In IPv6, this problem is solved by using the Neighbor
Discovery Protocol (NDP). NDP is specified in RFC 4861. This protocol is part of ICMPv6 and uses the multicast
datalink layer service.
NDP allows a host to discover the MAC address used by any other host attached to the same LAN. NDP operates in
two steps. First, the querier sends a multicast ICMPv6 Neighbor Solicitation message that contains as parameter
the queried IPv6 address. This multicast ICMPv6 NS is placed inside a multicast frame 37. The queried node
receives the frame, parses it and replies with a unicast ICMPv6 Neighbor Advertisement that provides its own
IPv6 and MAC addresses. Upon reception of the Neighbor Advertisement message, the querier stores the mapping
between the IPv6 and the MAC address inside its NDP table. This table is a data structure that maintains a cache
of the recently received Neighbor Advertisements. Thanks to this cache, a host only needs to send a Neighbor
Solicitation message for the first packet that it sends to a given host. After this initial packet, the NDP table can
provide the mapping between the destination IPv6 address and the corresponding MAC address.
37 RFC 4291 and RFC 4861 explain in more detail how the IPv6 multicast address is determined from the target IPv6 unicast address.
These details are outside the scope of this book, but may matter if you try to understand a packet trace.
Figure: on the LAN shared by router, hostA and hostB, hostA sends NS : Who has 2001:db8:1234:5678::BB and hostB answers with NA : 1234:5678:9abc:dede.
The NS message can also be used to verify the reachability of a host in the local subnet. For this usage, NS
messages can be sent in unicast since other nodes on the subnet do not need to process the message.
When an entry in the NDP table times out on a host, it may either be deleted or the host may try to revalidate it by
sending the NS message again.
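The behaviour of the NDP table can be sketched as a small cache with a timeout. This is a simplified model and not the state machine of RFC 4861: the class name, the 30-second default lifetime and the dictionary that stands in for the hosts answering on the LAN are all assumptions of this example.

```python
class NeighborCache:
    """Toy NDP table mapping IPv6 addresses to MAC addresses."""

    def __init__(self, lifetime=30.0):
        self.lifetime = lifetime    # seconds an entry stays valid (assumed)
        self.entries = {}           # IPv6 address -> (MAC, time learned)
        self.solicitations = 0      # number of NS messages sent so far

    def resolve(self, ipv6, lan, now):
        """Return the MAC address for `ipv6`, sending an NS on a cache miss.

        `lan` is a dict standing in for the hosts that would answer the NS.
        """
        cached = self.entries.get(ipv6)
        if cached is not None and now - cached[1] <= self.lifetime:
            return cached[0]
        self.solicitations += 1     # miss or expired entry: send an NS
        mac = lan.get(ipv6)         # the NA, if a host owns this address
        if mac is None:
            self.entries.pop(ipv6, None)
            return None
        self.entries[ipv6] = (mac, now)
        return mac
```

In this sketch an expired entry is revalidated by sending the NS again, which is one of the two options mentioned above.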
This is not the only usage of the Neighbor Solicitation and Neighbor Advertisement messages. They are also
used to detect the utilization of duplicate addresses. In the network above, consider what happens when a
new host is connected to the LAN. If this host is configured by mistake with the same address as hostA (i.e.
2001:db8:1234:5678::AA), problems could occur. Indeed, if two hosts have the same IPv6 address on the
LAN, but different MAC addresses, it will be difficult to correctly reach them. IPv6 anticipated this problem and
includes a Duplicate Address Detection Algorithm (DAD). When an IPv6 address 38 is configured on a host, by
any means, the host must verify the uniqueness of this address on the LAN. For this, it multicasts an ICMPv6
Neighbor Solicitation that queries the network for its newly configured address. The IPv6 source address of this
NS is set to :: (i.e. the reserved unspecified address) if the host does not already have an IPv6 address on this
subnet. If the NS does not receive any answer, the new address is considered to be unique and can safely be
used. Otherwise, the new address is refused and an error message should be returned to the system administrator or a new IPv6 address should be generated. The Duplicate Address Detection Algorithm can prevent various
operational problems that are often difficult to debug.
Few users manually configure the IPv6 addresses on their hosts. They prefer to rely on protocols that can automatically configure their IPv6 addresses. IPv6 supports two such protocols : DHCPv6 and the Stateless Address
Autoconfiguration (SLAAC).
The Stateless Address Autoconfiguration (SLAAC) mechanism defined in RFC 4862 enables hosts to automatically configure their addresses without maintaining any state. When a host boots, it derives its identifier from
its datalink layer address 39 as explained earlier and concatenates this 64-bit identifier to the FE80::/64 prefix
to obtain its link-local IPv6 address. It then multicasts a Neighbor Solicitation with its link-local address as a
target to verify whether another host is using the same link-local address on this subnet. If it receives a Neighbor
Advertisement indicating that the link-local address is used by another host, it generates another 64-bit identifier
and sends a Neighbor Solicitation again. If there is no answer, the host considers its link-local address to be
valid. This address will be used as the source address for all NDP messages sent on the subnet.
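The link-local configuration loop described above can be sketched as follows. The exchange of Neighbor Solicitation and Neighbor Advertisement messages is replaced by a lookup in a set of addresses already in use, and a random identifier is drawn when a conflict is found. The function name, the retry limit and the use of random identifiers are assumptions of this example rather than requirements of RFC 4862.

```python
import ipaddress
import secrets

LINK_LOCAL_PREFIX = bytes.fromhex("fe80000000000000")   # fe80::/64


def configure_link_local(interface_id, addresses_in_use, max_attempts=3):
    """Return a link-local address that passed the simulated DAD check.

    `interface_id` is the 8-byte identifier derived from the MAC address.
    `addresses_in_use` stands in for the hosts that would answer the NS.
    """
    for _ in range(max_attempts):
        address = ipaddress.IPv6Address(LINK_LOCAL_PREFIX + interface_id)
        if address not in addresses_in_use:      # no NA received: unique
            return address
        interface_id = secrets.token_bytes(8)    # try another identifier
    raise RuntimeError("could not find a unique link-local address")
```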
To automatically configure its global IPv6 address, the host must know the globally routable IPv6 prefix that is
used on the local subnet. IPv6 routers regularly multicast ICMPv6 Router Advertisement messages that indicate
the IPv6 prefix assigned to the subnet. The Router Advertisement message contains several interesting fields.
This message is sent from the link-local address of the router on the subnet. Its destination is the IPv6 multicast
address that targets all IPv6 enabled hosts (i.e. ff02::1). The Cur Hop Limit field, if different from zero, specifies
the default Hop Limit that hosts should use when sending IPv6 packets from this subnet. 64 is a frequently used
value. The M and O bits are used to indicate that some information can be obtained from DHCPv6. The Router
Lifetime parameter provides the expected lifetime (in seconds) of the sending router acting as a default router.
This lifetime makes it possible to plan the replacement of a router by another one in the same subnet. The Reachable Time
to the same subnet. Using a Hop Limit of 255 provides a simple check for this. If the message was generated by
an attacker outside the subnet, it would reach the subnet with a decremented Hop Limit. Checking that the Hop
Limit is set to 255 is a simple 40 verification that the packet was generated on this particular subnet. RFC 5082
provides other examples of protocols that use this hack and discusses its limitations.
Routers regularly send Router Advertisement messages. These messages are triggered by a timer that is often set
to approximately 30 seconds. Usually, hosts wait for the arrival of a Router Advertisement message to configure
their address. This implies that hosts could sometimes need to wait 30 seconds before being able to configure their
address. If this delay is too long, a host can also send a Router Solicitation message. This message is sent towards
the multicast address that corresponds to all IPv6 routers on the link (i.e. FF02::2) and the default router will reply.
The last point that needs to be explained about ICMPv6 is the Redirect message. This message is used when there
is more than one router on a subnet as shown in the figure below.
Figure: hosts A (MAC : 0023:4567:89ab) and B (MAC : 0034:5678:9abc), router1 (0045:6789:abcd) and router2 (0012:3456:7878) attached to the same LAN.
In this network, router1 is the default router for all hosts. The second router, router2, provides connectivity
to a specific IPv6 subnet, e.g. 2001:db8:abcd::/48. These two routers attached to the same subnet can be
used in different ways. First, it is possible to manually configure the routing tables on all hosts to add a route
towards 2001:db8:abcd::/48 via router2. Unfortunately, forcing such manual configuration defeats
the benefits of using address auto-configuration in IPv6. The second approach is to automatically configure
a default route via router1 on all hosts. With such a route, when a host needs to send a packet to any address
within 2001:db8:abcd::/48, it will send it to router1. router1 would consult its routing table and find
that the packet needs to be sent again on the subnet to reach router2. This is a waste of time. A better approach
would be to enable the hosts to automatically learn the new route. This is possible thanks to the ICMPv6 Redirect
message. When router1 receives a packet that needs to be forwarded back on the same interface, it replies
with a Redirect message that indicates that the packet should have been sent via router2. Upon reception of a
Redirect message, the host updates its forwarding table to include a new transient entry for the destination reported
in the message. A timeout is usually associated with this transient entry to automatically delete it after some time.
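The effect of a Redirect message on a host can be illustrated with a minimal forwarding table. The class below is a sketch: its name, its methods and the 600-second lifetime of the transient entries are assumptions of this example.

```python
class HostForwardingTable:
    """Default route plus transient per-destination entries from Redirects."""

    def __init__(self, default_router, redirect_lifetime=600.0):
        self.default_router = default_router
        self.redirect_lifetime = redirect_lifetime   # assumed, in seconds
        self.redirects = {}    # destination -> (next hop, expiry time)

    def handle_redirect(self, destination, better_router, now):
        """Install a transient entry for the destination of the Redirect."""
        self.redirects[destination] = (
            better_router, now + self.redirect_lifetime)

    def next_hop(self, destination, now):
        """Return the router to use, removing the entry once it expired."""
        entry = self.redirects.get(destination)
        if entry is not None:
            if now < entry[1]:
                return entry[0]
            del self.redirects[destination]
        return self.default_router
```

A host using this table first sends packets towards 2001:db8:abcd::/48 via router1, then via router2 once the Redirect has been processed, and falls back to router1 when the entry times out.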
An alternative is the Dynamic Host Configuration Protocol (DHCP) defined in RFC 2131 and RFC 3315. DHCP
allows a host to automatically retrieve its assigned IPv6 address, but relies on a server. A DHCP server is associated
to each subnet 41. Each DHCP server manages a pool of IPv6 addresses assigned to the subnet. When a host is
first attached to the subnet, it sends a DHCP request message in a UDP segment (a DHCPv4 server listens on port
67, a DHCPv6 server on port 547). As the host knows neither its IPv6 address nor the IPv6 address of the DHCP server, this UDP segment is sent
inside a multicast packet targeted at the DHCP servers. The DHCP request may contain various options such as the
name of the host, its datalink layer address, etc. The server captures the DHCP request and selects an unassigned
address in its address pool. It then sends the assigned IPv6 address in a DHCP reply message which contains the
datalink layer address of the host and additional information such as the subnet mask, the address of the default
router or the address of the DNS resolver. The DHCP reply also specifies the lifetime of the address allocation.
This forces the host to renew its address allocation once it expires. Thanks to the limited lease time, IP addresses
are automatically returned to the pool of addresses when hosts are powered off.
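The role of the lease time can be illustrated with a toy address pool. This sketch does not implement the DHCP message formats. The class name, the client identifiers and the addresses are assumptions chosen for the example.

```python
class LeasePool:
    """Toy DHCP-like pool that hands out addresses for a limited time."""

    def __init__(self, addresses, lease_time):
        self.free = list(addresses)
        self.lease_time = lease_time   # in seconds
        self.leases = {}               # client id -> (address, expiry time)

    def request(self, client_id, now):
        """Allocate or renew an address; return None if the pool is empty."""
        self._reclaim(now)
        if client_id in self.leases:
            address = self.leases[client_id][0]    # renewal
        elif self.free:
            address = self.free.pop(0)
        else:
            return None
        self.leases[client_id] = (address, now + self.lease_time)
        return address

    def _reclaim(self, now):
        """Return the addresses of expired leases to the pool."""
        for client_id, (address, expiry) in list(self.leases.items()):
            if expiry <= now:
                del self.leases[client_id]
                self.free.append(address)
```

A host that is powered off simply stops renewing, and its address becomes available again once the lease expires.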
Both SLAAC and DHCPv6 can be extended to provide additional information beyond the IPv6 prefix/address. For
example, RFC 6106 defines options for the ICMPv6 ND message that can carry the IPv6 address of the recursive
DNS resolver and a list of default domain search suffixes. It is also possible to combine SLAAC with DHCPv6.
RFC 3736 defines a stateless variant of DHCPv6 that can be used to distribute DNS information while SLAAC is
used to distribute the prefixes.
40 Using a Hop Limit of 255 prevents one family of attacks against ICMPv6, but other attacks still remain possible. A detailed discussion
of the security issues with IPv6 is outside the scope of this book. It is possible to secure NDP by using the Cryptographically Generated IPv6
Addresses (CGA) defined in RFC 3972. The Secure Neighbor Discovery Protocol is defined in RFC 3971. A detailed discussion of the
security of IPv6 may be found in [HV2008].
41 In practice, there is usually one DHCP server per group of subnets and the routers capture on each subnet the DHCP messages and
forward them to the DHCP server.
Warning: This is an unpolished draft of the second edition of this ebook. If you find any error or have
suggestions to improve the text, please create an issue via https://github.com/obonaventure/cnp3/issues/new
See http://bgp.potaroo.net/index-as.html for reports on the evolution of the number of Autonomous Systems over time.
filter that specifies which routes can be accepted by a domain, an export filter that specifies which routes can be
advertised by a domain and a ranking algorithm that selects the best route when a domain knows several routes
towards the same destination prefix. As we will see later, another important difference is that the objective of the
interdomain routing protocol is to find the cheapest route towards each destination. There is only one interdomain
routing protocol : BGP.
3.14.1 RIP
The Routing Information Protocol (RIP) is the simplest routing protocol that was standardised for the TCP/IP
protocol suite. RIP is defined in RFC 2453. Additional information about RIP may be found in [Malkin1999].
RIP routers periodically exchange RIP messages. The format of these messages is shown below. A RIP message
is sent inside a UDP segment whose destination port is set to 521. A RIP message contains several fields. The
Cmd field indicates whether the RIP message is a request or a response. When a router boots, its routing table is
empty and it cannot forward any packet. To speed up the discovery of the network, it can send a request message to
the RIP IPv6 multicast address, FF02::9. All RIP routers listen to this multicast address and any router attached
to the subnet will reply by sending its own routing table as a sequence of RIP messages. In steady state, routers
multicast one or more RIP response messages every 30 seconds. These messages contain the distance vectors that
summarize the router's routing table. The current version of RIP is version 2 defined in RFC 2453 for IPv4 and
RFC 2080 for IPv6.
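To illustrate how a router could use the distance vectors carried in a response, here is a minimal sketch of the usual distance vector update rule. It is a simplification and not the full procedure of the RIP specifications: the function name, the table layout and the unit link cost are assumptions, and timers and split horizon are ignored. The value 16 is the metric that RIP uses to represent an unreachable destination.

```python
RIP_INFINITY = 16   # metric that denotes an unreachable destination


def process_response(table, neighbor, entries, link_cost=1):
    """Merge the (prefix, metric) entries received from `neighbor`.

    `table` maps a prefix to (metric, next hop). Returns True when the
    routing table was modified.
    """
    changed = False
    for prefix, metric in entries:
        new_metric = min(metric + link_cost, RIP_INFINITY)
        current = table.get(prefix)
        if current is None:
            if new_metric < RIP_INFINITY:          # new reachable prefix
                table[prefix] = (new_metric, neighbor)
                changed = True
        elif new_metric < current[0] or (
                current[1] == neighbor and new_metric != current[0]):
            # Shorter route, or news from the neighbor currently in use
            table[prefix] = (new_metric, neighbor)
            changed = True
    return changed
```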
3.14.2 OSPF
Link-state routing protocols are used in IP networks. Open Shortest Path First (OSPF), defined in RFC 2328, is the
link-state routing protocol that has been standardised by the IETF. The latest version of OSPF, which supports IPv6,
is defined in RFC 5340. OSPF is frequently used in enterprise networks and in some ISP networks. However,
ISP networks often use the IS-IS link-state routing protocol [ISO10589], which was developed for the ISO CLNP
protocol but was adapted to be used in IP RFC 1195 networks before the finalisation of the standardisation
of OSPF. A detailed analysis of IS-IS and OSPF may be found in [BMO2006] and [Perlman2000]. Additional
information about OSPF may be found in [Moy1998].
Compared to the basics of link-state routing protocols that we discussed in section Link state routing, there are
some particularities of OSPF that are worth discussing. First, in a large network, flooding the information about
all routers and links to thousands of routers or more may be costly as each router needs to store all the information
about the entire network. A better approach would be to introduce hierarchical routing. Hierarchical routing
divides the network into regions. All the routers inside a region have detailed information about the topology of
the region but only learn aggregated information about the topology of the other regions and their interconnections.
OSPF supports a restricted variant of hierarchical routing. In OSPF's terminology, a region is called an area.
OSPF imposes restrictions on how a network can be divided into areas. An area is a set of routers and links that
are grouped together. Usually, the topology of an area is chosen so that a packet sent by one router inside the area
can reach any other router in the area without leaving the area 43. An OSPF area contains two types of routers
RFC 2328:
 Internal router : A router whose directly connected networks belong to the area
 Area border routers : A router that is attached to several areas.
For example, the network shown in the figure below has been divided into three areas : area 1, containing routers
R1, R3, R4, R5 and RA, area 2, containing R7, R8, R9, R10, RB and RC, and the backbone area. OSPF areas are identified by a 32-bit
integer, which is sometimes represented as an IP address. Among the OSPF areas, area 0, also called the backbone
area, has a special role. The backbone area groups all the area border routers (routers RA, RB and RC in the figure
below) and the routers that are directly connected to the backbone routers but do not belong to another area (router
RD in the figure below). An important restriction imposed by OSPF is that the path between two routers that
belong to two different areas (e.g. R1 and R8 in the figure below) must pass through the backbone area.
Inside each non-backbone area, routers distribute the topology of the area by exchanging link state packets with
the other routers in the area. The internal routers do not know the topology of other areas, but each router knows
how to reach the backbone area. Inside an area, the routers only exchange link-state packets for all destinations
that are reachable inside the area. In OSPF, the inter-area routing is done by exchanging distance vectors. This is
illustrated by the network topology shown below.
Let us first consider OSPF routing inside area 2. All routers in the area learn a route towards 2001:db8:1234::/48
and 2001:db8:5678::/48. The two area border routers, RB and RC, create network summary advertisements.
43 OSPF can support virtual links to connect routers together that belong to the same area but are not directly connected. However, this goes
beyond this introduction to OSPF.
Assuming that all links have a unit link metric, these would be:
 RB advertises 2001:db8:1234::/48 at a distance of 2 and 2001:db8:5678::/48 at a distance of 3
 RC advertises 2001:db8:5678::/48 at a distance of 2 and 2001:db8:1234::/48 at a distance of 3
These summary advertisements are flooded through the backbone area attached to routers RB and RC. In its
routing table, router RA selects the summary advertised by RB to reach 2001:db8:1234::/48 and the summary
advertised by RC to reach 2001:db8:5678::/48. Inside area 1, router RA advertises a summary indicating that
2001:db8:1234::/48 and 2001:db8:5678::/48 are both at a distance of 3 from itself.
On the other hand, consider the prefixes 2001:db8:aaaa:0000::/64 and 2001:db8:aaaa:0001::/64 that are inside
area 1. Router RA is the only area border router that is attached to this area. This router can create two different
network summary advertisements :
 2001:db8:aaaa:0000::/64 at a distance of 1 and 2001:db8:aaaa:0001::/64 at a distance of 2 from RA
 2001:db8:aaaa:0000::/63 at a distance of 2 from RA
The first summary advertisement provides precise information about the distance used to reach each prefix. However, all routers in the network have to maintain a route towards 2001:db8:aaaa:0000::/64 and a route towards
2001:db8:aaaa:0001::/64 that are both via router RA. The second advertisement would improve the scalability
of OSPF by reducing the number of routes that are advertised across area boundaries. However, in practice this
requires manual configuration on the border routers.
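The aggregation of the two /64 prefixes into a single /63 can be checked with the `ipaddress` module of the Python standard library. The helper below is a sketch: its name is invented for this example, and advertising the aggregate with the largest distance among the covered prefixes is a choice that matches the second summary above, not something the module decides.

```python
import ipaddress


def summarize(routes):
    """Aggregate (prefix, distance) pairs into summary advertisements.

    Each summary carries the largest distance among the prefixes that
    it covers (an assumption that matches the example in the text).
    """
    networks = {ipaddress.IPv6Network(p): d for p, d in routes}
    summaries = []
    for net in ipaddress.collapse_addresses(networks):
        covered = [d for n, d in networks.items() if n.subnet_of(net)]
        summaries.append((str(net), max(covered)))
    return summaries


print(summarize([("2001:db8:aaaa:0000::/64", 1),
                 ("2001:db8:aaaa:0001::/64", 2)]))
# prints [('2001:db8:aaaa::/63', 2)]
```

Prefixes that are not adjacent are left untouched by `collapse_addresses`, so they would still be advertised separately.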
The second OSPF particularity that is worth discussing is the support of Local Area Networks (LAN). As shown
in the example below, several routers may be attached to the same LAN.
Figure: routers R1 (2001:db8:1234::11/48), R2 (2001:db8:1234::22/48), R3 (2001:db8:1234::33/48) and R4 (2001:db8:1234::44/48) attached to the same LAN.
A first solution to support such a LAN with a link-state routing protocol would be to consider that a LAN is
equivalent to a full-mesh of point-to-point links as if each router can directly reach any other router on the LAN.
However, this approach has two important drawbacks :
1. Each router must exchange HELLOs and link state packets with all the other routers on the LAN. This
increases the number of OSPF packets that are sent and processed by each router.
2. Remote routers, when looking at the topology distributed by OSPF, consider that there is a full-mesh of
links between all the LAN routers. Such a full-mesh implies a lot of redundancy in case of failure, while in
practice the entire LAN may completely fail. In case of a failure of the entire LAN, all routers need to detect
the failures and flood link state packets before the LAN is completely removed from the OSPF topology by
remote routers.
To better represent LANs and reduce the number of OSPF packets that are exchanged, OSPF handles LANs differently. When OSPF routers boot on a LAN, they elect 44 one of them as the Designated Router (DR) RFC 2328.
The DR represents the local area network and advertises the LAN's subnet. Furthermore, LAN routers
only exchange HELLO packets with the DR. Thanks to the use of a DR, the topology of the LAN appears
as a set of point-to-point links connected to the DR.
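The benefit of the DR can be illustrated by counting the links that appear in the topology. The sketch below is a simple calculation, not a part of OSPF; it assumes a LAN with n routers and compares the full-mesh model with the model where every other router is only attached to the DR. The function names are chosen for this example.

```python
def full_mesh_links(n):
    """Point-to-point links needed to model a LAN of n routers as a full mesh."""
    return n * (n - 1) // 2

def designated_router_links(n):
    """Links needed when every other router is attached only to the DR."""
    return max(n - 1, 0)

for n in (4, 10, 50):
    print(n, full_mesh_links(n), designated_router_links(n))
```

For the four routers of the example LAN, the full mesh contains 6 links while the DR-based representation contains only 3.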
Note: How to quickly detect a link failure ?
Network operators expect an OSPF network to be able to quickly recover from link or router failures [VPD2004].
In an OSPF network, the recovery after a failure is performed in three steps [FFEB2005]:
• the routers that are adjacent to the failure detect it quickly. The default solution is to rely on the regular
exchange of HELLO packets. However, the interval between successive HELLOs is often set to 10 seconds. Setting the HELLO timer down to a few milliseconds is difficult as HELLO packets are created and
processed by the main CPU of the routers and these routers cannot easily generate and process a HELLO
packet every millisecond on each of their interfaces. A better solution is to use a dedicated failure detection
protocol such as the Bidirectional Forwarding Detection (BFD) protocol defined in [KW2009] that can be
implemented directly on the router interfaces. Another solution to be able to detect the failure is to instrument the physical and the datalink layer so that they can interrupt the router when a link fails. Unfortunately,
such a solution cannot be used on all types of physical and datalink layers.
• the routers that have detected the failure flood their updated link state packets in the network
• all routers update their routing table
44 The OSPF Designated Router election procedure is defined in RFC 2328. Each router can be configured with a router priority that
influences the election process since the router with the highest priority is preferred when an election is run.
A last, but operationally important, point needs to be discussed about intradomain routing protocols such as OSPF
and IS-IS. Intradomain routing protocols always select the shortest path for each destination. In practice, there are
often several equal paths towards the same destination. When a router computes several equal cost paths towards
one destination, it can use these paths in different ways.
A first approach is to select one of the equal cost paths (e.g. the first or the last path found by the SPF computation)
and install it in the forwarding table. In this case, only one path is used to reach each destination.
A second approach is to install all equal cost paths 45 in the forwarding table and load-balance the packets on the
different paths. Consider the case where a router has N different outgoing interfaces to reach destination d. A first
possibility to load-balance the traffic among these interfaces is to use round-robin. Round-robin balances
the packets equally among the N outgoing interfaces. This equal load-balancing is important in practice because
it spreads the load more evenly throughout the network. However, few networks use this round-robin strategy
to load-balance traffic on routers. The main drawback of round-robin is that packets that belong to the same flow
(e.g. TCP connection) may be forwarded over different paths. If packets belonging to the same TCP connection
are sent over different paths, they will probably experience different delays and arrive out-of-sequence at their
destination. When a TCP receiver detects out-of-order segments, it sends duplicate acknowledgements that may
cause the sender to initiate a fast retransmission and enter congestion avoidance. Thus, out-of-order segments may
lead to lower TCP performance. This is annoying for a load-balancing technique whose objective is to improve
the network performance by spreading the load.
To efficiently spread the load over different paths, routers need to implement per-flow load-balancing. This implies
that they must forward all the packets that belong to the same flow on the same path. Since a TCP connection is
always identified by the four-tuple (source and destination addresses, source and destination ports), one possibility
would be to select an outgoing interface upon arrival of the first packet of the flow and store this decision in the
router's memory. Unfortunately, such a solution does not scale since the required memory grows with the number
of TCP connections that pass through the router.
Fortunately, it is possible to perform per-flow load balancing without maintaining any state on the router. Most
routers today use hash functions for this purpose RFC 2991. When a packet arrives, the router extracts the Next
Header information and the four-tuple from the packet and computes:
$\mathrm{hash}(\mathit{NextHeader}, IP_{src}, IP_{dst}, Port_{src}, Port_{dst}) \bmod N$
In this formula, N is the number of outgoing interfaces on the equal cost paths towards the packet's destination.
Various hash functions are possible, including CRC, checksum or MD5 RFC 2991. Since the hash function is
computed over the four-tuple, the same hash value will be computed for all packets belonging to the same flow.
This prevents reordering due to load balancing inside the network. Most routers support this kind of load-balancing
today [ACO+2006].
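A minimal sketch of this stateless selection is shown below. It uses CRC-32 from Python's zlib module as the hash function, which is only one of the possible choices mentioned above. The function name select_interface and the way the fields are concatenated are assumptions made for this example; real routers compute the hash in hardware over the binary header fields.

```python
import zlib

def select_interface(next_header, src_ip, dst_ip, src_port, dst_port, n_interfaces):
    """Map a flow to one of n_interfaces outgoing interfaces without per-flow state."""
    # Illustrative encoding: the fields are joined into a byte string before hashing.
    key = f"{next_header}|{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % n_interfaces

# Two packets of the same TCP flow (Next Header 6) select the same interface.
first = select_interface(6, "2001:db8::1", "2001:db8::2", 49152, 80, 4)
second = select_interface(6, "2001:db8::1", "2001:db8::2", 49152, 80, 4)
print(first == second)
```

Since the result only depends on the header fields, no memory is consumed per flow, at the price of a load distribution that depends on the traffic mix.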
Warning: This is an unpolished draft of the second edition of this ebook. If you find any error or have
suggestions to improve the text, please create an issue via https://github.com/obonaventure/cnp3/issues/new
45 In some networks, there are several dozens of paths towards a given destination. Some routers, due to hardware limitations, cannot install
more than 8 or 16 paths in their forwarding table. In this case, a subset of the computed paths is installed in the forwarding table.
of these domains, using only private peering links would be too costly. A better solution that allows many domains
to interconnect cheaply is the Internet eXchange Point (IXP). An IXP is usually some space in a data center that
hosts routers belonging to different domains. A domain willing to exchange packets with other domains present
at the IXP installs one of its routers on the IXP and connects it to other routers inside its own network. The IXP
contains a Local Area Network to which all the participating routers are connected. When two domains that are
present at the IXP wish 49 to exchange packets, they simply use the Local Area Network. IXPs are very popular
in Europe and many Internet Service Providers and Content providers are present in these IXPs.
know a route via AS4 that allows them to reach hosts inside AS7. From a routing perspective, the commercial
contract between AS7 and AS4 leads to the following routes being exchanged:
• over a customer->provider relationship, the customer domain advertises to its provider all its routes and all
the routes that it has learned from its own customers.
• over a provider->customer relationship, the provider advertises all the routes that it knows to its customer.
The second rule ensures that the customer domain receives a route towards all destinations that are reachable via
its provider. The first rule allows the routes of the customer domain to be distributed throughout the Internet.
Coming back to the figure above, AS4 advertises to its two providers AS1 and AS2 its own routes and the routes
learned from its customer, AS7. On the other hand, AS4 advertises to AS7 all the routes that it knows.
The second type of peering relationship is the shared-cost peering relationship. Such a relationship usually does
not involve a payment from one domain to the other in contrast with the customer->provider relationship. A
shared-cost peering relationship is usually established between domains having a similar size and geographic
coverage. For example, consider the figure above. If AS3 and AS4 exchange many packets via AS1, they both need
to pay AS1. A cheaper alternative for AS3 and AS4 would be to establish a shared-cost peering. Such a peering
can be established at IXPs where both AS3 and AS4 are present or by using private peering links. This shared-cost
peering should be used to exchange packets between hosts inside AS3 and hosts inside AS4. However, AS3 does
not want to receive on the AS3-AS4 shared-cost peering links packets whose destination belongs to AS1 as AS3
would have to pay to send these packets to AS1.
From a routing perspective, over a shared-cost peering relationship a domain only advertises its internal routes
and the routes that it has learned from its customers. This restriction ensures that only packets destined to the
local domain or one of its customers are received over the shared-cost peering relationship. This implies that the
routes that have been learned from a provider or from another shared-cost peer are not advertised over a shared-cost
peering relationship. This is motivated by economic reasons. If a domain were to advertise the routes that it
learned from a provider over a shared-cost peering relationship that does not bring revenue, it would allow
its shared-cost peer to use the link with its provider without any payment. If a domain were to advertise the routes
it learned over a shared-cost peering over another shared-cost peering relationship, it would allow these
shared-cost peers to use its own network (which may span one or more continents) freely to exchange packets.
Finally, the last type of peering relationship is the sibling. Such a relationship is used when two domains exchange
all their routes in both directions. In practice, such a relationship is only used between domains that belong to the
same company.
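The export rules for the customer->provider and shared-cost relationships can be summarised in a few lines of code. The sketch below is only an illustration of these rules: the constant names and the may_export function are assumptions made for this example, and the sibling relationship is left out.

```python
# Origin of a route: learned from a customer, a provider, a shared-cost peer,
# or originated inside the local domain.
CUSTOMER, PROVIDER, PEER, INTERNAL = "customer", "provider", "peer", "internal"

def may_export(learned_from, export_to):
    """Return True if a route learned over learned_from may be advertised over export_to."""
    if export_to == CUSTOMER:
        # A provider advertises all the routes that it knows to its customers.
        return True
    # Towards providers and shared-cost peers: only internal and customer routes.
    return learned_from in (INTERNAL, CUSTOMER)

print(may_export(PROVIDER, PEER))      # provider route over a shared-cost peering
print(may_export(CUSTOMER, PROVIDER))  # customer route towards a provider
```

With these rules, a domain never carries traffic between two of its providers or shared-cost peers, which matches the economic motivation given above.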
These different types of relationships are implemented in the interdomain routing policies defined by each domain.
The interdomain routing policy of a domain is composed of three main parts:
• the import filter that specifies, for each peering relationship, the routes that can be accepted from the neighbouring domain (the non-acceptable routes are ignored and the domain never uses them to forward packets)
• the export filter that specifies, for each peering relationship, the routes that can be advertised to the neighbouring domain
• the ranking algorithm that is used to select the best route among all the routes that the domain has received
towards the same destination prefix
A domain's import and export filters can be defined by using the Route Policy Specification Language (RPSL)
specified in RFC 2622 [GAVE1999]. Some Internet Service Providers, notably in Europe, use RPSL to document 50
their import and export policies. Several tools help to easily convert an RPSL policy into router commands.
The figure below provides a simple example of import and export filters for two domains in a simple internetwork.
In RPSL, the keyword ANY stands for any route from any domain. It is typically used by a provider to
indicate that it announces all its routes to a customer over a provider->customer relationship. This is the case
for AS4's export policy. The example below clearly shows the difference between a provider->customer and a
shared-cost peering relationship. AS4's export filter indicates that it announces only its internal routes (AS4) and
the routes learned from its clients (AS7) over its shared-cost peering with AS3, while it advertises all the routes
that it uses (including the routes learned from AS3) to AS7.
50 See ftp://ftp.ripe.net/ripe/dbase for the RIPE database that contains the import and export policies of many European ISPs.
[Figure: advertisement of prefix 2001:db8:cafe::/48, originated by AS1, across AS4, AS2 and AS5; the AS Paths shown are AS1, AS4:AS1 and AS2:AS4:AS1]
Figure 3.70: A BGP peering session between two directly connected routers
has been configured with the IP address of R1 and its AS number. For security reasons, a router never establishes
a BGP session that has not been manually configured on the router.
BGP RFC 4271 defines several types of messages that can be exchanged over a BGP session:
• OPEN: this message is sent as soon as the TCP connection between the two routers has been established.
It initialises the BGP session and allows the negotiation of some options. Details about this message may
be found in RFC 4271.
• NOTIFICATION: this message is used to terminate a BGP session, usually because an error has been detected by the BGP peer. A router that sends or receives a NOTIFICATION message immediately shuts down
the corresponding BGP session.
• UPDATE: this message is used to advertise new or modified routes or to withdraw previously advertised
routes.
• KEEPALIVE: this message is used to ensure a regular exchange of messages on the BGP session, even
when no route changes. When a BGP router has not sent an UPDATE message during the last 30 seconds,
it shall send a KEEPALIVE message to confirm to the other peer that it is still up. If a peer does not receive
any BGP message during a period of 90 seconds 53 , the BGP session is considered to be down and all the
routes learned over this session are withdrawn.
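The interaction between the two timers can be sketched as follows. This is a simplified model that only uses the 30 and 90 seconds values quoted above; the class SessionTimers and its methods are hypothetical names introduced for this illustration and do not come from RFC 4271. Time is passed explicitly (in seconds) so that the behaviour is easy to follow.

```python
KEEPALIVE_INTERVAL = 30  # seconds without sending before a KEEPALIVE is due
HOLD_TIME = 90           # seconds without receiving before the session is down

class SessionTimers:
    """Tracks when messages were last sent and received on one BGP session."""

    def __init__(self, now):
        self.last_sent = now
        self.last_received = now

    def message_sent(self, now):
        self.last_sent = now

    def message_received(self, now):
        self.last_received = now

    def keepalive_due(self, now):
        # Nothing was sent for 30 seconds: a KEEPALIVE must be sent.
        return now - self.last_sent >= KEEPALIVE_INTERVAL

    def session_down(self, now):
        # Nothing was received for 90 seconds: the session is considered down.
        return now - self.last_received >= HOLD_TIME

timers = SessionTimers(now=0)
print(timers.keepalive_due(now=30), timers.session_down(now=30))
```

With these values, three consecutive KEEPALIVE messages must be lost before a peer declares the session down.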
As explained earlier, BGP relies on incremental updates. This implies that when a BGP session starts, each router
first sends BGP UPDATE messages to advertise to the other peer all the exportable routes that it knows. Once
all these routes have been advertised, the BGP router only sends BGP UPDATE messages about a prefix if the
route is new, one of its attributes has changed or the route became unreachable and must be withdrawn. The BGP
UPDATE message allows BGP routers to efficiently exchange such information while minimising the number of
bytes exchanged. Each UPDATE message contains :
53 90 seconds is the default delay recommended by RFC 4271. However, two BGP peers can negotiate a different timer during the
establishment of their BGP session. Using too small an interval to detect BGP session failures is not recommended. BFD [KW2009] can be
used to replace BGP's KEEPALIVE mechanism if fast detection of interdomain link failures is required.
In the above pseudo-code, the build_BGP_UPDATE(d) procedure extracts from the BGP Loc-RIB the best path
towards destination d (i.e. the route installed in the FIB) and prepares the corresponding BGP UPDATE message.
This message is then passed to the export filter that returns NULL if the route cannot be advertised to the peer or
the (possibly modified) BGP UPDATE message to be advertised. BGP routers allow network administrators to
specify very complex export filters, see e.g. [WMS2004]. A simple export filter that implements the equivalent of
split horizon is shown below.
def apply_export_filter(RemoteAS, BGPMsg):
    # check if RemoteAS already received the route,
    # i.e. if it already appears in the AS-Path
    if RemoteAS in BGPMsg.ASPath:
        BGPMsg = None
    # Many additional export policies can be configured :
    # Accept or refuse the BGPMsg
    # Modify selected attributes inside BGPMsg
    return BGPMsg
At this point, the remote router has received all the exportable BGP routes. After this initial exchange, the router
only sends BGP UPDATE messages when there is a change (addition of a route, removal of a route or change in
the attributes of a route) in one of these exportable routes. Such a change can happen when the router receives a
BGP message. The pseudo-code below summarizes the processing of these BGP messages.
def Recvd_BGPMsg(Msg, RemoteAS):
    # RemoteIP, the address of the peer, is assumed to be known
    # from the configuration of the BGP session
    B = apply_import_filter(RemoteAS, Msg)
    if B is None:  # Msg not acceptable
        return
    if IsUPDATE(Msg):
        Old_Route = BestRoute(Msg.prefix)
        Insert_in_RIB(Msg)
        Run_Decision_Process(RIB)
        if BestRoute(Msg.prefix) != Old_Route:
            # best route changed
            B = build_BGP_Message(Msg.prefix)
            S = apply_export_filter(RemoteAS, B)
            if S is not None:  # announce best route
                send_UPDATE(S, RemoteAS, RemoteIP)
            elif Old_Route is not None:
                send_WITHDRAW(Msg.prefix, RemoteAS, RemoteIP)
    else:  # Msg is WITHDRAW
        Old_Route = BestRoute(Msg.prefix)
        Remove_from_RIB(Msg)
        Run_Decision_Process(RIB)
        if BestRoute(Msg.prefix) != Old_Route:
            # best route changed
            B = build_BGP_Message(Msg.prefix)
            S = apply_export_filter(RemoteAS, B)
            if S is not None:  # still one best route towards Msg.prefix
                send_UPDATE(S, RemoteAS, RemoteIP)
            elif Old_Route is not None:  # No best route anymore
                send_WITHDRAW(Msg.prefix, RemoteAS, RemoteIP)
When a BGP message is received, the router first applies the peer's import filter to verify whether the message is
acceptable or not. If the message is not acceptable, the processing stops. The pseudo-code below shows a simple
import filter. This import filter accepts all routes, except those that already contain the local AS in their AS-Path.
If such a route were used, it would cause a routing loop. Another example of an import filter would be a filter used
by an Internet Service Provider on a session with a customer to only accept routes towards the IP prefixes assigned
to the customer by the provider. On real routers, import filters can be much more complex and some import filters
modify the attributes of the received BGP UPDATE [WMS2004] .
def apply_import_filter(RemoteAS, BGPMsg):
    # refuse the routes that already contain the local AS (MyAS)
    if MyAS in BGPMsg.ASPath:
        BGPMsg = None
    # Many additional import policies can be configured :
    # Accept or refuse the BGPMsg
    # Modify selected attributes inside BGPMsg
    return BGPMsg
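The two simple filters can be exercised with a small self-contained sketch. The BGPUpdate class, the lower-case function names and the AS numbers (taken from the private AS range) are assumptions made for this example; real BGP messages carry many more attributes.

```python
from dataclasses import dataclass, field

MY_AS = 65001  # hypothetical local AS number, chosen for this example only

@dataclass
class BGPUpdate:
    prefix: str
    as_path: list = field(default_factory=list)

def apply_import_filter(remote_as, msg):
    """Reject routes whose AS-Path already contains the local AS."""
    return None if MY_AS in msg.as_path else msg

def apply_export_filter(remote_as, msg):
    """Do not send a route to a peer whose AS is already on its AS-Path."""
    return None if remote_as in msg.as_path else msg

looping = BGPUpdate("2001:db8:cafe::/48", [65002, MY_AS, 65003])
clean = BGPUpdate("2001:db8:cafe::/48", [65002, 65003])

print(apply_import_filter(65002, looping))  # rejected: would create a loop
print(apply_import_filter(65002, clean))    # accepted unchanged
```

Both filters return either the message or None, which is the convention used by the pseudo-code above.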
Sometimes, the local-pref attribute is used to prefer a cheap link compared to a more expensive one. For example,
in the network below, AS1 could wish to send and receive packets mainly via its interdomain link with AS4.
Figure 3.74: How to prefer a cheap link over a more expensive one?
AS1 can install the following import filter on R1 to ensure that it always sends packets via R2 when it has learned
a route via AS2 and another via AS4.
import: from AS2 RA at R1 set localpref=100;
        from AS4 R2 at R1 set localpref=200;
        accept ANY
However, this import filter does not influence how AS3, for example, prefers some routes over others. If the link
between AS3 and AS2 is less expensive than the link between AS3 and AS4, AS3 could send all its packets via AS2
and AS1 would receive packets over its expensive link. An important point to remember about local-pref is that
it can be used to prefer some routes over others to send packets, but it has no influence on the routes followed by
received packets.
Another important use of the local-pref attribute is to support the customer->provider and shared-cost peering relationships. From an economic point of view, it matters over which type of
relationship a domain sends its packets. A domain usually earns money when it sends packets over a provider->customer relationship. On the other hand, it must pay its provider when it sends packets over a customer->provider relationship.
Using a shared-cost peering to send packets is usually neutral from an economic perspective. To take into account
these economic issues, domains usually configure the import filters on their routers as follows:
• insert a high local-pref attribute in the routes learned from a customer
• insert a medium local-pref attribute in the routes learned over a shared-cost peering
• insert a low local-pref attribute in the routes learned from a provider
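The three rules above can be sketched as an import step followed by a selection step. The numerical local-pref values, the neighbour names AS10, AS20 and AS30, and the function names are hypothetical and only serve this illustration; only the relative order of the values matters, and the later steps of the BGP decision process are omitted.

```python
# Hypothetical local-pref values; only their relative order matters.
LOCAL_PREF = {"customer": 300, "peer": 200, "provider": 100}

def import_route(prefix, neighbour, relationship):
    """Tag a received route with the local-pref that matches the relationship."""
    return {"prefix": prefix, "via": neighbour, "local_pref": LOCAL_PREF[relationship]}

def best_route(routes):
    """Select the route with the highest local-pref (later tie-breaking steps are omitted)."""
    return max(routes, key=lambda r: r["local_pref"])

candidates = [
    import_route("2001:db8:cafe::/48", "AS10", "provider"),
    import_route("2001:db8:cafe::/48", "AS20", "peer"),
    import_route("2001:db8:cafe::/48", "AS30", "customer"),
]
print(best_route(candidates)["via"])
```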
With such an import filter, the routers of a domain always prefer to reach destinations via their customers whenever
such a route exists. Otherwise, they prefer to use shared-cost peering relationships and they only send packets
via their providers when they do not know any alternate route. A consequence of setting the local-pref attribute
like this is that Internet paths are often asymmetrical. Consider for example the internetwork shown in the figure
below.
reception of the BGP Withdraws, AS3 and AS4 only know the direct route towards 2001:db8:1234::/48. AS3
(resp. AS4) sends U(2001:db8:1234::/48,AS3:AS1) (resp. U(2001:db8:1234::/48,AS4:AS1)) to AS4 (resp.
AS3). AS3 and AS4 could in theory continue to exchange BGP messages forever. In practice, one of them
sends one message faster than the other and BGP converges.
The example above has shown that the routes selected by BGP routers may sometimes depend on the ordering of
the BGP messages that are exchanged. Other similar scenarios may be found in RFC 4264.
From an operational perspective, the above configuration is annoying since the network operators cannot easily
predict which paths are chosen. Unfortunately, there are even more annoying BGP configurations. For example,
let us consider the configuration below, which is often named Bad Gadget [GW1999].
The first guideline implies that the provider of the provider of ASx cannot be a customer of ASx. Such a relationship
would not make sense from an economic perspective as it would imply circular payments. Furthermore, providers
are usually larger than customers.
The second guideline also corresponds to economic preferences. Since a provider earns money when sending
packets to one of its customers, it makes sense to prefer such customer learned routes over routes learned from
providers. [GR2001] also shows that BGP convergence is guaranteed even if an AS associates the same preference
to routes learned from a shared-cost peer and routes learned from a customer.
From a theoretical perspective, these guidelines should be verified automatically to ensure that BGP will always
converge in the global Internet. However, such a verification cannot be performed in practice because this would
force all domains to disclose their routing policies (and few are willing to do so) and furthermore the problem is
known to be NP-hard [GW1999].
In practice, researchers and operators expect that these guidelines are verified 55 in most domains. Thanks to the
large amount of BGP data that has been collected by operators and researchers 56 , several studies have analysed
the AS-level topology of the Internet. [SARK2002] is one of the first analyses. More recent studies include
[COZ2008] and [DKF+2007].
Based on these studies and [ATLAS2009], the AS-level Internet topology can be summarised as shown in the
figure below.
Due to this organisation of the Internet and due to the BGP decision process, most AS-level paths on the Internet
have a length of 3-5 AS hops.
PPP supports variable length packets, but LCP can negotiate a maximum packet length. The payload of the PPP frame is followed by
a Frame Check Sequence. The default is a 16-bit CRC, but some implementations can negotiate a 32-bit CRC.
The frame ends with the 01111110 flag.
3.16.2 Ethernet
Ethernet was designed in the 1970s at the Palo Alto Research Center [Metcalfe1976]. The first prototype 59 used
a coaxial cable as the shared medium and 3 Mbps of bandwidth. Ethernet was improved during the late 1970s
and, in the 1980s, Digital Equipment, Intel and Xerox published the first official Ethernet specification [DIX]. This
specification defines several important parameters for Ethernet networks. The first decision was to standardise
the commercial Ethernet at 10 Mbps. The second decision was the duration of the slot time. In Ethernet, a long
slot time enables networks to span a long distance but forces hosts to use a larger minimum frame size. The
compromise was a slot time of 51.2 microseconds, which corresponds to a minimum frame size of 64 bytes.
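The relation between the slot time and the minimum frame size is a simple division: the slot time is the time required to transmit a minimum-sized frame at the nominal bit rate. The short sketch below only redoes this arithmetic; the function name is an assumption made for this example.

```python
def slot_time_us(min_frame_bytes, bit_rate_bps):
    """Time, in microseconds, needed to transmit min_frame_bytes at bit_rate_bps."""
    return min_frame_bytes * 8 * 1_000_000 / bit_rate_bps

# 64 bytes are 512 bits; at 10 Mbps each bit lasts 0.1 microsecond.
print(slot_time_us(64, 10_000_000))
```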
The third decision was the frame format. The experimental 3 Mbps Ethernet network built at Xerox used short
frames containing 8-bit source and destination address fields, a 16-bit type indication, up to 554 bytes of payload
and a 16-bit CRC. Using 8-bit addresses was suitable for an experimental network, but it was clearly too small
for commercial deployments. Although the initial Ethernet specification [DIX] only allowed up to 1024 hosts on
an Ethernet network, it also recommended three important changes compared to the networking technologies that
were available at that time. The first change was to require each host attached to an Ethernet network to have a
globally unique datalink layer address. Until then, datalink layer addresses were manually configured on each host.
[DP1981] went against that state of the art and noted: "Suitable installation-specific administrative procedures are
also needed for assigning numbers to hosts on a network. If a host is moved from one network to another it may
be necessary to change its host number if its former number is in use on the new network. This is easier said than
done, as each network must have an administrator who must record the continuously changing state of the system
(often on a piece of paper tacked to the wall!). It is anticipated that in future office environments, hosts locations
will change as often as telephones are changed in present-day offices." The second change introduced by Ethernet
was to encode each address as a 48-bit field [DP1981]. 48-bit addresses were huge compared to the networking
technologies available in the 1980s, but the huge address space had several advantages [DP1981] including the
ability to allocate large blocks of addresses to manufacturers. Eventually, other LAN technologies opted for 48-bit
addresses as well [802]. The third change introduced by Ethernet was the definition of broadcast and multicast
addresses. The need for multicast Ethernet was foreseen in [DP1981] and thanks to the size of the addressing
space it was possible to reserve a large block of multicast addresses for each manufacturer.
59 Additional information about the history of the Ethernet technology may be found at http://ethernethistory.typepad.com/
The datalink layer addresses used in Ethernet networks are often called MAC addresses. They are structured as
shown in the figure below. The first bit of the address indicates whether the address identifies a network adapter
or a multicast group. The upper 24 bits are used to encode an Organisationally Unique Identifier (OUI). This OUI
identifies a block of addresses that has been allocated to a manufacturer by the secretariat 60 that is responsible
for the uniqueness of Ethernet addresses. Once a manufacturer has received an OUI, it can build and sell products
with one of the 16 million addresses in this block.
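The structure of a MAC address can be explored with a few lines of code. The sketch below assumes that addresses are written as six hexadecimal bytes separated by colons and that the "first bit" is the first bit sent on the wire, i.e. the least-significant bit of the first byte. The function name parse_mac and the second sample address are hypothetical.

```python
def parse_mac(mac):
    """Split a MAC address such as '01:00:5e:00:00:01' into its main components."""
    octets = bytes(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("a MAC address contains 48 bits, i.e. 6 bytes")
    return {
        # Assumption: the first bit on the wire is the least-significant
        # bit of the first byte; it is set for multicast groups.
        "multicast": bool(octets[0] & 0x01),
        # The upper 24 bits identify the block allocated to a manufacturer.
        "oui": octets[:3].hex(":"),
        # The remaining 24 bits leave 2**24 (about 16 million) addresses per OUI.
        "addresses_per_oui": 2 ** 24,
    }

print(parse_mac("01:00:5e:00:00:01"))
print(parse_mac("00:11:22:33:44:55"))
```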
60 Initially, the OUIs were allocated by Xerox [DP1981]. However, once Ethernet became an IEEE and later an ISO standard, the allocation
of the OUIs moved to IEEE. The list of all OUI allocations may be found at http://standards.ieee.org/regauth/oui/index.shtml
61 The official list of all assigned Ethernet type values is available from http://standards.ieee.org/regauth/ethertype/eth.txt
62 The attentive reader may question the need for different EtherTypes for IPv4 and IPv6 while the IP header already contains a version
field that can be used to distinguish between IPv4 and IPv6 packets. Theoretically, IPv4 and IPv6 could have used the same EtherType.
Unfortunately, developers of the early IPv6 implementations found that some devices did not check the version field of the IPv4 packets that
they received and parsed frames whose EtherType was set to 0x0800 as IPv4 packets. Sending IPv6 packets to such devices would have caused
disruptions. To avoid this problem, the IETF decided to apply for a distinct EtherType value for IPv6. Such a choice is now mandated by RFC
6274 (section 3.1), although we can find a funny counter-example in RFC 6214.
Figure 3.81: Impact of the frame length on the maximum channel utilisation [SH1980]
provide hardware assistance to compute the TCP checksum, but this is more complex than if the TCP checksum
were placed in the trailer 63 .
The Ethernet frame format shown above is specified in [DIX]. This is the format used to send both IPv4 RFC 894
and IPv6 packets RFC 2464. After the publication of [DIX], the Institute of Electrical and Electronics Engineers
(IEEE) began to standardise several Local Area Network technologies. IEEE worked on several LAN technologies, starting with Ethernet, Token Ring and Token Bus. These three technologies were completely different, but
they all agreed to use the 48-bit MAC addresses specified initially for Ethernet [802]. While developing its
Ethernet standard [802.3], the IEEE 802.3 working group was confronted with a problem. Ethernet mandated a
minimum payload size of 46 bytes, while some companies were looking for a LAN technology that could transparently transport short frames containing only a few bytes of payload. Such a frame can be sent by an Ethernet
host by padding it to ensure that the payload is at least 46 bytes long. However, since the Ethernet header [DIX]
does not contain a length field, it is impossible for the receiver to determine how many useful bytes were placed
inside the payload field. To solve this problem, the IEEE decided to replace the Type field of the Ethernet [DIX]
header with a Length field 64 . This Length field contains the number of useful bytes in the frame payload. The
payload must still contain at least 46 bytes, but padding bytes are added by the sender and removed by the receiver.
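The role of the Length field can be illustrated with a short sketch of the padding performed by the sender and its removal by the receiver. The function names are hypothetical and the use of zero bytes as padding is an assumption of this example; only the 46 bytes minimum comes from the text.

```python
MIN_PAYLOAD = 46  # minimum Ethernet payload size in bytes

def build_payload_field(data):
    """Return (length, padded payload) as a sender would place them in the frame."""
    # Assumption: padding bytes are zeros.
    padding = b"\x00" * max(0, MIN_PAYLOAD - len(data))
    return len(data), data + padding

def extract_data(length, payload):
    """Receiver side: keep only the useful bytes announced by the Length field."""
    return payload[:length]

length, payload = build_payload_field(b"hello")
print(length, len(payload), extract_data(length, payload))
```

Without the Length field, the receiver of this frame would have no way to tell the five useful bytes from the padding that follows them.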
In order to add the Length field without significantly changing the frame format, IEEE had to remove the Type
field. Without this field, it is impossible for a receiving host to identify the type of network layer packet inside a
received frame. To solve this new problem, IEEE developed a completely new sublayer called the Logical Link
Control [802.2]. Several protocols were defined in this sublayer. One of them provided a slightly different version
of the Type field of the original Ethernet frame format. Another contained acknowledgements and retransmissions
to provide a reliable service. In practice, [802.2] is never used to support IP in Ethernet networks. The figure
below shows the official [802.3] frame format.
then enabled the utilisation of 500 meter long segments. A 10Base5 network can also include repeaters between
segments.
The second physical layer was 10Base2. This physical layer used a thin coaxial cable that was easier to install
than the 10Base5 cable, but could not be longer than 185 meters. A 10BaseF physical layer was also defined
to transport Ethernet over point-to-point optical links. The major change to the physical layer was the support
of twisted pairs in the 10BaseT specification. Twisted pair cables are traditionally used to support the telephone
service in office buildings. Most office buildings today are equipped with structured cabling. Several twisted pair
cables are installed between any room and a central telecom closet per building or per floor in large buildings.
These telecom closets act as concentration points for the telephone service but also for LANs.
The introduction of the twisted pairs led to two major changes to Ethernet. The first change concerns the physical
topology of the network. 10Base2 and 10Base5 networks are shared buses: the coaxial cable typically passes
through each room that contains a connected computer. A 10BaseT network is a star-shaped network. All the
devices connected to the network are attached to a twisted pair cable that ends in the telecom closet. From
a maintenance perspective, this is a major improvement. The cable is a weak point in 10Base2 and 10Base5
networks. Any physical damage to the cable breaks the entire network and, when such a failure occurs, the
network administrator has to manually check the entire cable to detect where it is damaged. With 10BaseT,
when one twisted pair is damaged, only the device connected to this twisted pair is affected and this does not
affect the other devices. The second major change introduced by 10BaseT was that it was impossible to build a
10BaseT network by simply connecting all the twisted pairs together. All the twisted pairs must be connected to
a relay that operates in the physical layer. This relay is called an Ethernet hub. A hub is thus a physical layer
relay that receives an electrical signal on one of its interfaces, regenerates the signal and transmits it over all its
other interfaces. Some hubs are also able to convert the electrical signal from one physical layer to another (e.g.
10BaseT to 10Base2 conversion).
Comments
Thick coaxial cable, 500m
Thin coaxial cable, 185m
Two pairs of category 3+ UTP
10 Mb/s over optical fiber
Category 5 UTP or STP, 100 m maximum
Two multimode optical fibers, 2 km maximum
Two pairs of shielded twisted pair, 25m maximum
Two multimode or single mode optical fibers with lasers
Optical fiber but also Category 6 UTP
Optical fiber (experiments are performed with copper)
Ethernet Switches
Increasing the physical layer bandwidth as in Fast Ethernet was only one of the solutions to improve the performance of Ethernet LANs. A second solution was to replace the hubs with more intelligent devices. As Ethernet
hubs operate in the physical layer, they can only regenerate the electrical signal to extend the geographical reach
of the network. From a performance perspective, it would be more interesting to have devices that operate in the
datalink layer and can analyse the destination address of each frame and forward the frames selectively on the link
that leads to the destination. Such devices are usually called Ethernet switches 65 . An Ethernet switch is a relay
that operates in the datalink layer as is illustrated in the figure below.
An Ethernet switch understands the format of the Ethernet frames and can selectively forward frames over each
interface. For this, each Ethernet switch maintains a MAC address table. This table contains, for each MAC
address known by the switch, the identifier of the switch's port over which a frame sent towards this address must
be forwarded to reach its destination. This is illustrated below with the MAC address table of the bottom switch.
When the switch receives a frame destined to address B, it forwards the frame on its South port. If it receives a
frame destined to address D, it forwards it only on its North port.
65 The first Ethernet relays that operated in the datalink layer were called bridges. In practice, the main difference between switches and
bridges is that bridges were usually implemented in software while switches are hardware-based devices. Throughout this text, we always use
One of the selling points of Ethernet networks is that, thanks to the utilisation of 48-bit MAC addresses, an
Ethernet LAN is plug and play at the datalink layer. When two hosts are attached to the same Ethernet segment or
hub, they can immediately exchange Ethernet frames without requiring any configuration. It is important to retain
this plug and play capability for Ethernet switches as well. This implies that Ethernet switches must be able to build
their MAC address table automatically without requiring any manual configuration. This automatic configuration
is performed by the MAC address learning algorithm that runs on each Ethernet switch. This algorithm extracts
the source address of the received frames and remembers the port over which a frame from each source Ethernet
address has been received. This information is inserted into the MAC address table that the switch uses to forward
frames. This allows the switch to automatically learn the port that it can use to reach each destination address,
provided that the corresponding host has previously sent at least one frame. This is not a problem in practice since
most upper layer protocols use acknowledgements at some layer, and thus even an Ethernet printer sends Ethernet frames.
The pseudo-code below details how an Ethernet switch forwards Ethernet frames. It first updates its MAC address
table with the source address of the frame. The MAC address table used by some switches also contains a
timestamp that is updated each time a frame is received from each known source address. This timestamp is
used to remove from the MAC address table entries that have not been active during the last n minutes. This limits
the growth of the MAC address table, but also allows hosts to move from one port to another. The switch uses its
MAC address table to forward the received unicast frame. If there is an entry for the frame's destination address
in the MAC address table, the frame is forwarded selectively on the port listed in this entry. Otherwise, the switch
does not know how to reach the destination address and it must forward the frame on all its ports except the port
from which the frame has been received. This ensures that the frame will reach its destination, at the expense of
some unnecessary transmissions. These unnecessary transmissions will only last until the destination has sent its
first frame. Multicast and broadcast frames are forwarded in a similar way: they are sent on all ports except the
port from which they have been received.
# Arrival of frame F on port P
# Table : MAC address table dictionary : addr->port
# Ports : list of all ports on the switch
src = F.SourceAddress
dst = F.DestinationAddress
Table[src] = P  # src heard on port P
if isUnicast(dst):
    if dst in Table:
        # known destination: forward on a single port
        ForwardFrame(F, Table[dst])
    else:
        # unknown destination: flood on all ports except P
        for o in Ports:
            if o != P:
                ForwardFrame(F, o)
else:
    # multicast or broadcast destination
    for o in Ports:
        if o != P:
            ForwardFrame(F, o)
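The learning and forwarding steps, together with the timestamp-based removal of inactive entries described earlier, can be sketched as a small runnable Python class. This is only an illustration under stated assumptions: the class name `LearningSwitch`, the `max_age` parameter, the injectable `clock` and the representation of MAC addresses as colon-separated strings are hypothetical choices, not part of any standard or of a real switch implementation.

```python
import time


def is_unicast(addr):
    """A MAC address is unicast when the low-order bit of its first byte is 0."""
    return (int(addr.split(":")[0], 16) & 1) == 0


class LearningSwitch:
    """Toy model of an Ethernet switch with MAC learning and entry aging."""

    def __init__(self, ports, max_age=300.0, clock=time.monotonic):
        self.ports = list(ports)
        self.max_age = max_age  # seconds of inactivity before removal (assumed value)
        self.clock = clock      # injectable clock, convenient for experiments
        self.table = {}         # addr -> (port, time of last frame from addr)

    def _expire(self, now):
        # Remove the entries that have not been refreshed during max_age seconds.
        stale = [a for a, (_, seen) in self.table.items() if now - seen > self.max_age]
        for addr in stale:
            del self.table[addr]

    def receive(self, src, dst, in_port):
        """Process one frame and return the list of ports on which it is forwarded."""
        now = self.clock()
        self._expire(now)
        self.table[src] = (in_port, now)  # src heard on in_port
        if is_unicast(dst) and dst in self.table:
            return [self.table[dst][0]]   # selective forwarding
        # Unknown unicast, multicast or broadcast: all ports except the arrival port.
        return [p for p in self.ports if p != in_port]
```

Because `receive` returns the output ports instead of transmitting anything, the behaviour is easy to observe: the first frame sent towards an unknown address is flooded, and later frames towards the same address use a single port until the entry expires.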
The MAC address learning algorithm combined with the forwarding algorithm works well in a tree-shaped network
such as the one shown above. However, to deal with link and switch failures, network administrators often add
redundant links to ensure that their network remains connected even after a failure. Let us consider what happens
in the Ethernet network shown in the figure below.
In addition to the identifier discussed above, the network administrator can also configure a cost to be associated
to each switch port. Usually, the cost of a port depends on its bandwidth and the [802.1d] standard recommends
the values below. Of course, the network administrator may choose other values. We will use the notation cost[p]
to indicate the cost associated to port p in this section.
Bandwidth    Cost
10 Mbps      2000000
100 Mbps     200000
1 Gbps       20000
10 Gbps      2000
100 Gbps     200
The Spanning Tree Protocol uses its own terminology that we illustrate in the figure above. A switch port can be
in three different states : Root, Designated and Blocked. All the ports of the root switch are in the Designated
state. The state of the ports on the other switches is determined based on the BPDU received on each port.
The Spanning Tree Protocol uses the ordering relationship to build the spanning tree. Each switch listens to
BPDUs on its ports. When BPDU=<R,c,T,p> is received on port q, the switch computes the port's priority
vector: V[q]=<R,c+cost[q],T,p,q> , where cost[q] is the cost associated to the port over which the BPDU was
received. The switch stores in a table the last priority vector received on each port. The switch then compares its
own identifier with the smallest root identifier stored in this table. If its own identifier is smaller, then the switch
is the root of the spanning tree and is, by definition, at distance 0 from the root. The BPDU of the switch is then
<R,0,R,p>, where R is the switch identifier and p will be set to the port number over which the BPDU is sent.
Otherwise, the switch chooses the best priority vector from its table, bv=<R,c,T,p>. The port over which this best
priority vector was learned is the switch port that is closest to the root switch. This port becomes the Root port of
the switch. There is only one Root port per switch. The switch can then compute its BPDU as BPDU=<R,c,S,p>,
where R is the root identifier, c the cost of the best priority vector, S the identifier of the switch and p will be
replaced by the number of the port over which the BPDU will be sent. The switch can then determine the state
of all its ports by comparing its own BPDU with the priority vector received on each port. If the switch's BPDU
is better than the priority vector of this port, the port becomes a Designated port. Otherwise, the port becomes a
Blocked port.
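The port state computation described above can be sketched in Python. This is a minimal illustration, not the algorithm of the standard: it assumes that the ordering relationship between BPDUs and priority vectors is a field-by-field (lexicographic) comparison in which smaller is better, that switch and port identifiers are integers, and that a port on which no BPDU has been received becomes Designated. The function name `stp_port_states` and its arguments are hypothetical.

```python
def stp_port_states(switch_id, ports, cost, received):
    """Compute the BPDU prefix <R,c,S> of a switch and the state of each port.

    ports    : list of port numbers of the switch
    cost     : dict port -> cost[port]
    received : dict port -> last BPDU (R, c, T, p) received on that port
    """
    # Priority vector V[q] = <R, c + cost[q], T, p, q> for each port q with a BPDU.
    vectors = {q: (r, c + cost[q], t, p, q) for q, (r, c, t, p) in received.items()}
    best = min(vectors.values(), default=None)

    # The switch is the root if its identifier is smaller than every root it heard of.
    if best is None or switch_id < best[0]:
        return (switch_id, 0, switch_id), {q: "Designated" for q in ports}

    root_id, root_cost, root_port = best[0], best[1], best[4]
    states = {}
    for q in ports:
        if q == root_port:
            states[q] = "Root"
            continue
        own = (root_id, root_cost, switch_id, q)  # BPDU the switch would send on q
        heard = vectors.get(q)
        if heard is None or own < heard[:4]:      # assumption: tuple order = "better"
            states[q] = "Designated"
        else:
            states[q] = "Blocked"
    return (root_id, root_cost, switch_id), states
```

For example, with the hypothetical values `stp_port_states(7, [1, 2, 3], {1: 1, 2: 1, 3: 1}, {1: (1, 0, 1, 1), 2: (1, 0, 3, 2)})`, port 1 becomes the Root port, port 2 is Blocked because the vector learned from switch 3 is better than the BPDU of switch 7, and port 3 is Designated.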
The state of each port is important when considering the transmission of BPDUs. The root switch regularly sends
its own BPDU over all of its (Designated) ports. This BPDU is received on the Root port of all the switches that
are directly connected to the root switch. Each of these switches computes its own BPDU and sends this BPDU
over all its Designated ports. These BPDUs are then received on the Root port of downstream switches, which
then compute their own BPDU, etc. When the network topology is stable, switches send their own BPDU on
all their Designated ports, once they receive a BPDU on their Root port. No BPDU is sent on a Blocked port.
Switches listen for BPDUs on their Blocked and Designated ports, but no BPDU should be received over these
ports when the topology is stable. The utilisation of the ports for both BPDUs and data frames is summarised in
the table below.
Port state    Receives BPDUs    Sends BPDU
Blocked       yes               no
Root          yes               no
Designated    yes               yes
To illustrate the operation of the Spanning Tree Protocol, let us consider the simple network topology in the figure
below.
Assume that Switch4 is the first to boot. It sends its own BPDU=<4,0,4,?> on its two ports. When Switch1
boots, it sends BPDU=<1,0,1,1>. This BPDU is received by Switch4, which updates its table and computes a
new BPDU=<1,3,4,?>. Port 1 of Switch4 becomes the Root port while its second port is still in the Designated
state.
Assume now that Switch9 boots and immediately receives Switch1's BPDU on port 1. Switch9 computes its own
BPDU=<1,1,9,?> and port 1 becomes the Root port of this switch. This BPDU is sent on port 2 of Switch9 and
reaches Switch4. Switch4 compares the priority vector built from this BPDU (i.e. <1,2,9,2>) and notices that it is
better than Switch4's BPDU=<1,3,4,2>. Thus, port 2 becomes a Blocked port on Switch4.
During the computation of the spanning tree, switches discard all received data frames, as at that time the network
topology is not guaranteed to be loop-free. Once that topology has been stable for some time, the switches again
start to use the MAC learning algorithm to forward data frames. Only the Root and Designated ports are used
to forward data frames. Switches discard all the data frames received on their Blocked ports and never forward
frames on these ports.
Switches, ports and links can fail in a switched Ethernet network. When a failure occurs, the switches must be
able to recompute the spanning tree to recover from the failure. The Spanning Tree Protocol relies on regular
of WiFi devices. The development of this technology started in the late 1980s with the WaveLAN proprietary
wireless network. WaveLAN operated at 2 Mbps and used different frequency bands in different regions of the
world. In the early 1990s, the IEEE created the 802.11 working group to standardise a family of wireless network
technologies. This working group was very prolific and produced several wireless networking standards that use
different frequency ranges and different physical layers. The table below provides a summary of the main 802.11
standards.
Standard    Frequency    Typical throughput    Max bandwidth
802.11      2.4 GHz      0.9 Mbps              2 Mbps
802.11a     5 GHz        23 Mbps               54 Mbps
802.11b     2.4 GHz      4.3 Mbps              11 Mbps
802.11g     2.4 GHz      19 Mbps               54 Mbps
802.11n     2.4/5 GHz    74 Mbps               150 Mbps
When developing its family of standards, the IEEE 802.11 working group took an approach similar to that of the IEEE
802.3 working group, which developed various types of physical layers for Ethernet networks. 802.11 networks use
the CSMA/CA Medium Access Control technique described earlier and they all assume the same architecture and
use the same frame format.
The architecture of WiFi networks is slightly different from the Local Area Networks that we have discussed until
now. There are, in practice, two main types of WiFi networks : independent or adhoc networks and infrastructure
networks 66 . An independent or adhoc network is composed of a set of devices that communicate with each other.
These devices play the same role and the adhoc network is usually not connected to the global Internet. Adhoc
networks are used, for example, when a few laptops need to exchange information or to connect a computer to a
WiFi printer.
access point. When a WiFi station starts, it listens to beacon frames to find the available SSIDs. To be allowed to
send and receive frames via an access point, a WiFi station must be associated to this access point. If the access
point does not use any security mechanism to secure the wireless transmission, the WiFi station simply sends an
Association request frame to its preferred access point (usually the access point that it receives with the strongest
radio signal). This frame contains some parameters chosen by the WiFi station and the SSID that it requests to
join. The access point replies with an Association response frame if it accepts the WiFi station.
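The selection step described above (listen to beacons, prefer the access point received with the strongest signal, then send an Association request carrying the SSID) can be illustrated with a short Python sketch. Everything here is a simplified assumption made for illustration: the `Beacon` record, the field names and the dictionary used to represent the request are hypothetical and do not reflect the real 802.11 frame format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Beacon:
    """Information a station is assumed to extract from a received beacon frame."""
    ssid: str
    ap_address: str
    signal_dbm: float  # received signal strength; closer to 0 means stronger


def choose_access_point(beacons, wanted_ssid):
    """Return the beacon of the strongest access point announcing wanted_ssid, or None."""
    candidates = [b for b in beacons if b.ssid == wanted_ssid]
    return max(candidates, key=lambda b: b.signal_dbm, default=None)


def association_request(station_address, beacon):
    """Build a simplified Association request towards the chosen access point."""
    return {
        "type": "Association request",
        "from": station_address,
        "to": beacon.ap_address,
        "ssid": beacon.ssid,
    }
```

A station that hears two access points announcing the same SSID would pick the one with the higher `signal_dbm` value and address its request to it; an access point announcing a different SSID is ignored even if its signal is stronger.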
CHAPTER 4
Part 3: Practice
CHAPTER 5
Appendices
5.1 Glossary
AIMD Additive Increase, Multiplicative Decrease. A rate adaptation algorithm used notably by TCP where a host
additively increases its transmission rate when the network is not congested and multiplicatively decreases
it when congestion is detected.
anycast a transmission mode where information is sent from one source to one receiver that belongs to a
specified group
API Application Programming Interface
ARP The Address Resolution Protocol is a protocol used by IPv4 devices to obtain the datalink layer address
that corresponds to an IPv4 address on the local area network. ARP is defined in RFC 826
ARPANET The Advanced Research Project Agency (ARPA) Network is a network that was built by network
scientists in the USA with funding from the ARPA of the US Department of Defense. ARPANET is considered as
the grandfather of today's Internet.
ascii The American Standard Code for Information Interchange (ASCII) is a character-encoding scheme that
defines a binary representation for characters. The ASCII table contains both printable characters and
control characters. ASCII characters were encoded in 7 bits and only contained the characters required to
write text in English. Other character sets such as Unicode were developed later to support all written
languages.
ASN.1 The Abstract Syntax Notation One (ASN.1) was designed by ISO and ITU-T. It is a standard and flexible
notation that can be used to describe data structures for representing, encoding, transmitting, and decoding
data between applications. It was designed to be used in the Presentation layer of the OSI reference model
but is now used in other protocols such as SNMP.
ATM Asynchronous Transfer Mode
BGP The Border Gateway Protocol is the interdomain routing protocol used in the global Internet.
BNF A Backus-Naur Form (BNF) is a formal way to describe a language by using syntactic and lexical rules.
BNFs are frequently used to define programming languages, but also to define the messages exchanged
between networked applications. RFC 5234 explains how a BNF must be written to specify an Internet
protocol.
broadcast a transmission mode where the same information is sent to all nodes in the network
CIDR Classless Inter Domain Routing is the current address allocation architecture for IPv4. It was defined in
RFC 1518 and RFC 4632.
dial-up line A synonym for a regular telephone line, i.e. a line that can be used to dial any telephone number.
DNS The Domain Name System is a distributed database that can be queried by hosts to map names onto IP
addresses. It is defined in RFC 1035.
eBGP An eBGP session is a BGP session between two directly connected routers that belong to two different
Autonomous Systems. Also called an external BGP session.
EGP Exterior Gateway Protocol. Synonym of interdomain routing protocol
EIGRP The Enhanced Interior Gateway Routing Protocol (EIGRP) is a proprietary intradomain routing protocol
that is often used in enterprise networks. EIGRP uses the DUAL algorithm described in [Garcia1993].
frame a frame is the unit of information transfer in the datalink layer
Frame-Relay A wide area networking technology using virtual circuits that is deployed by telecom operators.
FTP The File Transfer Protocol, defined in RFC 959, was the de facto protocol to exchange files over the
Internet before the widespread adoption of HTTP (RFC 2616).
hosts.txt A file that initially contained the list of all Internet hosts with their IPv4 address. As the network grew,
this file was replaced by the DNS, but each host still maintains a small hosts.txt file that can be used when
DNS is not available.
HTML The HyperText Markup Language specifies the structure and the syntax of the documents that are exchanged on the world wide web. HTML is maintained by the HTML working group of the W3C
HTTP The HyperText Transfer Protocol is defined in RFC 2616
hub A relay operating in the physical layer.
IANA The Internet Assigned Numbers Authority (IANA) is responsible for the coordination of the DNS Root,
IP addressing, and other Internet protocol resources
iBGP An iBGP session is a BGP session between two routers belonging to the same Autonomous System. Also called
an internal BGP session.
ICANN The Internet Corporation for Assigned Names and Numbers (ICANN) coordinates the allocation of
domain names, IP addresses and AS numbers as well as protocol parameters. It also coordinates the operation
and the evolution of the DNS root name servers.
IETF The Internet Engineering Task Force is a non-profit organisation that develops the standards for the protocols used in the Internet. The IETF mainly covers the transport and network layers. Several application layer
protocols are also standardised within the IETF. The work in the IETF is organised in working groups. Most
of the work is performed by exchanging emails and there are three IETF meetings every year. Participation
is open to anyone. See http://www.ietf.org
IGP Interior Gateway Protocol. Synonym of intradomain routing protocol
IGRP The Interior Gateway Routing Protocol (IGRP) is a proprietary intradomain routing protocol that uses
distance vector. IGRP supports multiple metrics for each route but has been replaced by EIGRP
IMAP The Internet Message Access Protocol (IMAP), defined in RFC 3501, is an application-level protocol
that allows a client to access and manipulate the emails stored on a server. With IMAP, the email messages
remain on the server and are not downloaded on the client.
Internet a public internet, i.e. a network composed of different networks that are running IPv4 or IPv6
internet an internet is an internetwork, i.e. a network composed of different networks. The Internet is a very
popular internetwork, but other internets have been used in the past.
inverse query For DNS servers and resolvers, an inverse query is a query for the domain name that corresponds
to a given IP address.
IP Internet Protocol is the generic term for the network layer protocol in the TCP/IP protocol suite. IPv4 is
widely used today and IPv6 is expected to replace IPv4
IPv4 is the version 4 of the Internet Protocol, the connectionless network layer protocol used in most of the
Internet today. IPv4 addresses are encoded as a 32-bit field.
IPv6 is the version 6 of the Internet Protocol, the connectionless network layer protocol which is intended to
replace IPv4. IPv6 addresses are encoded as a 128-bit field.
IS-IS Intermediate System-Intermediate System. A link-state intradomain routing protocol that was initially defined for
the ISO CLNP protocol but was extended to support IPv4 and IPv6. IS-IS is often used in ISP networks. It
is defined in [ISO10589]
ISN The Initial Sequence Number of a TCP connection is the sequence number chosen by the client (resp. server)
that is placed in the SYN (resp. SYN+ACK) segment during the establishment of the TCP connection.
ISO The International Organization for Standardization is an independent, non-governmental organisation that is based
in Geneva and develops standards on various topics. Within ISO, country representatives vote to approve or reject standards. Most of the work on the development of ISO standards is done in expert working groups. Additional
information about ISO may be obtained from http://www.iso.int
ISO-3166 An ISO standard that defines codes to represent countries and their subdivisions. See
http://www.iso.org/iso/country_codes.htm
ISP An Internet Service Provider, i.e. a network that provides Internet access to its clients.
ITU The International Telecommunication Union is a United Nations agency whose purpose is to develop standards for the telecommunication industry. It was initially created to standardise the basic telephone system
but expanded later towards data networks. The work within ITU is mainly done by network specialists from
the telecommunication industry (operators and vendors). See http://www.itu.int for more information
IXP Internet eXchange Point. A location where routers belonging to different domains are attached to the same
Local Area Network to establish peering sessions and exchange packets. See http://www.euro-ix.net/ or
http://en.wikipedia.org/wiki/List_of_Internet_exchange_points_by_size for a partial list of IXPs.
LAN Local Area Network
leased line A telephone line that is permanently available between two endpoints.
MAN Metropolitan Area Network
MIME The Multipurpose Internet Mail Extensions (MIME) defined in RFC 2045 are a set of extensions to the
format of email messages that allow non-ASCII characters to be used inside mail messages. A MIME message
can be composed of several different parts each having a different format.
MIME document A MIME document is a document, encoded by using the MIME format.
minicomputer A minicomputer is a multi-user system that was typically used in the 1960s/1970s to serve
departments. See the corresponding wikipedia article for additional information : http://en.wikipedia.org/wiki/Minicomputer
modem A modem (modulator-demodulator) is a device that encodes (resp. decodes) digital information by modulating (resp. demodulating) an analog signal. Modems are frequently used to transmit digital information
over telephone lines and radio links. See http://en.wikipedia.org/wiki/Modem for a survey of various types
of modems
MSS A TCP option used by a TCP entity in SYN segments to indicate the Maximum Segment Size that it is able
to receive.
multicast a transmission mode where information is sent efficiently to all the receivers that belong to a given
group
nameserver A server that implements the DNS protocol and can answer queries for names inside its own domain.
NAT A Network Address Translator is a middlebox that translates the addresses carried in IP packets.
NBMA A Non-Broadcast Multiple Access Network is a subnetwork that supports multiple hosts/routers
but does not provide an efficient way of sending broadcast frames to all devices attached to the subnetwork.
ATM subnetworks are an example of NBMA networks.
network-byte order Internet protocols allow sequences of bytes to be transported. These sequences of bytes are sufficient to carry ASCII characters. The network-byte order refers to the Big-Endian encoding of 16-bit and 32-bit
integers. See http://en.wikipedia.org/wiki/Endianness
NFS The Network File System is defined in RFC 1094
NTP The Network Time Protocol is defined in RFC 1305
OSI Open Systems Interconnection. A set of networking standards developed by ISO including the 7 layers OSI
reference model.
OSPF Open Shortest Path First. A link-state intradomain routing protocol that is often used in enterprise and
ISP networks. OSPF is defined in RFC 2328 and RFC 5340
packet a packet is the unit of information transfer in the network layer
PBL Problem-based learning is a teaching approach that relies on problems.
POP The Post Office Protocol (POP), defined in RFC 1939, is an application-level protocol that allows a client to
download email messages stored on a server.
resolver A server that implements the DNS protocol and can resolve queries. A resolver usually serves a set
of clients (e.g. all hosts in campus or all clients of a given ISP). It sends DNS queries to nameservers
everywhere on behalf of its clients and stores the received answers in its cache. A resolver must know the
IP addresses of the root nameservers.
RIP Routing Information Protocol. An intradomain routing protocol based on distance vectors that is sometimes
used in enterprise networks. RIP is defined in RFC 2453.
RIR Regional Internet Registry. An organisation that manages IP addresses and AS numbers on behalf of IANA.
root nameserver A name server that is responsible for the root of the domain names hierarchy. There are
currently a dozen root nameservers and each DNS resolver must know their IP addresses. See http://www.root-servers.org/ for more information about the operation of these root servers.
round-trip-time The round-trip-time (RTT) is the delay between the transmission of a segment and the reception
of the corresponding acknowledgement in a transport protocol.
router A relay operating in the network layer.
RPC Several types of remote procedure calls have been defined. The RPC mechanism defined in RFC 5531 is
used by applications such as NFS
SDU (Service Data Unit) a Service Data Unit is the unit of information transferred between applications
segment a segment is the unit of information transfer in the transport layer
SMTP The Simple Mail Transfer Protocol is defined in RFC 821
SNMP The Simple Network Management Protocol is a management protocol defined for TCP/IP networks.
socket A low-level API originally defined on Berkeley Unix to allow programmers to develop clients and servers.
spoofed packet A packet is said to be spoofed when the sender of the packet has used as source address a
different address than its own.
SSH The Secure Shell (SSH) Transport Layer Protocol is defined in RFC 4253
standard query For DNS servers and resolvers, a standard query is a query for an A or an AAAA record. Such a
query typically returns an IP address.
switch A relay operating in the datalink layer.
SYN cookie SYN cookies are a technique used to compute the initial sequence number (ISN)
TCB The Transmission Control Block is the set of variables that are maintained for each established TCP connection by a TCP implementation.
TCP The Transmission Control Protocol is a protocol of the transport layer in the TCP/IP protocol suite that
provides a reliable bytestream connection-oriented service on top of IP
5.2 Bibliography
Whenever possible, the bibliography includes stable hypertext links to the references cited.
[802.11] LAN/MAN Standards Committee of the IEEE Computer Society. IEEE Standard for Information Technology - Telecommunications and information exchange between systems - local and metropolitan area networks - specific requirements - Part 11 : Wireless LAN Medium Access Control (MAC) and Physical Layer
(PHY) Specifications. IEEE, 1999.
[802.1d] LAN/MAN Standards Committee of the IEEE Computer Society, IEEE Standard for Local and
metropolitan area networks Media Access Control (MAC) Bridges, IEEE Std 802.1D-2004, 2004.
[802.1q] LAN/MAN Standards Committee of the IEEE Computer Society, IEEE Standard for Local and
metropolitan area networks Virtual Bridged Local Area Networks, 2005.
[802.2] IEEE 802.2-1998 (ISO/IEC 8802-2:1998), IEEE Standard for Information technology -
Telecommunications and information exchange between systems - Local and metropolitan area networks - Specific requirements - Part 2:
Logical Link Control. Available from http://standards.ieee.org/getieee802/802.2.html
[802.3] LAN/MAN Standards Committee of the IEEE Computer Society. IEEE Standard for Information
Technology - Telecommunications and information exchange between systems - local and metropolitan area networks - specific requirements - Part 3 : Carrier Sense multiple access with collision
detection (CSMA/CD) access method and physical layer specification. IEEE, 2000. Available from
http://standards.ieee.org/getieee802/802.3.html
[802.5] LAN/MAN Standards Committee of the IEEE Computer Society. IEEE Standard for Information
technology - Telecommunications and information exchange between systems - Local and metropolitan area
networks - Specific requirements - Part 5: Token Ring Access Method and Physical Layer Specification. IEEE,
1998. Available from http://standards.ieee.org/getieee802
[ACO+2006] Augustin, B., Cuvellier, X., Orgogozo, B., Viger, F., Friedman, T., Latapy, M., Magnien, C., Teixeira, R., Avoiding traceroute anomalies with Paris traceroute, Internet Measurement Conference, October
2006, See also http://www.paris-traceroute.net/
[AS2004] Androutsellis-Theotokis, S. and Spinellis, D.. 2004. A survey of peer-to-peer content distribution technologies. ACM Comput. Surv. 36, 4 (December 2004), 335-371.
[ATLAS2009] Labovitz, C., Iekel-Johnson, S., McPherson, D., Oberheide, J. and Jahanian, F., Internet interdomain traffic. In Proceedings of the ACM SIGCOMM 2010 conference on SIGCOMM (SIGCOMM '10).
ACM, New York, NY, USA, 75-86.
[AW05] Arlitt, M. and Williamson, C. 2005. An analysis of TCP reset behaviour on the internet. SIGCOMM
Comput. Commun. Rev. 35, 1 (Jan. 2005), 37-44.
[Abramson1970] Abramson, N., THE ALOHA SYSTEM: another alternative for computer communications. In
Proceedings of the November 17-19, 1970, Fall Joint Computer Conference (Houston, Texas, November 17-19, 1970). AFIPS '70 (Fall). ACM, New York, NY, 281-285.
[B1989] Berners-Lee, T., Information Management: A Proposal, March 1989
[Cheswick1990] Cheswick, B., An Evening with Berferd In Which a Cracker is Lured, Endured, and Studied,
Proc. Winter USENIX Conference, 1990, pp. 163-174
[Clark88] Clark D., The Design Philosophy of the DARPA Internet Protocols, Computer Communications Review 18:4, August 1988, pp. 106-114
[Comer1988] Comer, D., Internetworking with TCP/IP : principles, protocols & architecture, Prentice Hall, 1988
[Comer1991] Comer D., Internetworking With TCP/IP : Design Implementation and Internals, Prentice Hall,
1991
[Cohen1980] Cohen, D., On Holy Wars and Plea for Peace, IEN 137, April 1980,
http://www.ietf.org/rfc/ien/ien137.txt
[DC2009] Donahoo, M., Calvert, K., TCP/IP Sockets in C: Practical Guide for Programmers, Morgan Kaufmann,
2009.
[DIX] Digital, Intel, Xerox, The Ethernet: a local area network: data link layer and physical layer specifications.
SIGCOMM Comput. Commun. Rev. 11, 3 (Jul. 1981), 20-66.
[DKF+2007] Dimitropoulos, X., Krioukov, D., Fomenkov, M., Huffaker, B., Hyun, Y., Claffy, K., Riley, G., AS
Relationships: Inference and Validation, ACM SIGCOMM Computer Communication Review (CCR), Jan.
2007
[DP1981] Dalal, Y. K. and Printis, R. S., 48-bit absolute internet and Ethernet host numbers. In Proceedings of
the Seventh Symposium on Data Communications (Mexico City, Mexico, October 27 - 29, 1981). SIGCOMM
'81. ACM, New York, NY, 240-245.
[DRC+2010] Dukkipati, N., Refice, T., Cheng, Y., Chu, J., Herbert, T., Agarwal, A., Jain, A., Sutin, N., An
Argument for Increasing TCP's Initial Congestion Window, ACM SIGCOMM Computer Communications
Review, vol. 40 (2010), pp. 27-33
[Dubuisson2000] Dubuisson, O., ASN.1 : Communication between Heterogeneous Systems
<http://www.oss.com/asn1/resources/books-whitepapers-pubs/asn1-books.html#dubuisson>,
Morgan Kaufmann, 2000
[Dunkels2003] Dunkels, A., Full TCP/IP for 8-Bit Architectures. In Proceedings of the first international conference on mobile applications, systems and services (MOBISYS 2003), San Francisco, May 2003.
[DT2007] Donnet, B. and Friedman, T., Internet Topology Discovery: a Survey. IEEE Communications Surveys
and Tutorials, 9(4):2-15, December 2007
[DYGU2004] Davik, F., Yilmaz, M., Gjessing, S., Uzun, N., IEEE 802.17 resilient packet ring tutorial, IEEE
Communications Magazine, Mar 2004, Vol 42, N 3, p. 112-118
[Dijkstra1959] Dijkstra, E., A Note on Two Problems in Connection with Graphs. Numerische Mathematik,
1:269-271, 1959
[FDDI] ANSI. Information systems - Fiber Distributed Data Interface (FDDI) - token ring media access control
(MAC). ANSI X3.139-1987 (R1997), 1997
[Fletcher1982] Fletcher, J., An Arithmetic Checksum for Serial Transmissions, Communications, IEEE Transactions on, Jan. 1982, Vol. 30, N. 1, pp. 247-252
[FFEB2005] Francois, P., Filsfils, C., Evans, J., and Bonaventure, O., Achieving sub-second IGP convergence in
large IP networks. SIGCOMM Comput. Commun. Rev. 35, 3 (Jul. 2005), 35-44.
[NGB+1997] Nielsen, H., Gettys, J., Baird-Smith, A., Prudhommeaux, E., Wium Lie, H., and Lilley, C. Network
performance effects of HTTP/1.1, CSS1, and PNG. SIGCOMM Comput. Commun. Rev. 27, 4 (October 1997),
155-166.
[FJ1993] Sally Floyd and Van Jacobson. 1993. Random early detection gateways for congestion avoidance.
IEEE/ACM Trans. Netw. 1, 4 (August 1993), 397-413.
[FJ1994] Floyd, S., and Jacobson, V., The Synchronization of Periodic Routing Messages, IEEE/ACM Transactions on Networking, V.2 N.2, p. 122-136, April 1994
[FLM2008] Fuller, V., Lear, E., Meyer, D., Reclassifying 240/4 as usable unicast address space, Internet draft,
March 2008, work in progress
[FRT2002] Fortz, B., Rexford, J., Thorup, M., Traffic engineering with traditional IP routing protocols, IEEE
Communications Magazine, October 2002
[FTY99] Theodore Faber, Joe Touch, and Wei Yue, The TIME-WAIT state in TCP and Its Effect on Busy Servers,
Proc. Infocom 99, pp. 1573
[Feldmeier95] Feldmeier, D. C., Fast software implementation of error detection codes. IEEE/ACM Trans. Netw.
3, 6 (Dec. 1995), 640-651.
[GAVE1999] Govindan, R., Alaettinoglu, C., Varadhan, K., Estrin, D., An Architecture for Stable, Analyzable
Internet Routing, IEEE Network Magazine, Vol. 13, No. 1, pp. 29-35, January 1999
[GC2000] Grier, D., Campbell, M., A social history of Bitnet and Listserv, 1985-1991, Annals of the History of
Computing, IEEE, Volume 22, Issue 2, Apr-Jun 2000, pp. 32 - 41
[Genilloud1990] Genilloud, G., X.400 MHS: first steps towards an EDI communication standard. SIGCOMM
Comput. Commun. Rev. 20, 2 (Apr. 1990), 72-86.
[GGR2001] Gao, L., Griffin, T., Rexford, J., Inherently safe backup routing with BGP, Proc. IEEE INFOCOM,
April 2001
[GN2011] Gettys, J., Nichols, K., Bufferbloat: dark buffers in the internet. Communications of the ACM 55, no.
1 (2012): 57-65.
[GR2001] Gao, L., Rexford, J., Stable Internet routing without global coordination, IEEE/ACM Transactions on
Networking, December 2001, pp. 681-692
[GSW2002] Griffin, T. G., Shepherd, F. B., and Wilfong, G., The stable paths problem and interdomain routing.
IEEE/ACM Trans. Netw. 10, 2 (Apr. 2002), 232-243
[GW1999] Griffin, T. G. and Wilfong, G., An analysis of BGP convergence properties. SIGCOMM Comput.
Commun. Rev. 29, 4 (Oct. 1999), 277-288.
[GW2002] Griffin, T. and Wilfong, G. T., Analysis of the MED Oscillation Problem in BGP. In Proceedings of the
10th IEEE international Conference on Network Protocols (November 12 - 15, 2002). ICNP. IEEE Computer
Society, Washington, DC, 90-99
[Garcia1993] Garcia-Luna-Aceves, J., Loop-Free Routing Using Diffusing Computations, IEEE/ACM Transactions on Networking, Vol. 1, No. 1, Feb. 1993
[Gast2002] Gast, M., 802.11 Wireless Networks : The Definitive Guide, O'Reilly, 2002
[Gill2004] Gill, V., Lack of Priority Queuing Considered Harmful, ACM Queue, December 2004
[Goralski2009] Goralski, W., The Illustrated network : How TCP/IP works in a modern network, Morgan Kaufmann, 2009
[HFPMC2002] Huffaker, B., Fomenkov, M., Plummer, D., Moore, D., Claffy, K., Distance Metrics in the Internet,
Presented at the IEEE International Telecommunications Symposium (ITS) in 2002.
[HRX2008] Ha, S., Rhee, I., and Xu, L., CUBIC: a new TCP-friendly high-speed TCP variant. SIGOPS Oper.
Syst. Rev. 42, 5 (Jul. 2008), 64-74.
[HV2008] Hogg, S., Vyncke, E., IPv6 Security, Cisco Press, 2008
[IMHM2013] Ishihara, K., Mukai, M., Hiromi, R., Mawatari, M., Packet Filter and Route Filter Recommendation
for IPv6 at xSP routers, 2013
[ISO10589] ISO, Intermediate System to Intermediate System intra-domain routeing information exchange protocol for use in conjunction with the protocol for providing the connectionless-mode network service (ISO
8473), 2002
[Jacobson1988] Jacobson, V., Congestion avoidance and control. In Symposium Proceedings on Communications Architectures and Protocols (Stanford, California, United States, August 16 - 18, 1988). V. Cerf, Ed.
SIGCOMM 88. ACM, New York, NY, 314-329.
[Jain1990] Jain, R., Congestion control in computer networks : Issues and trends, IEEE Network Magazine, May
1990, pp. 24-30
[JLT2013] Jesup, R., Loreto, S., Tuexen, M., RTCWeb Data Channels, Internet draft, draft-ietf-rtcweb-datachannel, work in progress, 2013
[JSBM2002] Jung, J., Sit, E., Balakrishnan, H., and Morris, R. 2002. DNS performance and the effectiveness of
caching. IEEE/ACM Trans. Netw. 10, 5 (Oct. 2002), 589-603.
[JSON-RPC2] JSON-RPC Working group, JSON-RPC 2.0 Specification, available on http://www.jsonrpc.org,
2010
[Kerrisk2010] Kerrisk, M., The Linux Programming Interface, No Starch Press, 2010
[KM1995] Kent, C. A. and Mogul, J. C., Fragmentation considered harmful. SIGCOMM Comput. Commun. Rev.
25, 1 (Jan. 1995), 75-87.
[KNT2013] Kühlewind, M., Neuner, S., Trammell, B., On the state of ECN and TCP Options on the Internet.
Proceedings of the 14th Passive and Active Measurement conference (PAM 2013), Hong Kong, March 2013
[KP91] Karn, P. and Partridge, C., Improving round-trip time estimates in reliable transport protocols. ACM
Trans. Comput. Syst. 9, 4 (Nov. 1991), 364-373.
[KPD1985] Karn, P., Price, H., Diersing, R., Packet radio in amateur service, IEEE Journal on Selected Areas in
Communications, 3, May, 1985
[KPS2003] Kaufman, C., Perlman, R., and Sommerfeld, B. DoS protection for UDP-based protocols. In Proceedings of the 10th ACM Conference on Computer and Communications Security (Washington D.C., USA,
October 27 - 30, 2003). CCS 03. ACM, New York, NY, 2-7.
[KR1995] Kung, H.T., Morris, R., Credit-based flow control for ATM networks, IEEE Network, Mar/Apr 1995,
Volume: 9, Issue: 2, pages: 40-48
[KT1975] Kleinrock, L., Tobagi, F., Packet Switching in Radio Channels: Part I - Carrier Sense Multiple-Access
Modes and their Throughput-Delay Characteristics, IEEE Transactions on Communications, Vol. COM-23,
No. 12, pp. 1400-1416, December 1975.
[KW2009] Katz, D., Ward, D., Bidirectional Forwarding Detection, RFC 5880, June 2010
[KZ1989] Khanna, A. and Zinky, J. 1989. The revised ARPANET routing metric. SIGCOMM Comput. Commun.
Rev. 19, 4 (Aug. 1989), 45-56.
[KuroseRoss09] Kurose J. and Ross K., Computer networking : a top-down approach featuring the Internet,
Addison-Wesley, 2009
[Licklider1963] Licklider, J., Memorandum For Members and Affiliates of the Intergalactic Computer Network,
1963
[LCCD09] Leiner, B. M., Cerf, V. G., Clark, D. D., Kahn, R. E., Kleinrock, L., Lynch, D. C., Postel, J., Roberts,
L. G., and Wolff, S., A brief history of the internet. SIGCOMM Comput. Commun. Rev. 39, 5 (Oct. 2009),
22-31.
[LCP2005] Eng Keong Lua, Crowcroft, J., Pias, M., Sharma, R., Lim, S., A survey and comparison of peer-to-peer overlay network schemes, Communications Surveys & Tutorials, IEEE, Volume: 7, Issue: 2, 2005, pp.
72-93
[LeB2009] Leroy, D. and Bonaventure, O., Preparing network configurations for IPv6 renumbering, International
Journal of Network Management, 2009
[LFJLMT] Leffler, S., Fabry, R., Joy, W., Lapsley, P., Miller, S., Torek, C., An Advanced 4.4BSD Interprocess
Communication Tutorial, 4.4 BSD Programmers Supplementary Documentation
[LNO1996] T. V. Lakshman, Arnold Neidhardt, and Teunis J. Ott. 1996. The drop from front strategy in
TCP and in TCP over ATM. INFOCOM '96, Vol. 3. IEEE Computer Society, Washington, DC,
USA, 1242-1250.
[LSP1982] Lamport, L., Shostak, R., and Pease, M., The Byzantine Generals Problem. ACM Trans. Program.
Lang. Syst. 4, 3 (Jul. 1982), 382-401.
[Leboudec2008] Leboudec, J.-Y., Rate Adaptation Congestion Control and Fairness : a tutorial, Dec. 2008
[Malamud1991] Malamud, C., Analyzing DECnet/OSI phase V, Van Nostrand Reinhold, 1991
[McFadyen1976] McFadyen, J., Systems Network Architecture: An overview, IBM Systems Journal, Vol. 15, N.
1, pp. 4-23, 1976
[McKusick1999] McKusick, M., Twenty Years of Berkeley Unix : From AT&T-Owned to Freely
Redistributable, in Open Sources: Voices from the Open Source Revolution, O'Reilly, 1999,
http://oreilly.com/catalog/opensources/book/toc.html
[ML2011] Minei, I. and Lucek, J., MPLS-Enabled Applications: Emerging Developments and New Technologies
<http://www.amazon.com/MPLS-Enabled-Applications-Developments-TechnologiesCommunications/dp/0470665459> (Wiley Series on Communications Networking & Distributed Systems),
Wiley, 2011
[MRR1979] McQuillan, J. M., Richer, I., and Rosen, E. C., An overview of the new routing algorithm for the
ARPANET. In Proceedings of the Sixth Symposium on Data Communications (Pacific Grove, California,
United States, November 27 - 29, 1979). SIGCOMM 79. ACM, New York, NY, 63-68.
[MRR1980] McQuillan, J.M., Richer, I., Rosen, E., The New Routing Algorithm for the ARPANET, Communications, IEEE Transactions on, vol. 28, no. 5, pp. 711-719, May 1980
[MSMO1997] Mathis, M., Semke, J., Mahdavi, J., and Ott, T. 1997. The macroscopic behavior of the TCP
congestion avoidance algorithm. SIGCOMM Comput. Commun. Rev. 27, 3 (Jul. 1997), 67-82.
[MSV1987] Molle, M., Sohraby, K., Venetsanopoulos, A., Space-Time Models of Asynchronous CSMA Protocols for Local Area Networks, IEEE Journal on Selected Areas in Communications, Volume: 5 Issue: 6, Jul
1987 Page(s): 956 -96
[MUF+2007] Mühlbauer, W., Uhlig, S., Fu, B., Meulle, M., and Maennel, O., In search for an appropriate granularity to model routing policies. In Proceedings of the 2007 Conference on Applications, Technologies, Architectures, and Protocols For Computer Communications (Kyoto, Japan, August 27 - 31, 2007). SIGCOMM
07. ACM, New York, NY, 145-156.
[Malkin1999] Malkin, G., RIP: An Intra-Domain Routing Protocol, Addison Wesley, 1999
[Metcalfe1976] Metcalfe R., Boggs, D., Ethernet: Distributed packet-switching for local computer networks.
Communications of the ACM, 19(7):395-404, 1976.
[Mills2006] Mills, D.L., Computer Network Time Synchronization: the Network Time Protocol. CRC Press,
March 2006, 304 pp.
[Miyakawa2008] Miyakawa, S., From IPv4 only To v4/v6 Dual Stack, IETF72 IAB Technical Plenary, July 2008
[Mogul1995] Mogul, J., The case for persistent-connection HTTP. In Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols For Computer Communication (Cambridge, Massachusetts,
United States, August 28 - September 01, 1995). D. Oran, Ed. SIGCOMM 95. ACM, New York, NY, 299-313.
[Moore] Moore, R., Packet switching history, http://rogerdmoore.ca/PS/
[Moy1998] Moy, J., OSPF: Anatomy of an Internet Routing Protocol, Addison Wesley, 1998
[Myers1998] Myers, B. A., A brief history of human-computer interaction technology. interactions 5, 2 (Mar.
1998), 44-54.
[Nelson1965] Nelson, T. H., Complex information processing: a file structure for the complex, the changing and
the indeterminate. In Proceedings of the 1965 20th National Conference (Cleveland, Ohio, United States,
August 24 - 26, 1965). L. Winner, Ed. ACM 65. ACM, New York, NY, 84-100.
[Paxson99] Paxson, V., End-to-end Internet packet dynamics. SIGCOMM Comput. Commun. Rev. 27, 4 (Oct.
1997), 139-152.
[Perlman1985] Perlman, R., An algorithm for distributed computation of a spanning tree in an extended LAN.
SIGCOMM Comput. Commun. Rev. 15, 4 (Sep. 1985), 44-53.
[Perlman2000] Perlman, R., Interconnections : Bridges, routers, switches and internetworking protocols, 2nd
edition, Addison Wesley, 2000
[Perlman2004] Perlman, R., RBridges: Transparent Routing, Proc. IEEE Infocom , March 2004.
[Pouzin1975] Pouzin, L., The CYCLADES Network - Present state and development trends, Symposium on
Computer Networks, 1975 pp 8-13.
[Rago1993] Rago, S., UNIX System V network programming, Addison Wesley, 1993
[RE1989] Rochlis, J. A. and Eichin, M. W., With microscope and tweezers: the worm from MIT's perspective.
Commun. ACM 32, 6 (Jun. 1989), 689-698.
[RFC20] Cerf, V., ASCII format for network interchange, RFC 20, Oct. 1969
[RFC768] Postel, J., User Datagram Protocol, RFC 768, Aug. 1980
[RFC789] Rosen, E., Vulnerabilities of network control protocols: An example, RFC 789, July 1981
[RFC791] Postel, J., Internet Protocol, RFC 791, Sep. 1981
[RFC792] Postel, J., Internet Control Message Protocol, RFC 792, Sep. 1981
[RFC793] Postel, J., Transmission Control Protocol, RFC 793, Sept. 1981
[RFC813] Clark, D., Window and Acknowledgement Strategy in TCP, RFC 813, July 1982
[RFC819] Su, Z. and Postel, J., Domain naming convention for Internet user applications, RFC 819, Aug. 1982
[RFC821] Postel, J., Simple Mail Transfer Protocol, RFC 821, Aug. 1982
[RFC822] Crocker, D., Standard for the format of ARPA Internet text messages, RFC 822, Aug. 1982
[RFC826] Plummer, D., Ethernet Address Resolution Protocol: Or Converting Network Protocol Addresses to
48.bit Ethernet Address for Transmission on Ethernet Hardware, RFC 826, Nov. 1982
[RFC879] Postel, J., TCP maximum segment size and related topics, RFC 879, Nov. 1983
[RFC893] Leffler, S. and Karels, M., Trailer encapsulations, RFC 893, April 1984
[RFC894] Hornig, C., A Standard for the Transmission of IP Datagrams over Ethernet Networks, RFC 894, April
1984
[RFC896] Nagle, J., Congestion Control in IP/TCP Internetworks, RFC 896, Jan. 1984
[RFC952] Harrenstien, K. and Stahl, M. and Feinler, E., DoD Internet host table specification, RFC 952, Oct.
1985
[RFC959] Postel, J. and Reynolds, J., File Transfer Protocol, RFC 959, Oct. 1985
[RFC974] Partridge, C., Mail routing and the domain system, RFC 974, Jan. 1986
[RFC1032] Stahl, M., Domain administrators guide, RFC 1032, Nov. 1987
[RFC1035] Mockapetris, P., Domain names - implementation and specification, RFC 1035, Nov. 1987
[RFC1042] Postel, J. and Reynolds, J., Standard for the transmission of IP datagrams over IEEE 802 networks,
RFC 1042, Feb. 1988
[RFC1055] Romkey, J., Nonstandard for transmission of IP datagrams over serial lines: SLIP, RFC 1055, June
1988
[RFC1071] Braden, R., Borman D. and Partridge, C., Computing the Internet checksum, RFC 1071, Sep. 1988
[RFC1122] Braden, R., Requirements for Internet Hosts - Communication Layers, RFC 1122, Oct. 1989
[RFC1144] Jacobson, V., Compressing TCP/IP Headers for Low-Speed Serial Links, RFC 1144, Feb. 1990
[RFC1149] Waitzman, D., Standard for the transmission of IP datagrams on avian carriers, RFC 1149, Apr.
1990
[RFC1169] Cerf, V. and Mills, K., Explaining the role of GOSIP, RFC 1169, Aug. 1990
[RFC1191] Mogul, J. and Deering, S., Path MTU discovery, RFC 1191, Nov. 1990
[RFC1195] Callon, R., Use of OSI IS-IS for routing in TCP/IP and dual environments, RFC 1195, Dec. 1990
[RFC1258] Kantor, B., BSD Rlogin, RFC 1258, Sept. 1991
[RFC1321] Rivest, R., The MD5 Message-Digest Algorithm, RFC 1321, April 1992
[RFC1323] Jacobson, V., Braden R. and Borman, D., TCP Extensions for High Performance, RFC 1323, May
1992
[RFC1347] Callon, R., TCP and UDP with Bigger Addresses (TUBA), A Simple Proposal for Internet Addressing
and Routing, RFC 1347, June 1992
[RFC1518] Rekhter, Y. and Li, T., An Architecture for IP Address Allocation with CIDR, RFC 1518, Sept. 1993
[RFC1519] Fuller V., Li T., Yu J. and Varadhan, K., Classless Inter-Domain Routing (CIDR): an Address Assignment and Aggregation Strategy, RFC 1519, Sept. 1993
[RFC1542] Wimer, W., Clarifications and Extensions for the Bootstrap Protocol, RFC 1542, Oct. 1993
[RFC1548] Simpson, W., The Point-to-Point Protocol (PPP), RFC 1548, Dec. 1993
[RFC1550] Bradner, S. and Mankin, A., IP: Next Generation (IPng) White Paper Solicitation, RFC 1550, Dec.
1993
[RFC1561] Piscitello, D., Use of ISO CLNP in TUBA Environments, RFC 1561, Dec. 1993
[RFC1621] Francis, P., PIP Near-term architecture, RFC 1621, May 1994
[RFC1624] Rijsinghani, A., Computation of the Internet Checksum via Incremental Update, RFC 1624, May
1994
[RFC1631] Egevang K. and Francis, P., The IP Network Address Translator (NAT), RFC 1631, May 1994
[RFC1661] Simpson, W., The Point-to-Point Protocol (PPP), RFC 1661, Jul. 1994
[RFC1662] Simpson, W., PPP in HDLC-like Framing, RFC 1662, July 1994
[RFC1710] Hinden, R., Simple Internet Protocol Plus White Paper, RFC 1710, Oct. 1994
[RFC1738] Berners-Lee, T., Masinter, L., and McCahill M., Uniform Resource Locators (URL), RFC 1738, Dec.
1994
[RFC1752] Bradner, S. and Mankin, A., The Recommendation for the IP Next Generation Protocol, RFC 1752,
Jan. 1995
[RFC1812] Baker, F., Requirements for IP Version 4 Routers, RFC 1812, June 1995
[RFC1819] Delgrossi, L., Berger, L., Internet Stream Protocol Version 2 (ST2) Protocol Specification - Version
ST2+, RFC 1819, Aug. 1995
[RFC1889] Schulzrinne H., Casner S., Frederick, R. and Jacobson, V., RTP: A Transport Protocol for Real-Time
Applications, RFC 1889, Jan. 1996
[RFC1896] Resnick P., Walker A., The text/enriched MIME Content-type, RFC 1896, Feb. 1996
[RFC1918] Rekhter Y., Moskowitz B., Karrenberg D., de Groot G. and Lear, E., Address Allocation for Private
Internets, RFC 1918, Feb. 1996
[RFC1939] Myers, J. and Rose, M., Post Office Protocol - Version 3, RFC 1939, May 1996
[RFC1945] Berners-Lee, T., Fielding, R. and Frystyk, H., Hypertext Transfer Protocol -- HTTP/1.0, RFC 1945,
May 1996
[RFC1948] Bellovin, S., Defending Against Sequence Number Attacks, RFC 1948, May 1996
[RFC1951] Deutsch, P., DEFLATE Compressed Data Format Specification version 1.3, RFC 1951, May 1996
[RFC1981] McCann, J., Deering, S. and Mogul, J., Path MTU Discovery for IP version 6, RFC 1981, Aug. 1996
[RFC2003] Perkins, C., IP Encapsulation within IP, RFC 2003, Oct. 1996
[RFC2018] Mathis, M., Mahdavi, J., Floyd, S. and Romanow, A., TCP Selective Acknowledgment Options, RFC
2018, Oct. 1996
[RFC2045] Freed, N. and Borenstein, N., Multipurpose Internet Mail Extensions (MIME) Part One: Format of
Internet Message Bodies, RFC 2045, Nov. 1996
[RFC2046] Freed, N. and Borenstein, N., Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types,
RFC 2046, Nov. 1996
[RFC2050] Hubbard, K. and Kosters, M. and Conrad, D. and Karrenberg, D. and Postel, J., Internet Registry IP
Allocation Guidelines, RFC 2050, Nov. 1996
[RFC2080] Malkin, G. and Minnear, R., RIPng for IPv6, RFC 2080, Jan. 1997
[RFC2082] Baker, F. and Atkinson, R., RIP-2 MD5 Authentication, RFC 2082, Jan. 1997
[RFC2131] Droms, R., Dynamic Host Configuration Protocol, RFC 2131, March 1997
[RFC2140] Touch, J., TCP Control Block Interdependence, RFC 2140, April 1997
[RFC2225] Laubach, M., Halpern, J., Classical IP and ARP over ATM, RFC 2225, April 1998
[RFC2328] Moy, J., OSPF Version 2, RFC 2328, April 1998
[RFC2332] Luciani, J. and Katz, D. and Piscitello, D. and Cole, B. and Doraswamy, N., NBMA Next Hop Resolution Protocol (NHRP), RFC 2332, April 1998
[RFC2364] Gross, G. and Kaycee, M. and Li, A. and Malis, A. and Stephens, J., PPP Over AAL5, RFC 2364,
July 1998
[RFC2368] Hoffman, P. and Masinter, L. and Zawinski, J., The mailto URL scheme, RFC 2368, July 1998
[RFC2453] Malkin, G., RIP Version 2, RFC 2453, Nov. 1998
[RFC2460] Deering S., Hinden, R., Internet Protocol, Version 6 (IPv6) Specification, RFC 2460, Dec. 1998
[RFC2464] Crawford, M., Transmission of IPv6 Packets over Ethernet Networks, RFC 2464, Dec. 1998
[RFC2507] Degermark, M. and Nordgren, B. and Pink, S., IP Header Compression, RFC 2507, Feb. 1999
[RFC2516] Mamakos, L. and Lidl, K. and Evarts, J. and Carrel, J. and Simone, D. and Wheeler, R., A Method for
Transmitting PPP Over Ethernet (PPPoE), RFC 2516, Feb. 1999
[RFC2581] Allman, M. and Paxson, V. and Stevens, W., TCP Congestion Control, RFC 2581, April 1999
[RFC2616] Fielding, R. and Gettys, J. and Mogul, J. and Frystyk, H. and Masinter, L. and Leach, P. and Berners-Lee, T., Hypertext Transfer Protocol -- HTTP/1.1, RFC 2616, June 1999
[RFC2617] Franks, J. and Hallam-Baker, P. and Hostetler, J. and Lawrence, S. and Leach, P. and Luotonen, A.
and Stewart, L., HTTP Authentication: Basic and Digest Access Authentication, RFC 2617, June 1999
[RFC2622] Alaettinoglu, C. and Villamizar, C. and Gerich, E. and Kessens, D. and Meyer, D. and Bates, T. and
Karrenberg, D. and Terpstra, M., Routing Policy Specification Language (RPSL), RFC 2622, June 1999
[RFC2675] Tsirtsis, G. and Srisuresh, P., Network Address Translation - Protocol Translation (NAT-PT), RFC
2766, Feb. 2000
[RFC2854] Connolly, D. and Masinter, L., The text/html Media Type, RFC 2854, June 2000
[RFC2965] Kristol, D. and Montulli, L., HTTP State Management Mechanism, RFC 2965, Oct. 2000
[RFC2988] Paxson, V. and Allman, M., Computing TCP's Retransmission Timer, RFC 2988, Nov. 2000
[RFC2991] Thaler, D. and Hopps, C., Multipath Issues in Unicast and Multicast Next-Hop Selection, RFC 2991,
Nov. 2000
[RFC3021] Retana, A. and White, R. and Fuller, V. and McPherson, D., Using 31-Bit Prefixes on IPv4 Point-to-Point Links, RFC 3021, Dec. 2000
[RFC3022] Srisuresh, P., Egevang, K., Traditional IP Network Address Translator (Traditional NAT), RFC 3022,
Jan. 2001
[RFC3031] Rosen, E. and Viswanathan, A. and Callon, R., Multiprotocol Label Switching Architecture, RFC
3031, Jan. 2001
[RFC3168] Ramakrishnan, K. and Floyd, S. and Black, D., The Addition of Explicit Congestion Notification
(ECN) to IP, RFC 3168, Sept. 2001
[RFC3243] Carpenter, B. and Brim, S., Middleboxes: Taxonomy and Issues, RFC 3234, Feb. 2002
[RFC3235] Senie, D., Network Address Translator (NAT)-Friendly Application Design Guidelines, RFC 3235,
Jan. 2002
[RFC3309] Stone, J. and Stewart, R. and Otis, D., Stream Control Transmission Protocol (SCTP) Checksum
Change, RFC 3309, Sept. 2002
[RFC3315] Droms, R. and Bound, J. and Volz, B. and Lemon, T. and Perkins, C. and Carney, M., Dynamic Host
Configuration Protocol for IPv6 (DHCPv6), RFC 3315, July 2003
[RFC3330] IANA, Special-Use IPv4 Addresses, RFC 3330, Sept. 2002
[RFC3360] Floyd, S., Inappropriate TCP Resets Considered Harmful, RFC 3360, Aug. 2002
[RFC3390] Allman, M. and Floyd, S. and Partridge, C., Increasing TCP's Initial Window, RFC 3390, Oct. 2002
[RFC3490] Faltstrom, P. and Hoffman, P. and Costello, A., Internationalizing Domain Names in Applications
(IDNA), RFC 3490, March 2003
[RFC3501] Crispin, M., Internet Message Access Protocol - Version 4 rev1, RFC 3501, March 2003
[RFC3513] Hinden, R. and Deering, S., Internet Protocol Version 6 (IPv6) Addressing Architecture, RFC 3513,
April 2003
[RFC3596] Thomson, S. and Huitema, C. and Ksinant, V. and Souissi, M., DNS Extensions to Support IP Version
6, RFC 3596, October 2003
[RFC3748] Aboba, B. and Blunk, L. and Vollbrecht, J. and Carlson, J. and Levkowetz, H., Extensible Authentication Protocol (EAP), RFC 3748, June 2004
[RFC3819] Karn, P. and Bormann, C. and Fairhurst, G. and Grossman, D. and Ludwig, R. and Mahdavi, J. and
Montenegro, G. and Touch, J. and Wood, L., Advice for Internet Subnetwork Designers, RFC 3819, July 2004
[RFC3828] Larzon, L-A. and Degermark, M. and Pink, S. and Jonsson, L-E. and Fairhurst, G., The Lightweight
User Datagram Protocol (UDP-Lite), RFC 3828, July 2004
[RFC3927] Cheshire, S. and Aboba, B. and Guttman, E., Dynamic Configuration of IPv4 Link-Local Addresses,
RFC 3927, May 2005
[RFC3931] Lau, J. and Townsley, M. and Goyret, I., Layer Two Tunneling Protocol - Version 3 (L2TPv3), RFC
3931, March 2005
[RFC3971] Arkko, J. and Kempf, J. and Zill, B. and Nikander, P., SEcure Neighbor Discovery (SEND), RFC
3971, March 2005
[RFC3972] Aura, T., Cryptographically Generated Addresses (CGA), RFC 3972, March 2005
[RFC3986] Berners-Lee, T. and Fielding, R. and Masinter, L., Uniform Resource Identifier (URI): Generic Syntax, RFC 3986, January 2005
[RFC4033] Arends, R. and Austein, R. and Larson, M. and Massey, D. and Rose, S., DNS Security Introduction
and Requirements, RFC 4033, March 2005
[RFC4193] Hinden, R. and Haberman, B., Unique Local IPv6 Unicast Addresses, RFC 4193, Oct. 2005
[RFC4251] Ylonen, T. and Lonvick, C., The Secure Shell (SSH) Protocol Architecture, RFC 4251, Jan. 2006
[RFC4264] Griffin, T. and Huston, G., BGP Wedgies, RFC 4264, Nov. 2005
[RFC4271] Rekhter, Y. and Li, T. and Hares, S., A Border Gateway Protocol 4 (BGP-4), RFC 4271, Jan. 2006
[RFC4291] Hinden, R. and Deering, S., IP Version 6 Addressing Architecture, RFC 4291, Feb. 2006
[RFC4301] Kent, S. and Seo, K., Security Architecture for the Internet Protocol, RFC 4301, Dec. 2005
[RFC4302] Kent, S., IP Authentication Header, RFC 4302, Dec. 2005
[RFC4303] Kent, S., IP Encapsulating Security Payload (ESP), RFC 4303, Dec. 2005
[RFC4340] Kohler, E. and Handley, M. and Floyd, S., Datagram Congestion Control Protocol (DCCP), RFC
4340, March 2006
[RFC4443] Conta, A. and Deering, S. and Gupta, M., Internet Control Message Protocol (ICMPv6) for the Internet Protocol Version 6 (IPv6) Specification, RFC 4443, March 2006
[RFC4451] McPherson, D. and Gill, V., BGP MULTI_EXIT_DISC (MED) Considerations, RFC 4451, March
2006
[RFC4456] Bates, T. and Chen, E. and Chandra, R., BGP Route Reflection: An Alternative to Full Mesh Internal
BGP (IBGP), RFC 4456, April 2006
[RFC4614] Duke, M. and Braden, R. and Eddy, W. and Blanton, E., A Roadmap for Transmission Control Protocol (TCP) Specification Documents, RFC 4614, Oct. 2006
[RFC4648] Josefsson, S., The Base16, Base32, and Base64 Data Encodings, RFC 4648, Oct. 2006
[RFC4822] Atkinson, R. and Fanto, M., RIPv2 Cryptographic Authentication, RFC 4822, Feb. 2007
[RFC4838] Cerf, V. and Burleigh, S. and Hooke, A. and Torgerson, L. and Durst, R. and Scott, K. and Fall, K.
and Weiss, H., Delay-Tolerant Networking Architecture, RFC 4838, April 2007
[RFC4861] Narten, T. and Nordmark, E. and Simpson, W. and Soliman, H., Neighbor Discovery for IP version 6
(IPv6), RFC 4861, Sept. 2007
[RFC4862] Thomson, S. and Narten, T. and Jinmei, T., IPv6 Stateless Address Autoconfiguration, RFC 4862,
Sept. 2007
[RFC4870] Delany, M., Domain-Based Email Authentication Using Public Keys Advertised in the DNS (DomainKeys), RFC 4870, May 2007
[RFC4871] Allman, E. and Callas, J. and Delany, M. and Libbey, M. and Fenton, J. and Thomas, M., DomainKeys
Identified Mail (DKIM) Signatures, RFC 4871, May 2007
[RFC4941] Narten, T. and Draves, R. and Krishnan, S., Privacy Extensions for Stateless Address Autoconfiguration in IPv6, RFC 4941, Sept. 2007
[RFC4944] Montenegro, G. and Kushalnagar, N. and Hui, J. and Culler, D., Transmission of IPv6 Packets over
IEEE 802.15.4 Networks, RFC 4944, Sept. 2007
[RFC4952] Klensin, J. and Ko, Y., Overview and Framework for Internationalized Email, RFC 4952, July 2007
[RFC4953] Touch, J., Defending TCP Against Spoofing Attacks, RFC 4953, July 2007
[RFC4954] Siemborski, R. and Melnikov, A., SMTP Service Extension for Authentication, RFC 4954, July 2007
[RFC4963] Heffner, J. and Mathis, M. and Chandler, B., IPv4 Reassembly Errors at High Data Rates, RFC 4963,
July 2007
[RFC4966] Aoun, C. and Davies, E., Reasons to Move the Network Address Translator - Protocol Translator
(NAT-PT) to Historic Status, RFC 4966, July 2007
[RFC4987] Eddy, W., TCP SYN Flooding Attacks and Common Mitigations, RFC 4987, Aug. 2007
[RFC5004] Chen, E. and Sangli, S., Avoid BGP Best Path Transitions from One External to Another, RFC 5004,
Sept. 2007
[RFC5065] Traina, P. and McPherson, D. and Scudder, J., Autonomous System Confederations for BGP, RFC
5065, Aug. 2007
[RFC5068] Hutzler, C. and Crocker, D. and Resnick, P. and Allman, E. and Finch, T., Email Submission Operations: Access and Accountability Requirements, RFC 5068, Nov. 2007
[RFC5072] Varada, S. and Haskins, D. and Allen, E., IP Version 6 over PPP, RFC 5072, Sept. 2007
[RFC5095] Abley, J. and Savola, P. and Neville-Neil, G., Deprecation of Type 0 Routing Headers in IPv6, RFC
5095, Dec. 2007
[RFC5227] Cheshire, S., IPv4 Address Conflict Detection, RFC 5227, July 2008
[RFC5234] Crocker, D. and Overell, P., Augmented BNF for Syntax Specifications: ABNF, RFC 5234, Jan. 2008
[RFC5321] Klensin, J., Simple Mail Transfer Protocol, RFC 5321, Oct. 2008
[RFC5322] Resnick, P., Internet Message Format, RFC 5322, Oct. 2008
[RFC5340] Coltun, R. and Ferguson, D. and Moy, J. and Lindem, A., OSPF for IPv6, RFC 5340, July 2008
[RFC5598] Crocker, D., Internet Mail Architecture, RFC 5598, July 2009
[RFC5646] Phillips, A. and Davis, M., Tags for Identifying Languages, RFC 5646, Sept. 2009
[RFC5681] Allman, M. and Paxson, V. and Blanton, E., TCP congestion control, RFC 5681, Sept. 2009
[RFC5735] Cotton, M. and Vegoda, L., Special Use IPv4 Addresses, RFC 5735, January 2010
[RFC5795] Sandlund, K. and Pelletier, G. and Jonsson, L-E., The RObust Header Compression (ROHC) Framework, RFC 5795, March 2010
[RFC6077] Papadimitriou, D. and Welzl, M. and Scharf, M. and Briscoe, B., Open Research Issues in Internet
Congestion Control, RFC 6077, February 2011
[RFC6068] Duerst, M., Masinter, L. and Zawinski, J., The mailto URI Scheme, RFC 6068, October 2010
[RFC6144] Baker, F. and Li, X. and Bao, X. and Yin, K., Framework for IPv4/IPv6 Translation, RFC 6144, April
2011
[RFC6265] Barth, A., HTTP State Management Mechanism, RFC 6265, April 2011
[RFC6274] Gont, F., Security Assessment of the Internet Protocol Version 4, RFC 6274, July 2011
[RG2010] Rhodes, B. and Goerzen, J., Foundations of Python Network Programming: The Comprehensive Guide
to Building Network Applications with Python, Second Edition, Apress, 2010
[RJ1995] Ramakrishnan, K. K. and Jain, R., A binary feedback scheme for congestion avoidance in computer
networks with a connectionless network layer. SIGCOMM Comput. Commun. Rev. 25, 1 (Jan. 1995), 138-156.
[RIB2013] Raiciu, C., Iyengar, J., Bonaventure, O., Recent Advances in Reliable Transport Protocols, in H.
Haddadi, O. Bonaventure (Eds.), Recent Advances in Networking, (2013), pp. 59-106.
[RY1994] Ramakrishnan, K.K. and Henry Yang, The Ethernet Capture Effect: Analysis and Solution, Proceedings of IEEE 19th Conference on Local Computer Networks, MN, Oct. 1994.
[Roberts1975] Roberts, L., ALOHA packet system with and without slots and capture. SIGCOMM Comput.
Commun. Rev. 5, 2 (Apr. 1975), 28-42.
[Ross1989] Ross, F., An overview of FDDI: The fiber distributed data interface, IEEE J. Selected Areas in Comm.,
vol. 7, no. 7, pp. 1043-1051, Sept. 1989
[Russel06] Russell A., Rough Consensus and Running Code and the Internet-OSI Standards War, IEEE Annals
of the History of Computing, July-September 2006
[SAO1990] Sidhu, G., Andrews, R., Oppenheimer, A., Inside AppleTalk, Addison-Wesley, 1990
[SARK2002] Subramanian, L., Agarwal, S., Rexford, J., Katz, R., Characterizing the Internet hierarchy from
multiple vantage points. In IEEE INFOCOM, 2002
[Sechrest] Sechrest, S., An Introductory 4.4BSD Interprocess Communication Tutorial, 4.4BSD Programmers
Supplementary Documentation
[SG1990] Scheifler, R., Gettys, J., X Window System: The Complete Reference to Xlib, X Protocol, ICCCM,
XLFD, X Version 11, Release 4, Digital Press
[SGP98] Stone, J., Greenwald, M., Partridge, C., and Hughes, J., Performance of checksums and CRCs over real
data. IEEE/ACM Trans. Netw. 6, 5 (Oct. 1998), 529-543.
[SH1980] Shoch, J. F. and Hupp, J. A., Measured performance of an Ethernet local network. Commun. ACM 23,
12 (Dec. 1980), 711-721.
[SH2004] Senapathi, S., Hernandez, R., Introduction to TCP Offload Engines, March 2004
[SMKKB2001] Stoica, I., Morris, R., Karger, D., Kaashoek, F., and Balakrishnan, H., Chord: A scalable peer-to-peer lookup service for internet applications. In Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications (SIGCOMM 01). ACM, New York, NY,
USA, 149-160
[SMM1998] Semke, J., Mahdavi, J., and Mathis, M., Automatic TCP buffer tuning. SIGCOMM Comput. Commun. Rev. 28, 4 (Oct. 1998), 315-323.
[SPMR09] Stigge, M., Plotz, H., Muller, W., Redlich, J., Reversing CRC - Theory and Practice. Berlin: Humboldt
University Berlin. pp. 24.
[STBT2009] Sridharan, M., Tan, K., Bansal, D., Thaler, D., Compound TCP: A New TCP Congestion Control
for High-Speed and Long Distance Networks, Internet draft, work in progress, April 2009
[STD2013] Stewart, R., Tuexen, M., Dong, X., ECN for Stream Control Transmission Protocol (SCTP), Internet
draft, draft-stewart-tsvwg-sctpecn-04, April 2013, work in progress
[Seifert2008] Seifert, R., Edwards, J., The All-New Switch Book : The complete guide to LAN switching technology, Wiley, 2008
[Selinger] Selinger, P., MD5 collision demo, http://www.mscs.dal.ca/~selinger/md5collision/
[SFR2004] Stevens, R., Fenner, B., and Rudoff, A., UNIX Network Programming: The sockets networking API,
Addison Wesley, 2004
[Sklower89] Sklower, K. 1989. Improving the efficiency of the OSI checksum calculation. SIGCOMM Comput.
Commun. Rev. 19, 5 (Oct. 1989), 32-43.
[SMASU2012] Sarrar, N., Maier, G., Ager, B., Sommer, R. and Uhlig, S., Investigating IPv6 traffic, Passive and
Active Measurements, Lecture Notes in Computer Science vol 7192, 2012, pp.11-20
[SMM98] Semke, J., Mahdavi, J., and Mathis, M., Automatic TCP buffer tuning. SIGCOMM Comput. Commun.
Rev. 28, 4 (Oct. 1998), 315-323.
[Stevens1994] Stevens, R., TCP/IP Illustrated : the Protocols, Addison-Wesley, 1994
[Stevens1998] Stevens, R., UNIX Network Programming, Volume 1, Second Edition: Networking APIs: Sockets
and XTI, Prentice Hall, 1998
[Stewart1998] Stewart, J., BGP4: Inter-Domain Routing In The Internet, Addison-Wesley, 1998
[Stoll1988] Stoll, C., Stalking the wily hacker, Commun. ACM 31, 5 (May 1988), 484-497.
[SV1995] Shreedhar, M. and Varghese, G., Efficient fair queueing using deficit round robin. SIGCOMM Comput.
Commun. Rev. 25, 4 (October 1995), 231-242.
[TE1993] Tsuchiya, P. F. and Eng, T., Extending the IP internet through address reuse. SIGCOMM Comput.
Commun. Rev. 23, 1 (Jan. 1993), 16-33.
[Thomborson1992] Thomborson, C., The V.42bis Standard for Data-Compressing Modems, IEEE Micro,
September/October 1992 (vol. 12 no. 5), pp. 41-53
[Unicode] The Unicode Consortium. The Unicode Standard, Version 5.0.0, defined by: The Unicode Standard,
Version 5.0 (Boston, MA, Addison-Wesley, 2007)
[VPD2004] Vasseur, J., Pickavet, M., and Demeester, P., Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS. Morgan Kaufmann Publishers Inc., 2004
[Varghese2005] Varghese, G., Network Algorithmics: An Interdisciplinary Approach to Designing Fast Networked Devices, Morgan Kaufmann, 2005
[Vyncke2007] Vyncke, E., Paggen, C., LAN Switch Security: What Hackers Know About Your Switches, Cisco
Press, 2007
[WB2008] Wasserman, M., Baker, F., IPv6-to-IPv6 Network Address Translation (NAT66), Internet draft, November 2008, http://tools.ietf.org/html/draft-mrw-behave-nat66-02
[WMH2008] Wilson, P., Michaelson, G., Huston, G., Redesignation of 240/4 from Future Use to Private
Use, Internet draft, September 2008, work in progress, http://tools.ietf.org/html/draft-wilson-class-e-02
[WMS2004] White, R., McPherson, D., Srihari, S., Practical BGP, Addison-Wesley, 2004
[Watson1981] Watson, R., Timer-Based Mechanisms in Reliable Transport Protocol Connection Management.
Computer Networks 5: 47-56 (1981)
[Williams1993] Williams, R. A painless guide to CRC error detection algorithms, August 1993, unpublished
manuscript, http://www.ross.net/crc/download/crc_v3.txt
[Winston2003] Winston, G., NetBIOS Specification, 2003
[WY2011] Wing, D. and Yourtchenko, A., Happy Eyeballs: Success with Dual-Stack Hosts, Internet draft, work
in progress, July 2011, http://tools.ietf.org/html/draft-ietf-v6ops-happy-eyeballs-03
[X200] ITU-T, recommendation X.200, Open Systems Interconnection - Model and Notation, 1994
[X224] ITU-T, recommendation X.224, Information technology - Open Systems Interconnection - Protocol for
providing the connection-mode transport service, 1995
[XNS] Xerox, Xerox Network Systems Architecture, XNSG058504, 1985
[Zimmermann80] Zimmermann, H., OSI Reference Model - The ISO Model of Architecture for Open Systems
Interconnection, IEEE Transactions on Communications, vol. 28, no. 4, April 1980, pp. 425 - 432.
Index
Symbols
::, 173
::1, 173
100BaseTX, 215
10Base2, 215
10Base5, 214
10BaseT, 215
802.11 frame format, 225
802.5 data frame, 96
802.5 token frame, 95
B
Base64 encoding, 120
Basic Service Set (BSS), 224
beacon frame (802.11), 227
BGP, 199, 231
BGP Adj-RIB-In, 202
BGP Adj-RIB-Out, 202
BGP decision process, 205
BGP KEEPALIVE, 201
BGP local-preference, 205
BGP nexthop, 204
BGP NOTIFICATION, 201
BGP OPEN, 201
BGP peer, 200
D
data plane, 29
Datalink layer, 9, 106
delayed acknowledgements, 151
E
EAP, 211
eBGP, 232
EGP, 232
EIFS, 91
EIGRP, 232
electrical cable, 5
email message format, 116
Ending Delimiter (Token Ring), 95
Ethernet bridge, 216
Ethernet DIX frame format, 212
Ethernet hub, 215
Ethernet switch, 216
Ethernet Type field, 212
EtherType, 212
exponential backoff, 151
export policy, 199
Extended Inter Frame Space, 91
Extensible Authentication Protocol, 211
F
Fairness, 98
Fast Ethernet, 215
FDM, 83
FECN, 81
Five layers reference model, 106
Forward Explicit Congestion Notification, 81
forwarding loop, 28
forwarding table, 28
frame, 9, 106, 232
Frame-Relay, 232
framing, 9
Frequency Division Multiplexing, 83
FTP, 232
ftp, 232
G
go-back-n, 19
graceful connection release, 59, 70
H
head-of-line blocking, 156
Hello message, 48
hidden station problem, 93
hop-by-hop forwarding, 28
hosts.txt, 71, 232
HTML, 232
HTTP, 232
hub, 232
I
IANA, 232
iBGP, 232
ICANN, 232
IETF, 232
IGP, 232
IGRP, 232
IMAP, 232
import policy, 199
independent network, 224
infrastructure network, 224
interdomain routing policy, 199
Internet, 232
internet, 232
inverse query, 232
IP, 232
IPv4, 233
IPv4 fragmentation and reassembly, 178
IPv6, 233
IPv6 fragmentation, 177
IPv6 Renumbering, 189
IS-IS, 233
ISN, 233
ISO, 233
ISO-3166, 233
ISP, 233
ITU, 233
IXP, 233
J
jamming, 87
jumbogram, 177
L
label switching, 40
LAN, 233
large window, 148
leased line, 233
Link Local address, 173
link-local IPv6 address, 186
link-state routing, 47
LLC, 214
Logical Link Control (LLC), 214
N
Nagle algorithm, 147
nameserver, 233
naming, 70
NAT, 233
NBMA, 43, 168, 233
NDP, 187
Neighbor Discovery Protocol, 187
Neighbor Solicitation message, 187
Neighbour Discovery Protocol, 186
network congestion, 78
Network Information Center, 71
Network layer, 107
network-byte order, 234
NFS, 234
Non-Broadcast Multi-Access Networks, 43, 168
non-persistent CSMA, 86
NTP, 234
O
Open Shortest Path First, 193
optical fiber, 5
Organisation Unique Identifier, 212
OSI, 234
OSI reference model, 108
OSPF, 193, 234
OSPF area, 193
OSPF Designated Router, 195
OUI, 212
P
packet, 107, 234
packet discard mechanism, 80
packet radio, 84
packet size distribution, 148
Path MTU discovery, 184
PBL, 234
peer-to-peer, 55
persistence timer, 25, 68
persistent CSMA, 86
Physical layer, 8, 106
physical layer, 106
piggybacking, 25
ping6, 184
Point-to-Point Protocol, 210
POP, 234
port-address table, 30
portmapper, 137
Post Office Protocol, 124
PPP, 210
Provider Aggregatable address, 171
Provider Independent address, 171
R
record route, 36
Reference models, 105
reliable connectionless service, 55
Remote Procedure Call, 61
Request To Send, 93
request-response service, 60
resolver, 234
RFC
RFC 1032, 72, 243
RFC 1035, 72, 113, 114, 231, 243
RFC 1042, 227, 243
RFC 1055, 210, 243
RFC 1071, 139, 243
RFC 1094, 234
RFC 1122, 108, 140, 142, 146, 152, 212, 243
RFC 1144, 210, 243
RFC 1149, 70, 243
RFC 1169, 243
RFC 1191, 243
RFC 1195, 193, 243
RFC 1258, 142, 243
RFC 1305, 234
RFC 1321, 243
RFC 1323, 146, 148-150, 154, 244
RFC 1347, 169, 170, 244
RFC 1518, 231, 244
RFC 1519, 172, 244
RFC 1542, 244
RFC 1548, 210, 244
RFC 1550, 169, 244
RFC 1561, 169, 244
RFC 1621, 169, 244
RFC 1624, 244
RFC 1631, 169, 244
RFC 1661, 26, 168, 244
RFC 1662, 210, 244
RFC 1710, 169, 170, 175, 244
RFC 1738, 127, 133, 244
RFC 1752, 169, 244
RFC 1812, 101, 244
RFC 1819, 169, 244
RFC 1831, 134, 137
RFC 1832, 134, 135
RFC 1833, 137
RFC 1889, 244
RFC 1896, 119, 244
RFC 1918, 173, 244
RFC 1939, 124, 125, 234, 244
RFC 1945, 129, 130, 244
RFC 1948, 143, 244
S
scheduler, 81
scheduling algorithm, 81
SCTP, 156
SCTP chunk, 157
SCTP common header, 157
SCTP CWR chunk, 165
SCTP data chunk, 159
SCTP ECN Echo chunk, 165
T
TCB, 234
TCP, 139, 234
TCP Connection establishment, 141
TCP connection release, 154
TCP fast retransmit, 152
TCP header, 140
TCP Initial Sequence Number, 142
TCP MSS, 145
TCP Options, 145
TCP RST, 143
TCP SACK, 153
TCP selective acknowledgements, 153
TCP self clocking, 97
TCP SYN, 141
TCP SYN+ACK, 141
TCP/IP, 235
TCP/IP reference model, 108
telnet, 235
Tier-1 ISP, 209
Time Division Multiplexing, 84
time-sequence diagram, 6
TLD, 235
TLS, 235
Token Holding Time, 96
Token Ring data frame, 96
Token Ring Monitor, 95
Token Ring token frame, 95
traceroute6, 184
transit domain, 197
Transmission Control Block, 147
Transmission Sequence Number, 159
transport clock, 63
Transport layer, 107
two-way connectivity, 51
U
UDP, 137, 235
UDP Checksum, 139
UDP segment, 138
unicast, 235
Unique Local Unicast IPv6, 173
unreliable connectionless service, 55
V
Virtual LAN, 222
VLAN, 222
vnc, 235
W
W3C, 235
WAN, 235
Wavelength Division Multiplexing, 84
WDM, 84
WiFi, 223
X
X.25, 235
X11, 235
XDR, 134
XML, 235