ASCII

ASCII is a fundamental character encoding standard used in data and digital communication. It stands for American Standard Code for Information Interchange. Its primary purpose is to create a universal language for computers by assigning a unique numerical value to each character, such as letters, numbers, and symbols. This standardization ensures that different computing devices can consistently interpret and exchange text-based information.

How ASCII Works

At its core, ASCII uses a 7-bit binary code to represent 128 different characters (2^7 = 128). This set includes:

95 printable characters: This group covers all uppercase and lowercase English letters (A-Z, a-z), digits (0-9), punctuation marks, and mathematical symbols. For example, the uppercase letter 'A' is represented by the decimal value 65, which is 1000001 in binary. The digit '1' is represented by the decimal value 49, or 0110001 in binary.

33 non-printable control characters: These characters aren't meant to be displayed. Instead, they serve as commands for devices. For instance, the "carriage return" (CR) command tells a printer to move the print head back to the beginning of the line, and the "line feed" (LF) command tells it to advance the paper to the next line.

When you type a letter on a keyboard, the computer translates that character into its corresponding ASCII code (a number). This number is then converted into a binary sequence of 0s and 1s, which is the native language of the computer. When this data is sent to another device, like a monitor or printer, that device reads the binary sequence, converts it back to its ASCII number, and displays or prints the correct character.

ASCII in Digital Communication

ASCII became the backbone of early digital communication protocols because of its simplicity and efficiency. Its standardized nature allowed for interoperability, meaning computers from different manufacturers could "talk" to each other without misinterpreting the data. This was crucial for the development of early text-based applications like email and the command-line interface.

Since it uses only 7 bits, an ASCII character fits easily into a standard 8-bit byte, with the eighth bit often used for error checking (a "parity bit") or to create an "Extended ASCII" table. The latter expanded the character set to 256 characters to include symbols for other languages and special characters, though this led to compatibility issues as different companies created their own extended versions.

ASCII vs. Unicode

While ASCII was revolutionary, its limitation to primarily English characters became a major problem as digital communication went global. It simply couldn't represent the characters and symbols of languages like Chinese, Arabic, or Russian.

This is where Unicode came in. Unicode is a modern, universal character encoding standard that can represent virtually every character in every language. It assigns a unique number, called a "code point," to each character. The first 128 Unicode code points are identical to the ASCII set, making Unicode backward-compatible with ASCII. Unicode's most popular encoding scheme, UTF-8, is now the dominant character encoding on the internet, but the foundational principles established by ASCII remain a vital part of computing history and are still used in specific low-level applications and protocols.

UTF-8

UTF-8 is a variable-length character encoding standard that is the dominant encoding on the web and in digital communication. It's a key part of Unicode, a universal standard that assigns a unique number to virtually every character in every language. UTF-8 provides a way to represent these Unicode characters in a form that is backward-compatible with older systems and efficient for a wide range of languages.

How UTF-8 Works

The "8" in UTF-8 stands for 8-bit, referring to the smallest unit of data it uses. However, unlike older single-byte encodings like ASCII, UTF-8 can use between one and four bytes to represent a character. This variable-length design is what makes it so versatile and efficient.

Single-byte characters: For the first 128 characters, which correspond to the entire ASCII character set (English letters, numbers, and basic punctuation), UTF-8 uses just a single byte. The first bit is always a 0, and the remaining seven bits carry the character's code, making it fully backward-compatible with ASCII. This is why a simple text file containing only English characters is identical whether it's saved as ASCII or UTF-8.

Multi-byte characters: For all other characters in the Unicode standard, such as those from other languages (like Cyrillic, Chinese, or Arabic) or special symbols (emojis, mathematical symbols), UTF-8 uses two, three, or four bytes. The first byte of a multi-byte character always starts with a sequence of 1s followed by a 0, indicating the total number of bytes in the sequence. For example, a two-byte character starts with 110, a three-byte character with 1110, and a four-byte character with 11110. The subsequent bytes of the sequence always start with 10.

This design has several advantages:

Efficiency: For text that is predominantly English, UTF-8 is just as compact as ASCII because it uses only a single byte per character. This saves storage space and bandwidth.

Universality: It can encode every character in the Unicode standard, allowing for global communication and data exchange without the limitations of older, language-specific encodings.

Self-synchronizing: The structure of the leading bytes (the 110, 1110, etc., and the subsequent 10s) allows a program to easily find the beginning of the next character, even if a byte is corrupted or a connection is lost.

UTF-8 in Data and Digital Communication

UTF-8 is now the de facto standard for character encoding on the internet. Web browsers, email clients, and operating systems rely on it to correctly display text from all over the world. When you send a text message with an emoji or visit a website with content in multiple languages, UTF-8 is the underlying technology that ensures the characters are displayed correctly.
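The byte patterns described above are easy to observe with Python's built-in str.encode; the sample characters below are illustrative, chosen so that each needs a different number of UTF-8 bytes:

```python
# ASCII assigns each character a number: 'A' is 65 (1000001 in 7-bit binary).
print(ord("A"), format(ord("A"), "07b"))  # prints: 65 1000001

# UTF-8 byte patterns for characters needing one to four bytes.
# Note the leading bits: 0... (1 byte), 110... (2 bytes),
# 1110... (3 bytes), 11110... (4 bytes); continuation bytes start with 10.
for ch in ["A", "é", "€", "😀"]:
    data = ch.encode("utf-8")
    bits = " ".join(f"{b:08b}" for b in data)
    print(f"{ch!r}: {len(data)} byte(s): {bits}")
# 'A':  1 byte(s): 01000001
# 'é':  2 byte(s): 11000011 10101001
# '€':  3 byte(s): 11100010 10000010 10101100
# '😀': 4 byte(s): 11110000 10011111 10011000 10000000
```

The 'A' line is identical to its ASCII encoding, which is the backward compatibility discussed above; every continuation byte begins with 10, which is what makes the stream self-synchronizing.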
Its flexibility and comprehensive character support make it an essential component of modern data and digital communication, replacing the fragmented and limited character sets of the past with a single, universal standard.

UTF-16

UTF-16, or 16-bit Unicode Transformation Format, is a variable-length character encoding standard for Unicode. While UTF-8 is the dominant encoding on the web, UTF-16 is widely used internally by many operating systems and programming languages, such as the Windows API and Java.

How UTF-16 Works

UTF-16's core principle is that it uses 16-bit code units (2-byte blocks) to represent characters. It's a variable-length encoding because it can use either one or two of these code units to represent a single Unicode character.

Single-unit characters: For the most common characters, those in the first 65,536 code points of Unicode (known as the Basic Multilingual Plane, or BMP), UTF-16 uses a single 16-bit code unit. This group includes Latin, Greek, Cyrillic, and most East Asian characters. For example, the character A (Unicode code point U+0041) is represented simply as the 16-bit value 0x0041.

Surrogate pairs: For less common characters, such as emojis and rare CJK (Chinese, Japanese, Korean) characters outside the BMP, UTF-16 uses a special mechanism called a surrogate pair. A surrogate pair is a sequence of two 16-bit code units: the first is a "high surrogate" and the second a "low surrogate." Together, the two units encode a single code point outside the BMP. This system allows UTF-16 to represent all of the more than one million possible Unicode code points.

UTF-16 in Digital Communication and Data

While UTF-8 is more space-efficient for text primarily in English, UTF-16 can be more compact for languages that use many BMP characters, such as many East Asian languages. This is because a single character in these languages is represented by a two-byte code unit in UTF-16 but can require three bytes in UTF-8.

A notable difference from UTF-8 is that UTF-16 is not backward-compatible with ASCII. An ASCII character in UTF-16 takes two bytes, one of them all zeros, whereas in UTF-8 it's a single byte identical to the ASCII value. This makes UTF-8 the preferred choice for web-based communication, where ASCII compatibility and efficiency for English text are crucial.

Due to its fixed 16-bit base, UTF-16 is simpler to process for many applications that don't need to deal with a variable number of bytes for the most common characters. However, it also introduces challenges with endianness, which refers to the byte order of the 16-bit units. To address this, UTF-16 files often include a Byte Order Mark (BOM) at the beginning to signal how the bytes are arranged.
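These behaviors can be checked with Python's standard codecs; the sample characters below are illustrative (a BMP Latin character, a BMP East Asian character, and an emoji outside the BMP):

```python
# UTF-16 big-endian (no BOM): each character is one or two 16-bit code units.
for ch in ["A", "東", "😀"]:
    data = ch.encode("utf-16-be")
    units = [data[i:i + 2].hex() for i in range(0, len(data), 2)]
    print(f"{ch!r}: {len(units)} code unit(s): {units}")
# 'A':  1 code unit(s): ['0041']           (note the all-zero byte)
# '東': 1 code unit(s): ['6771']           (2 bytes here, 3 bytes in UTF-8)
# '😀': 2 code unit(s): ['d83d', 'de00']   (high surrogate, low surrogate)

# The generic "utf-16" codec prepends a Byte Order Mark (BOM) so a
# reader can detect whether the 16-bit units are little- or big-endian.
print("A".encode("utf-16")[:2].hex())  # fffe (little-endian) or feff
```

The '東' line shows the size trade-off described above, and the emoji's high surrogate falls in the D800-DBFF range while the low surrogate falls in DC00-DFFF, which is how a decoder recognizes a pair.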