HTML Encoding (Character Sets)
From ASCII to UTF-8
ASCII was one of the first character encoding standards. ASCII defines 128
characters that can be used on the internet: numbers (0-9), English letters
(A-Z and a-z), and some special characters like ! $ + - ( ) @ < > .
ISO-8859-1 was the default character set for HTML 4. This character set
supported 256 different character codes. HTML 4 also supported UTF-8.
ANSI (Windows-1252) was the original Windows character set. ANSI is
identical to ISO-8859-1, except that ANSI uses the 32 values from 128 to 159
for extra printable characters (a few of these values are left unassigned).
The HTML5 specification encourages web developers to use the UTF-8
character set, which covers almost all of the characters and symbols in the
world!
The HTML charset Attribute
To display an HTML page correctly, a web browser must know the character
set used in the page.
This is specified in the <meta> tag:
<meta charset="UTF-8">
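To see why the browser needs this information, the following Python sketch decodes the same bytes with two different character sets. The sample text is an arbitrary assumption chosen for illustration.

```python
# Bytes a server might send for a page saved as UTF-8.
page_bytes = "Café €5".encode("utf-8")

# Decoded with the declared character set, the text comes back intact.
as_utf8 = page_bytes.decode("utf-8")

# Decoded as Windows-1252 (a wrong guess), every multi-byte UTF-8
# sequence turns into several unrelated characters.
as_cp1252 = page_bytes.decode("cp1252")

print(as_utf8)
print(as_cp1252)
```

The second line printed is the kind of garbled text a reader sees when the character set is missing or wrong.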
The ASCII Character Set
ASCII uses the values from 0 to 31 (and 127) for control characters.
ASCII uses the values from 32 to 126 for letters, digits, and symbols.
ASCII does not use the values from 128 to 255.
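The three ranges above can be expressed as a small Python sketch. The function name classify_ascii is hypothetical and only serves this illustration.

```python
def classify_ascii(value: int) -> str:
    """Classify a code value using the ASCII ranges described above."""
    if 0 <= value <= 31 or value == 127:
        return "control"
    if 32 <= value <= 126:
        return "printable"
    return "not ASCII"

for value in (9, 32, 65, 126, 127, 200):
    print(value, classify_ascii(value))
```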
The ANSI Character Set (Windows-1252)
ANSI is identical to ASCII for the values from 0 to 127.
ANSI has a proprietary set of characters for the values from 128 to 159.
ANSI is identical to Unicode (the character set that UTF-8 encodes) for the
values from 160 to 255.
The ISO-8859-1 Character Set
ISO-8859-1 is identical to ASCII for the values from 0 to 127.
ISO-8859-1 does not use the values from 128 to 159.
ISO-8859-1 is identical to Unicode (encoded by UTF-8) for the values from 160
to 255.
The UTF-8 Character Set
UTF-8 is identical to ASCII for the values from 0 to 127.
Unicode, the character set that UTF-8 encodes, reserves the values from 128
to 159 for control characters.
Unicode is identical to both ANSI and 8859-1 for the values from 160 to 255,
although UTF-8 stores each of these values as two bytes.
Unicode continues from the value 256 with well over 100 000 additional
characters.
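The relationship between the four character sets can be checked with a short Python sketch. It assumes Python's codec names "latin-1" for ISO-8859-1 and "cp1252" for Windows-1252; the helper encoded_bytes and the sample characters are hypothetical choices for this illustration.

```python
def encoded_bytes(char: str, encoding: str) -> str:
    """Return the bytes of `char` in `encoding` as hex, or '-' if the
    character does not exist in that character set."""
    try:
        return char.encode(encoding).hex(" ").upper()
    except UnicodeEncodeError:
        return "-"

ENCODINGS = ("ascii", "latin-1", "cp1252", "utf-8")  # latin-1 = ISO-8859-1

for char in ("A", "é", "€", "Ω"):
    print(char, [encoded_bytes(char, enc) for enc in ENCODINGS])

# Values 160 to 255 name the same characters in ISO-8859-1,
# Windows-1252, and Unicode.
same_upper_range = all(
    bytes([v]).decode("latin-1") == bytes([v]).decode("cp1252") == chr(v)
    for v in range(160, 256)
)
print(same_upper_range)
```

The letter A has the same single byte everywhere, while the euro sign exists only in Windows-1252 (one byte) and UTF-8 (three bytes).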
HTML Uniform Resource Locators
A URL is another word for a web address.
A URL can be composed of words (e.g. w3schools.com), or an Internet
Protocol (IP) address (e.g. 192.68.20.50).
Most people enter the name when surfing, because names are easier to
remember than numbers.
URL - Uniform Resource Locator
Web browsers request pages from web servers by using a URL.
A Uniform Resource Locator (URL) is used to address a document (or other
data) on the web.
A web address like https://www.w3schools.com/html/default.asp follows
these syntax rules:
scheme://prefix.domain:port/path/filename
Explanation:
• scheme - defines the type of Internet service (most common are http
and https)
• prefix - defines a domain prefix (www is the most common)
• domain - defines the Internet domain name (like w3schools.com)
• port - defines the port number at the host (default is 80 for http and
443 for https)
• path - defines a path at the server (If omitted: the root directory of
the site)
• filename - defines the name of a document or resource
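As a rough illustration of these syntax rules, the Python sketch below splits a URL into its parts with the standard urllib.parse module. The function describe_url and the DEFAULT_PORTS table are hypothetical helpers written for this example; the parser returns prefix and domain together as one host name.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed default ports for this sketch; not an exhaustive registry.
DEFAULT_PORTS = {"http": 80, "https": 443}

def describe_url(url: str) -> dict:
    """Split a URL into the parts named in the syntax rule above."""
    parts = urlsplit(url)
    directory, filename = posixpath.split(parts.path)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,  # prefix and domain together
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "path": directory or "/",  # root directory if omitted
        "filename": filename,
    }

print(describe_url("https://www.w3schools.com/html/default.asp"))
```

For the address above, the sketch reports the scheme https, the host www.w3schools.com, the default port 443, the path /html, and the filename default.asp.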
Common URL Schemes
The table below lists some common schemes:
Scheme   Short for                            Used for
http     HyperText Transfer Protocol          Common web pages. Not encrypted
https    HyperText Transfer Protocol Secure   Secure web pages. Encrypted
ftp      File Transfer Protocol               Downloading or uploading files
file                                          A file on your computer
URL Encoding
URLs can only be sent over the Internet using the ASCII character-set. If a
URL contains characters outside the ASCII set, the URL has to be converted.
URL encoding converts non-ASCII characters into a format that can be
transmitted over the Internet.
URL encoding replaces each byte of a non-ASCII character with a "%" followed
by two hexadecimal digits.
URLs cannot contain spaces. URL encoding replaces a space with %20, or with
a plus (+) sign in submitted form data.
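A minimal Python sketch of both space conventions, using the standard urllib.parse module. The sample text is an arbitrary assumption, and UTF-8 is the character set the functions use by default.

```python
from urllib.parse import quote, quote_plus, unquote

text = "Hello Günter"

# Percent-encode using UTF-8; a space becomes %20.
path_style = quote(text)

# Form-style encoding; a space becomes a plus sign.
form_style = quote_plus(text)

print(path_style)           # Hello%20G%C3%BCnter
print(form_style)           # Hello+G%C3%BCnter
print(unquote(path_style))  # Hello Günter
```

The letter ü lies outside ASCII, so its two UTF-8 bytes appear as %C3%BC in both results.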
ASCII Encoding Examples
Your browser will encode input, according to the character-set used in your
page.
The default character-set in HTML5 is UTF-8.
Character   From Windows-1252   From UTF-8
€           %80                 %E2%82%AC
£           %A3                 %C2%A3
©           %A9                 %C2%A9
®           %AE                 %C2%AE
À           %C0                 %C3%80
Á           %C1                 %C3%81
Â           %C2                 %C3%82
Ã           %C3                 %C3%83
Ä           %C4                 %C3%84
Å           %C5                 %C3%85
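The values in this table can be reproduced with a short Python sketch. It assumes Python's codec name "cp1252" for Windows-1252; the helper percent_encode is a hypothetical name used only here.

```python
from urllib.parse import quote

def percent_encode(char: str, charset: str) -> str:
    """Percent-encode one character using the given character set."""
    return quote(char, safe="", encoding=charset)

for char in "€£©®ÀÁÂÃÄÅ":
    print(char, percent_encode(char, "cp1252"), percent_encode(char, "utf-8"))
```

Each row shows one byte for Windows-1252 and two or three bytes for UTF-8, which is why the same form input can arrive at the server in different encoded forms.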
HTML Versus XHTML
XHTML is a stricter, more XML-based version of HTML.
What is XHTML?
• XHTML stands for EXtensible HyperText Markup Language
• XHTML is a stricter, more XML-based version of HTML
• XHTML is HTML defined as an XML application
• XHTML is supported by all major browsers
Why XHTML?
XML is a markup language where all documents must be marked up correctly
(be "well-formed").
XHTML was developed to make HTML more extensible and easier to combine with
other data formats (such as XML). In addition, browsers ignore errors in
HTML pages and try to display the page even if the markup is faulty. XHTML
therefore comes with much stricter error handling.
The Most Important Differences from HTML
• <!DOCTYPE> is mandatory
• The xmlns attribute in <html> is mandatory
• <html>, <head>, <title>, and <body> are mandatory
• Elements must always be properly nested
• Elements must always be closed
• Element names must always be in lowercase
• Attribute names must always be in lowercase
• Attribute values must always be quoted
• Attribute minimization is forbidden
XHTML - <!DOCTYPE ....> Is Mandatory
An XHTML document must have an XHTML <!DOCTYPE> declaration.
The <html>, <head>, <title>, and <body> elements must also be present,
and the xmlns attribute in <html> must specify the xml namespace for the
document.
Example
Here is an XHTML document with a minimum of required tags:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Title of document</title>
</head>
<body>
some content here...
</body>
</html>
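The required parts of such a document can be checked with a rough Python sketch built on the standard XML parser. The function missing_required_parts is a hypothetical helper; it only tests for the presence of the parts listed above and does not validate the document against the DTD.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

DOCUMENT = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Title of document</title>
</head>
<body>
some content here...
</body>
</html>"""

def missing_required_parts(source: str) -> list:
    """Return the names of required XHTML parts that are absent."""
    problems = []
    if not source.lstrip().startswith("<!DOCTYPE"):
        problems.append("<!DOCTYPE>")
    root = ET.fromstring(source)  # raises ParseError if not well-formed
    if root.tag != "{" + XHTML_NS + "}html":
        problems.append("<html> with xmlns")
    ns = {"x": XHTML_NS}
    checks = (("<head>", "x:head"), ("<title>", "x:head/x:title"), ("<body>", "x:body"))
    for label, path in checks:
        if root.find(path, ns) is None:
            problems.append(label)
    return problems

print(missing_required_parts(DOCUMENT))  # []
```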
XHTML Elements Must be Properly Nested
In XHTML, elements must always be properly nested within each other, like
this:
Correct:
<b><i>Some text</i></b>
Wrong:
<b><i>Some text</b></i>
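Because XHTML is XML, an XML parser rejects improperly nested elements. The Python sketch below shows this with the standard xml.etree.ElementTree module; the helper is_well_formed is a hypothetical name for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(fragment: str) -> bool:
    """Return True if `fragment` parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<b><i>Some text</i></b>"))  # True
print(is_well_formed("<b><i>Some text</b></i>"))  # False
```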
XHTML Elements Must Always be Closed
In XHTML, elements must always be closed, like this:
Correct:
<p>This is a paragraph</p>
<p>This is another paragraph</p>
Wrong:
<p>This is a paragraph
<p>This is another paragraph
XHTML Empty Elements Must Always be Closed
In XHTML, empty elements must always be closed, like this:
Correct:
A break: <br />
A horizontal rule: <hr />
An image: <img src="happy.gif" alt="Happy face" />
Wrong:
A break: <br>
A horizontal rule: <hr>
An image: <img src="happy.gif" alt="Happy face">
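The closing rules for both normal and empty elements can be tried out with an XML parser. This Python sketch wraps each sample in a <body> element and reports the parser's first complaint; the helper first_xml_error and the wrapping element are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

def first_xml_error(fragment: str):
    """Return the parser's first error message, or None if there is none."""
    try:
        ET.fromstring(f"<body>{fragment}</body>")
        return None
    except ET.ParseError as error:
        return str(error)

samples = [
    "<p>A break: <br /></p>",
    "<p>A break: <br></p>",
    "<p>This is a paragraph<p>This is another paragraph",
    '<p>An image: <img src="happy.gif" alt="Happy face" /></p>',
]
for sample in samples:
    print(first_xml_error(sample) or "well-formed", "|", sample)
```

The first and last samples are accepted; the two samples with unclosed elements produce an error message.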
XHTML Elements Must be in Lowercase
In XHTML, element names must always be in lowercase, like this:
Correct:
<body>
<p>This is a paragraph</p>
</body>
Wrong:
<BODY>
<P>This is a paragraph</P>
</BODY>
XHTML Attribute Names Must be in Lowercase
In XHTML, attribute names must always be in lowercase, like this:
Correct:
<a href="https://www.w3schools.com/html/">Visit our HTML tutorial</a>
Wrong:
<a HREF="https://www.w3schools.com/html/">Visit our HTML tutorial</a>
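XML is case-sensitive, so uppercase names are still well-formed and a parser alone does not report them. The Python sketch below walks the parsed tree and lists element and attribute names that are not lowercase. The helper non_lowercase_names is hypothetical and assumes fragments without namespace declarations.

```python
import xml.etree.ElementTree as ET

def non_lowercase_names(fragment: str) -> list:
    """List element and attribute names that contain uppercase letters."""
    found = []
    for element in ET.fromstring(fragment).iter():
        if element.tag != element.tag.lower():
            found.append(element.tag)
        for name in element.attrib:
            if name != name.lower():
                found.append(name)
    return found

print(non_lowercase_names("<BODY><P>This is a paragraph</P></BODY>"))
# ['BODY', 'P']
print(non_lowercase_names('<a HREF="https://www.w3schools.com/html/">Visit</a>'))
# ['HREF']
print(non_lowercase_names('<a href="https://www.w3schools.com/html/">Visit</a>'))
# []
```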
XHTML Attribute Values Must be Quoted
In XHTML, attribute values must always be quoted, like this:
Correct:
<a href="https://www.w3schools.com/html/">Visit our HTML tutorial</a>
Wrong:
<a href=https://www.w3schools.com/html/>Visit our HTML tutorial</a>
XHTML Attribute Minimization is Forbidden
In XHTML, attribute minimization is forbidden:
Correct:
<input type="checkbox" name="vehicle" value="car" checked="checked" />
<input type="text" name="lastname" disabled="disabled" />
Wrong:
<input type="checkbox" name="vehicle" value="car" checked />
<input type="text" name="lastname" disabled />
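One way to bring loosely written HTML tags into this form is to let a tolerant HTML parser read them and write them back out. The Python sketch below uses the standard html.parser module for this. The class StartTagNormalizer and the function normalize are hypothetical names, and the sketch handles start tags only, so it is an illustration rather than a complete converter.

```python
from html import escape
from html.parser import HTMLParser

class StartTagNormalizer(HTMLParser):
    """Collect start tags rewritten with quoted, non-minimized attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(self._render(tag, attrs, ""))

    def handle_startendtag(self, tag, attrs):
        self.tags.append(self._render(tag, attrs, " /"))

    @staticmethod
    def _render(tag, attrs, closing):
        parts = [tag]
        for name, value in attrs:
            # A minimized attribute arrives with the value None;
            # XHTML repeats the attribute name as its value.
            value = name if value is None else value
            parts.append(f'{name}="{escape(value, quote=True)}"')
        return "<" + " ".join(parts) + closing + ">"

def normalize(tag_source: str) -> str:
    """Rewrite the start tags in `tag_source` in XHTML style."""
    parser = StartTagNormalizer()
    parser.feed(tag_source)
    parser.close()
    return "".join(parser.tags)

print(normalize('<input type="checkbox" name="vehicle" value="car" checked />'))
print(normalize("<input type=text disabled>"))
```

The parser also lowercases element and attribute names, so the output follows the lowercase rules as well.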