Understand unicode & utf8 in perl (2)

Understand Unicode &
UTF8 in Perl
Avoid common issues and gain guru status.
(You too can be John)

Characters and Glyphs
A character: 'é'

Combination of 2 glyphs:

e (LATIN SMALL LETTER E)

Followed by:

◌́ (COMBINING ACUTE ACCENT)

Characters and Glyphs
A character: 'é'

Or a single precomposed glyph:

é (LATIN SMALL LETTER E WITH ACUTE)

So what is Unicode (in this context)?
A collection of characters (loosely called glyphs in these slides), each
assigned a code point: a unique number with a set of properties.
Example: E (U+0045)

Name         LATIN CAPITAL LETTER E
Block        Basic Latin
Category     Letter, Uppercase [Lu]
Combine      0
BIDI         Left-to-Right [L]
Lower case   U+0065
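
The same properties can be read from Perl itself. This is a minimal sketch using the core Unicode::UCD module; the choice of which fields to print is only for illustration.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Look up U+0045 in the Unicode database that ships with Perl.
my $info = charinfo(0x45);

# Print a few of the properties shown on the slide.
printf "%-10s %s\n", $_, $info->{$_}
    for qw(name block category combining bidi lower);
```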

What is a String?
An ordered collection of glyphs, i.e. an ordered
collection of Unicode code points.
In Perl:

my $s = "he";
or
my $s = "\N{U+0068}\N{U+0065}";
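
To see the "ordered collection" for yourself, a string can be split into its code points. A small sketch, where the variable names are only illustrative:

```perl
use strict;
use warnings;

my $s = "\N{U+0068}\N{U+0065}";

# split // yields one element per code point; ord gives its number.
my @points = map { sprintf 'U+%04X', ord } split //, $s;

print "@points\n";
```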

What is a String ? - The glyph Pitfall
An ordered collection of glyphs. There's more
than one way to write it.
In Perl:
my $s = "é";   # needs 'use utf8;' in the source file
is
my $s = "\N{U+00E9}"; OR..
my $s = "\N{U+0065}\N{U+0301}";

In practice, software usually prefers the first form (phew),
but not always. See Unicode::Normalize.
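
A short sketch of the pitfall and of what Unicode::Normalize does about it. The two spellings compare as different strings until both are brought to the same normalization form (NFC composes, NFD decomposes):

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\N{U+00E9}";            # one code point
my $decomposed  = "\N{U+0065}\N{U+0301}";  # e + combining acute accent

# Same character for a human, different strings for Perl.
my $equal_raw = $precomposed eq $decomposed ? 1 : 0;

# After normalization the two spellings agree.
my $equal_nfc = NFC($precomposed) eq NFC($decomposed) ? 1 : 0;
my $equal_nfd = NFD($precomposed) eq NFD($decomposed) ? 1 : 0;

printf "raw: %d, NFC: %d, NFD: %d\n", $equal_raw, $equal_nfc, $equal_nfd;
```

Normalizing on input (right after decoding) is one possible convention; which form to pick depends on what the rest of your system expects.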

How does Perl represent Strings?
Short answer: It's not your business.

Long answer: It depends :(

Only "latin1 characters" -> Latin1. Anything
outside that -> UTF-8.

Feeling fiddly, or fixing a bug? Use the utf8::* functions.
Bedtime read: perldoc perlunicode
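
If you do want to peek, here is a sketch with two of the utf8::* functions. It assumes a stock perl with no special flags; the point is that the internal format can change while the string itself stays the same.

```perl
use strict;
use warnings;

my $latin = "\x{E9}";     # fits in Latin1
my $wide  = "\x{262D}";   # does not fit in Latin1

# utf8::is_utf8 reports the internal storage format, nothing more.
my $latin_flag = utf8::is_utf8($latin) ? 1 : 0;
my $wide_flag  = utf8::is_utf8($wide)  ? 1 : 0;

# Force the internal UTF-8 representation on a copy.
my $copy = $latin;
utf8::upgrade($copy);
my $copy_flag = utf8::is_utf8($copy) ? 1 : 0;

# Different internals, same string.
my $same = $copy eq $latin ? 1 : 0;

print "latin=$latin_flag wide=$wide_flag copy=$copy_flag same=$same\n";
```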

Not my business? So what's this
fuss about UTF-8 encoding?
How strings are represented internally is not
your business.
How they are transmitted from/to the outside
world is.
The outside world doesn't understand 'Strings'.
It understands 'bytes'.

An encoding is a bijection:
Unicode Points (glyphs) <-> bytes

UTF-8 encoding
Unicode Points (glyphs) <-> bytes

Variable number of bytes per Unicode code point.
Examples:

a <-> \x{61} ,
☭ <-> \x{E2}\x{98}\x{AD} (gdrive FAIL)

Sometimes the bytes begin with a BOM (byte order mark).
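
The mapping can be checked with the core Encode module. A minimal sketch; the variable names are only illustrative:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $string = "\N{U+262D}";              # one code point
my $bytes  = encode('UTF-8', $string);  # three bytes

# Show the bytes in hex.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $bytes;
print "$hex\n";

# Decoding the bytes gives the original string back.
my $round_trip = decode('UTF-8', $bytes);
```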

The encoding law
Never transfer Strings. Always transfer Bytes.

But inside Perl: You want to work with Strings
as much as possible.

Sending: Encode as LATE as possible.

Receiving: Decode as EARLY as possible.
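
A sketch of the law in action. shout_utf8 is a hypothetical helper, not something from a library: it decodes on the way in, works on a String, and encodes on the way out. The last line shows what goes wrong when the decode step is skipped.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes UTF-8 bytes, returns upper-cased UTF-8 bytes.
sub shout_utf8 {
    my ($bytes_in) = @_;

    # Receiving: decode as EARLY as possible (die on malformed input).
    my $string = decode('UTF-8', $bytes_in,
                        Encode::FB_CROAK | Encode::LEAVE_SRC);

    # Inside Perl: work with Strings.
    my $shouted = uc $string;

    # Sending: encode as LATE as possible.
    return encode('UTF-8', $shouted);
}

my $input  = "caf\xC3\xA9";        # the bytes of "café" in UTF-8
my $output = shout_utf8($input);   # the bytes of "CAFÉ" in UTF-8

# Without decoding, uc only sees bytes and leaves the é untouched.
my $naive = uc $input;
```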

Common outside worlds: STDOUT
Latin1 encoding by default :(
-> You can only output 'Latin1 compliant
Strings'. And your shell should expect Latin1.

In the modern world:
# Set STDOUT to encode as UTF-8 (strict, unlike the lax ':utf8' layer)
binmode STDOUT, ':encoding(UTF-8)';
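
To see what an encoding layer does to the bytes that leave the program, this sketch uses an in-memory filehandle as a stand-in for STDOUT (an assumption made so the result can be inspected):

```perl
use strict;
use warnings;

# An in-memory handle with the same kind of layer you would put on STDOUT.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer)
    or die "Cannot open in-memory handle: $!";

# We print a String; the layer turns it into bytes.
print {$out} "\N{U+262D}\n";
close($out) or die "Cannot close in-memory handle: $!";

my $hex = join ' ', map { sprintf '%02X', ord } split //, $buffer;
print "$hex\n";
```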

Common outside worlds: A text file
If you know the file encoding:
open(my $fh, "<:encoding(UTF-8)", "filename")
    or die "Cannot open filename: $!";

If you don't know:
Maybe you can count on the BOM (byte order mark).

But you don't want that. You want to know for
sure -> set a convention.
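
A sketch of such a convention: everything is written and read as UTF-8, stated explicitly on both sides. The temporary file from File::Temp is only there to keep the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A scratch file, removed automatically at exit.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close($tmp_fh) or die "Cannot close $filename: $!";

# Writing: Strings go in, UTF-8 bytes land on disk.
open(my $out, '>:encoding(UTF-8)', $filename)
    or die "Cannot write $filename: $!";
print {$out} "caf\N{U+00E9} \N{U+262D}\n";
close($out) or die "Cannot close $filename: $!";

# Reading: UTF-8 bytes come off disk, Strings come out.
open(my $in, '<:encoding(UTF-8)', $filename)
    or die "Cannot read $filename: $!";
my $line = <$in>;
close($in) or die "Cannot close $filename: $!";
chomp $line;
```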

Common outside worlds: XML file
Encoding specified in the preamble:
<?xml version="1.0" encoding="utf-8"?>
If not specified -> UTF-8 is assumed.

Feed your XML parser with BYTES.
Write XML files in binary mode.

XML::LibXML calls bytes 'Strings'. People
are confused. Trust no one.

Common outside worlds: WWW
From a given page, browsers send parameters
in the encoding of the page.

Correctly encode your binary responses.

Decode $c->params()

In Catalyst:
Catalyst::Plugin::Unicode::Encoding

Common outside worlds: Your own
Every time you communicate with a system,
you will send/receive bytes. Never strings.

Think about encoding/decoding your strings
to/from bytes, according to what your system
expects/provides.

Sometimes, it's done automagically through
some library options.

Bug avoiding guidelines.
Test everything with Unicode characters.

English keyboard? chartables.de, unicode
lorem ipsum.

Unit test => "\N{U+262D}"

Never do I/O with strings. Never. I/O is about bytes.
Choose encodings explicitly.

Bonus: Escaping
What if you want to represent your nice shiny
UTF8 bytes as part of something else?

You need to escape them!

Example: escaping parameters in a URI
(URI::Escape):

http://foo.com/?q=%E2%98%AD
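
The two steps (encode, then escape) can be spelled out by hand. This is a sketch using only core modules; escape_param is a hypothetical helper, and in real code URI::Escape's uri_escape_utf8 does the same job.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode to UTF-8 bytes, then percent-escape them.
sub escape_param {
    my ($string) = @_;

    # Step 1: String -> bytes.
    my $bytes = encode('UTF-8', $string);

    # Step 2: escape every byte that is not an unreserved URI character.
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;

    return $bytes;
}

my $url = 'http://foo.com/?q=' . escape_param("\N{U+262D}");
print "$url\n";
```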

Bonus: Escaping for email headers
Encode AND Escape for Email subjects
(Encode with MIME-Q):

Encode::encode('MIME-Q', "a\N{U+262D}c");
=?UTF-8?Q?a=E2=98=ADc?=

It encodes and escapes at the same time.
Beware of confusion.
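
A round-trip sketch with the core Encode module. The result of the encode call is ASCII-only bytes that are safe in a header; decoding with 'MIME-Header' gives the String back.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $subject = "a\N{U+262D}c";

# String -> encoded AND escaped header bytes.
my $header = encode('MIME-Q', $subject);
print "$header\n";

# Header bytes -> String again.
my $decoded = decode('MIME-Header', $header);
```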

Keep Strings as Strings for as long as you can.

Conclusion
Make sure you distinguish between Strings and
Bytes. In Perl, that has to come from discipline.

Make sure you always encode/decode on i/o as
explicitly as possible. Don't let confused others
confuse you.

Always ask: what does this thing operate
on, Bytes or Strings? When in doubt, investigate.
