Understand Unicode & UTF-8 in Perl
Avoid common issues and gain guru status.
(You too can be John)
Characters and Glyphs
A character: 'é'

Combination of 2 glyphs:

e (LATIN SMALL LETTER E, U+0065)

Followed by:

◌́ (COMBINING ACUTE ACCENT, U+0301)
Characters and Glyphs
A character: 'é'

Or a combined glyph:

é (LATIN SMALL LETTER E WITH ACUTE)
So what is Unicode (in this context)?
A collection of glyphs (mainly), called code points, each with a unique number and a set of
properties.
Example: E (U+0045)

  Name             LATIN CAPITAL LETTER E
  Block            Basic Latin
  Category         Letter, Uppercase [Lu]
  Combining class  0
  BIDI class       Left-to-Right [L]
  Lower case       U+0065
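
A minimal sketch of how these properties could be looked up from Perl, assuming the core Unicode::UCD module is available. The variable name $info is only illustrative.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Look up U+0045 in the Unicode database that ships with Perl.
my $info = charinfo(0x45);

printf "%-16s %s\n",   'Name',            $info->{name};
printf "%-16s %s\n",   'Block',           $info->{block};
printf "%-16s %s\n",   'Category',        $info->{category};
printf "%-16s %s\n",   'Combining class', $info->{combining};
printf "%-16s %s\n",   'BIDI class',      $info->{bidi};
printf "%-16s U+%s\n", 'Lower case',      $info->{lower};
```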
What is a String?
An ordered collection of glyphs, i.e. an ordered
collection of Unicode code points.
In Perl:

my $s = "he";
or
my $s = "\N{U+0068}\N{U+0065}";
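
To check that both spellings really give the same string, something like this should do. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $literal = "he";
my $escaped = "\N{U+0068}\N{U+0065}";

# Both spellings produce the same two-code-point string.
print "same string\n" if $literal eq $escaped;

# ord() returns the code point number of a character.
my @points = map { sprintf 'U+%04X', ord } split //, $literal;
print "@points\n";    # U+0068 U+0065
```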
What is a String ? - The glyph Pitfall
An ordered collection of glyphs. There's more
than one way to write it.
In Perl:
my $s = "é";   # needs "use utf8;" and a UTF-8 source file
is
my $s = "\N{U+00E9}"; OR..
my $s = "\N{U+0065}\N{U+0301}";

In practice, software prefers the first way (pffui),
but not always. See Unicode::Normalize.
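
A small sketch of the pitfall, assuming the core Unicode::Normalize module. The decomposed form here uses U+0301 COMBINING ACUTE ACCENT as the second code point.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\N{U+00E9}";             # é as one code point
my $decomposed  = "\N{U+0065}\N{U+0301}";   # e + COMBINING ACUTE ACCENT

# They look the same on screen but are different strings.
print "raw: ", ($precomposed eq $decomposed ? "equal" : "different"), "\n";

# Normalizing both sides to the same form makes the comparison meaningful.
print "NFC: ", (NFC($precomposed) eq NFC($decomposed) ? "equal" : "different"), "\n";

printf "lengths: %d vs %d\n", length($precomposed), length($decomposed);
```

Normalizing at the boundary (right after decoding) is one possible convention; which form to pick depends on what the rest of your system expects.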
How does Perl represent Strings?
Short answer: It's not your business.

Long answer: It depends :(

Only "latin1 characters" -> Latin1. Anything
outside that -> UTF-8.

Feeling fiddly, bug fixing? Use the utf8::* functions.
Bedtime read: perldoc perlunicode
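
For the fiddly: a sketch showing that the internal representation can change without the string changing. It only uses the built-in utf8::upgrade, utf8::downgrade and utf8::is_utf8 functions; the variable names are illustrative.

```perl
use strict;
use warnings;

my $original = chr(0xE9);    # é, a character that fits in Latin-1
my $copy     = $original;

utf8::downgrade($original);  # one byte per character internally
utf8::upgrade($copy);        # UTF-8 internally

printf "original is UTF-8 internally: %s\n", utf8::is_utf8($original) ? 'yes' : 'no';
printf "copy is UTF-8 internally:     %s\n", utf8::is_utf8($copy)     ? 'yes' : 'no';

# Different storage, same one-character string.
print "same string either way\n" if $original eq $copy && length($copy) == 1;
```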
Not my business? So what's this fuss about UTF-8 encoding?
How strings are represented internally is not
your business.
How they are transmitted from/to the outside
world is.
The outside world doesn't understand 'Strings'.
It understands 'bytes'.

An encoding is a bijection:
Unicode code points (glyphs) <-> bytes
UTF-8 encoding
Unicode code points (glyphs) <-> bytes

Variable number of bytes per Unicode code point.
Examples:

a <-> \x{61} ,
☭ <-> \x{E2}\x{98}\x{AD} (gdrive FAIL)

Sometimes, the bytes begin with a BOM (byte order mark).
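
The mapping can be observed with the core Encode module. This is only a sketch; stripping the BOM by hand after decoding is one possible way to deal with it, not the only one.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

for my $char ("a", "\N{U+262D}") {
    my $bytes = encode('UTF-8', $char);
    my $hex   = join ' ', map { sprintf '%02X', ord } split //, $bytes;
    printf "U+%04X -> %d byte(s): %s\n", ord($char), length($bytes), $hex;
}
# U+0061 -> 1 byte(s): 61
# U+262D -> 3 byte(s): E2 98 AD

# A UTF-8 BOM is the three bytes EF BB BF. Decoding turns it into U+FEFF,
# which is usually stripped before the text is used.
my $with_bom = "\xEF\xBB\xBF" . "hi";
my $text     = decode('UTF-8', $with_bom);
$text =~ s/\A\x{FEFF}//;
print "text without BOM: $text\n";
```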
The encoding law
Never transfer Strings. Always transfer Bytes.

But inside Perl: You want to work with Strings
as much as possible.

Sending: Encode as LATE as possible.

Receiving: Decode as EARLY as possible.
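
The law in code, as a sketch using the core Encode module. $incoming_bytes is a made-up stand-in for whatever a socket or file would hand you.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: bytes as they arrive from the outside world.
my $incoming_bytes = "caf\xC3\xA9";

# Receiving: decode as EARLY as possible (and die on malformed input).
my $string = decode('UTF-8', $incoming_bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);

# Inside Perl: work on characters, not bytes.
my $shout = uc $string;

# Sending: encode as LATE as possible.
my $outgoing_bytes = encode('UTF-8', $shout);

printf "%d characters, %d bytes\n", length($shout), length($outgoing_bytes);
# 4 characters, 5 bytes
```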
Common outside worlds: STDOUT
Latin1 encoding by default :(
-> You can only output 'Latin1 compliant
Strings' (anything else triggers a "Wide character"
warning). And your shell should expect Latin1.

In the modern world:
# Set STDOUT to encode as UTF-8
binmode STDOUT, ':encoding(UTF-8)';
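
A sketch of what the layer does. The in-memory handle ($out, $buffer) is only there to make the written bytes visible; it is not something the slides prescribe.

```perl
use strict;
use warnings;

# Encode everything printed to STDOUT as UTF-8 from here on.
binmode STDOUT, ':encoding(UTF-8)';
print "hammer and sickle: \N{U+262D}\n";

# The same layer on an in-memory handle shows the bytes that get written.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer
    or die "Cannot open in-memory handle: $!";
print {$out} "\N{U+262D}";
close $out;

printf "1 character became %d bytes\n", length $buffer;
```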
Common outside worlds: A text file
If you know the file encoding:
open(my $fh, "<:encoding(UTF-8)", "filename") or die "Cannot open filename: $!";

If you don't know:
Maybe you can count on the BOM (byte order mark).

But you don't want that. You want to know for
sure -> set a convention.
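
A round trip under a "files are UTF-8" convention might look like this. It is a sketch that writes to a temporary file created with the core File::Temp module, so no real filename is assumed.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close $tmp_fh;

# Writing: the layer encodes characters to UTF-8 bytes on the way out.
open my $out, '>:encoding(UTF-8)', $filename or die "Cannot write $filename: $!";
print {$out} "caf\N{U+00E9} \N{U+262D}\n";
close $out or die "Cannot close $filename: $!";

# Reading with the same convention gives the characters back.
open my $in, '<:encoding(UTF-8)', $filename or die "Cannot read $filename: $!";
my $line = <$in>;
close $in;
chomp $line;

printf "%d characters read\n", length $line;
```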
Common outside worlds: XML file
Encoding specified in the preamble:
<?xml version="1.0" encoding="utf-8"?>
If not specified -> utf8 is assumed.

Feed your XML parser with BYTES.
Write XML files in binary mode.

XML::LibXML calls bytes 'strings'. People
are confused. Trust no one.
Common outside worlds: WWW
From a given page, browsers send parameters
in the encoding of the page.

Correctly encode your binary responses.

Decode $c->params()

In Catalyst:
Catalyst::Plugin::Unicode::Encoding
Common outside worlds: Your own
Every time you communicate with a system,
you will send/receive bytes. Never strings.

Think about encoding/decoding your strings
to/from bytes, according to what your system
expects/provides.

Sometimes, it's done automagically through
some library options.
Bug avoiding guidelines.
Test everything with Unicode characters.

English keyboard? chartables.de, unicode
lorem ipsum.

Unit test => "\N{U+262D}"

Never i/o strings. Never. i/o is about bytes.
Choose encodings explicitly.
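
A unit test along these lines, as a sketch using the core Test::More module. to_wire and from_wire are hypothetical stand-ins for whatever your code does at its i/o boundary.

```perl
use strict;
use warnings;
use Test::More;
use Encode qw(encode decode);

# Hypothetical functions under test: replace with your own i/o boundary.
sub to_wire   { return encode('UTF-8', $_[0]) }
sub from_wire { return decode('UTF-8', $_[0]) }

# A test string that cannot survive a Latin1-only code path.
my $sample = "a\N{U+262D}c";

is(length(to_wire($sample)), 5, 'three characters become five bytes on the wire');
is(from_wire(to_wire($sample)), $sample, 'round trip preserves the string');

done_testing();
```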
Bonus: Escaping
What if you want to represent your nice shiny
UTF8 bytes as part of something else?

You need to escape them!

Example in URI, escaping parameters:
(URI::Escape):

http://foo.com/?q=%E2%98%AD
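
The two steps (encode, then escape) can be sketched by hand with core modules only. escape_param and unescape_param are hypothetical helpers written for this illustration; the slide's URI::Escape is the module to reach for in real code.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: string -> UTF-8 bytes -> percent-escaped ASCII.
sub escape_param {
    my ($string) = @_;
    my $bytes = encode('UTF-8', $string);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf '%%%02X', ord $1/ge;
    return $bytes;
}

# Hypothetical helper: percent-escaped ASCII -> UTF-8 bytes -> string.
sub unescape_param {
    my ($escaped) = @_;
    (my $bytes = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;
    return decode('UTF-8', $bytes);
}

my $url = 'http://foo.com/?q=' . escape_param("\N{U+262D}");
print "$url\n";    # http://foo.com/?q=%E2%98%AD
```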
Bonus: Escaping for email headers
Encode AND Escape for Email subjects
(Encode with MIME-Q):

Encode::encode('MIME-Q', "a\N{U+262D}c");
=?UTF-8?Q?a=E2=98=ADc?=

It encodes and escapes at the same time.
Beware of confusion.

Keep strings as strings for as long as you can.
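
A round-trip sketch with the core Encode module. The exact header text can vary between Encode versions, so the example checks properties of the result rather than printing an expected value.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $subject = "a\N{U+262D}c";

# Encode to UTF-8 bytes and escape them for an email header in one call.
my $header_value = encode('MIME-Q', $subject);
print "Subject: $header_value\n";

# The result is plain ASCII bytes, safe to put in a header.
print "ASCII only\n" unless $header_value =~ /[^\x00-\x7F]/;

# Decoding reverses both steps and gives the string back.
my $round_trip = decode('MIME-Header', $header_value);
print "round trip ok\n" if $round_trip eq $subject;
```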
Conclusion
Make sure you distinguish between Strings and
Bytes. In Perl, it must come from discipline.

Make sure you always encode/decode on i/o as
explicitly as possible. Don't let confused others
confuse you.

Always wonder: What does this thing operate
on? Bytes or Strings? When in doubt, investigate.
