What Unicode Means (or, why the Content-Type header is important)

Back in the day, when all computer engineers spoke English and lived in the USA, there was a wonderful standard called ASCII for representing text. It mapped each letter of the English alphabet, the digits, and punctuation (as well as some special symbols) to a number between 0 and 127. For example, the letter "a" was 97. Since a byte can hold any of 256 values, this meant that text could be encoded using just one byte per character, with plenty of unused codes left over.

Computer manufacturers all had their own ideas about what meaning to assign to the unused codes. They would represent accented Latin characters if you bought a computer in Western Europe, Hebrew characters if you bought a computer in Israel, etc. For people writing in East Asian languages, there were even more complicated schemes which required using two bytes for each character, because those languages have more than 128 additional characters that need to be represented (thousands, in fact).

Eventually ANSI standardized these extensions to ASCII by creating the concept of "code pages": each code page was an encoding that mapped some set of characters to the codes between 128 and 255, although the codes from 0 to 127 were all the same (that is, the same as ASCII). If you are using Windows in any Western European language, you may have noticed your computer mentioning that it is using the Windows-1252 encoding, which is one such code page (it includes, mostly, accented Latin characters like è).

Standardizing the code pages did not make them any less of a mess. Sharing files with others was problematic, because your messages would appear garbled if the recipient used a different encoding while opening the file. (This is what you are observing whenever you get emails which contain question marks or rectangles or characters like çëêèîôö until you tell your email program or web browser to use a different encoding.) The problem is that the meaning of a sequence of bytes is ambiguous unless you know what mapping between characters and bytes the sender used. Even worse, you could only use one code page at a time: for example, a document could not contain both Hebrew and Cyrillic letters; a database could not contain text from languages requiring two different code pages unless it stored the associated code page alongside each chunk of text.
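To make the ambiguity concrete, here is a quick illustration (jumping ahead to Python, which we'll get to properly below): the same two bytes mean Hebrew letters under one Windows code page and Cyrillic letters under another. The particular code pages, cp1255 and cp1251, are just examples I've picked.

>>> '\xe0\xe1'.decode('cp1255')   # Windows-1255 (Hebrew): aleph, bet
u'\u05d0\u05d1'
>>> '\xe0\xe1'.decode('cp1251')   # Windows-1251 (Cyrillic): a, be
u'\u0430\u0431'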

Unicode was designed to solve this last problem by creating a single character set that could represent characters from any language. In the Unicode standard, each character (of which about 100,000 have been standardized now) is mapped to a "code point", which is just a number. For example, code point 65 is the letter "A" and code point 33865 is the Chinese character 葉 (usually, code points are represented in hex so people would typically refer to this character as U+8449).
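Again in Python, purely as an illustration, you can inspect the mapping between characters and code points directly:

>>> ord(u"A"), ord(u"\u8449")    # characters -> code points
(65, 33865)
>>> unichr(65), hex(33865)       # and back again: 33865 is 0x8449, i.e. U+8449
(u'A', '0x8449')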

Notice that we've only been talking about characters and their corresponding integers (code points): we've said nothing about how these integers are to be represented as bytes. As a first attempt you might choose to represent the code points using fixed-length sequences of four bytes: code points 0, 1, 2, etc. become 0x00000000, 0x00000001, 0x00000002, etc., respectively. (Three bytes would actually be enough to represent up to 16,777,216 characters, which is plenty for now, but four is a nice round number.) This encoding (way of mapping code points to byte sequences) is called UTF-32. It's very simple, but people whose text was mostly ASCII characters would have to pay a 4x space overhead to encode all their data in UTF-32. So it was really hard to sell them on UTF-32, which, consequently, is rarely used.
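For example, here is what a short string looks like in UTF-32 (big-endian byte order), again using Python just as an illustration; note the four bytes per character:

>>> u"AB\u8449".encode('utf-32-be')   # the final 0x49 byte happens to display as 'I'
'\x00\x00\x00A\x00\x00\x00B\x00\x00\x84I'
>>> len(u"AB\u8449".encode('utf-32-be'))
12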

There is another encoding called UTF-8 which maps code points to byte sequences of varying lengths: some code points can be written using just one byte, while others need as many as four. UTF-8 has the interesting property that Unicode code points 0-127, which are the same as ASCII characters, are mapped to the one-byte sequences 0x00-0x7f (0-127). What this means is that every ASCII text means exactly the same thing when interpreted as UTF-8! It is because of this backwards compatibility that UTF-8 has become the standard for representing text on the web, in Java, and as a part of many other standards. There is another encoding called UTF-16 which uses two or four bytes to represent each character (Windows uses it a lot internally). The UTF-* encodings can represent any character in Unicode, which is a superset of pretty much every other character set out there. So text in any encoding scheme can be reversibly converted to UTF-8 or UTF-16.
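A couple of quick checks make the contrast concrete: ASCII text encodes to exactly the same bytes under UTF-8, while non-ASCII characters simply take more bytes. (The sample characters here are just ones I've picked: 'A', 'é', and '葉'.)

>>> u"ABC".encode('utf-8') == u"ABC".encode('ascii')
True
>>> [len(c.encode('utf-8')) for c in u"A\u00e9\u8449"]      # 'A', 'é', '葉'
[1, 2, 3]
>>> [len(c.encode('utf-16-be')) for c in u"A\u00e9\u8449"]  # all three fit in two bytes in UTF-16
[2, 2, 2]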

To understand Unicode, it's important to make the distinction between a sequence of Unicode code points (abstractly, characters) and an encoding of those characters, which is just a sequence of bytes. With that in mind we can start to understand some of the mysteries of text.

Unicode in Python

[In Python 3.x, sequences of Unicode code points are represented by str, the default string type. So you don't need to use the u prefix on string literals as shown below. String data that has been encoded, and thus no longer carries the implication that it is a sequence of Unicode code points, is represented using bytes (as is data that is just random binary data). More information about str and bytes.]
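(For comparison with the Python 2 examples below, here is roughly how encoding and decoding look in Python 3, where the default string type already holds code points:)

>>> "AB\u8449".encode('utf-8')          # str (code points) -> bytes
b'AB\xe8\x91\x89'
>>> b'AB\xe8\x91\x89'.decode('utf-8')   # bytes -> str
'AB葉'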

In Python 2.x, "sequences of Unicode code points" are represented by unicode objects; you can make Unicode literals like so:

z = u"\u0041\u0042C" # Equivalent to u"ABC"

To do any I/O on a Unicode string (printing it, writing it to a file, or sending it out over the network), you have to choose an encoding for it. The process of encoding always transforms your unicode string into a str string, which is often assumed to mean "ASCII text" but really just means "sequence of bytes in some encoding".

Obviously, the encoding you choose has to be the same as the encoding that will be used to read it (that is, the encoding used by your terminal, the program that will open the file, or the computer on the other end of the network connection). If you don't choose one explicitly, Python will silently use your default encoding, which is usually ASCII, whenever it needs to convert a unicode string down to a byte sequence. If your text is all ASCII characters, you probably won't notice anything funny going on. But if it contains any international (non-ASCII) characters, Python will raise a UnicodeEncodeError, which basically means "I tried to do the encoding, but one of the characters doesn't have an equivalent in ASCII!". There are two ways around this, both of which involve explicitly specifying the encoding using the encode method:

  1. If you intend for your recipient to be able to read the non-ASCII characters, you need to choose an encoding that can actually represent them. What you probably want here is to encode to UTF-8:

    >>> u"\u0041\u0042\u8449".encode('utf-8')
    'AB\xe8\x91\x89'

    Observe that the last character was encoded using three bytes.

  2. If your recipient really is expecting ASCII text, then they just can't view international characters. You can tell Python to attempt to do the encoding but replace the characters with question marks when the target character set can't represent the character:

    >>> u"\u0041\u0042\u8449".encode('ascii', 'replace')
    'AB?'

As you can see, converting to ASCII (and other 1-byte encodings) is lossy, but converting to and from UTF-8, UTF-32, etc. is reversible.
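Putting it together, here is the round trip in both directions, plus the error you get when you encode to ASCII with no error handler (the exact UnicodeEncodeError wording may vary a little between Python versions):

>>> u"AB\u8449".encode('utf-8').decode('utf-8')              # lossless round trip
u'AB\u8449'
>>> u"AB\u8449".encode('ascii', 'replace').decode('ascii')   # lossy: the original character is gone for good
u'AB?'
>>> u"AB\u8449".encode('ascii')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\u8449' in position 2: ordinal not in range(128)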

Unicode in HTML documents

Data served over HTTP also has to have an associated encoding to be interpreted unambiguously. Usually the Content-Type header is used to identify the encoding for the document. The W3C's HTML validator will yell at you if you do not specify the encoding:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8">

But, you may have noticed a weird circularity here... how can a client read any data on the page, much less the encoding, if it doesn't already know which encoding to use to interpret the contents of the page? Well, you're supposed to put the Content-Type <meta> tag as the first thing after <head>, and you can usually get that far without using any non-ASCII characters, and ASCII, UTF-8, and any ANSI code page all agree on the meaning of the ASCII characters (code points 0-127). So it works.
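For reference, when the server declares the encoding itself, the corresponding HTTP response header looks like this (the <meta http-equiv> tag above is just a way of embedding the same information in the document itself):

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8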

In XML, you are supposed to identify the charset on the first line, in the XML declaration:

<?xml version="1.0" encoding="UTF-8"?>

Unicode in text files

Even text files are not immune. In fact, when you have multiple encodings, there is no longer any such thing as "plain text" (by which we mean, text data that can be read unambiguously).

In Emacs, to set the coding system for a file, set the coding variable by adding this magic incantation anywhere in the first line (or second line, if the first line is a shebang line):

-*- coding: utf-8 -*-
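Conveniently, the same incantation, placed inside a comment, also works as Python 2's source-encoding declaration (PEP 263), so a script containing non-ASCII string literals typically starts like this (the file contents below are just an illustration):

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Because of the coding line above, non-ASCII characters may appear in this source file.
greeting = u"héllo, wörld"

print greeting.encode('utf-8')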

Further reading:

  1. Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively, Must Know About Unicode and Character Sets
  2. Unicode FAQ
  3. Python: Overview of Encodings and Unicode
  4. Emacs manual: (info "(emacs)Specifying File Variables")
