Character set encoding

In the days when ASCII ruled supreme, programmers (in North America at least) didn't have to worry much about character set encoding. But with the advent of Unicode, character set encoding has become an important issue for anyone doing text processing.

First of all, what is a character set? Simply, a character set is a table in which a character or glyph (the physical representation of a letter, digit, or piece of punctuation) is assigned a character number. The character number is then used to represent the character in the computer.

ASCII is the most prevalent character set on the planet, but it is limited to 128 character codes (0 to 127) and only includes those characters used in written English. It does not include the accented characters used in many western European languages, the Cyrillic alphabet used in eastern Europe, the Greek alphabet, or the characters used to write Japanese or Chinese.

Several character sets extend ASCII by another 128 character numbers (128-255), each one with a different set of characters corresponding to the same 128 character numbers. The most prevalent of these is Latin 1 (ISO 8859-1), which has long been a common default on UNIX systems and is the basis of the Windows-1252 character set used on Windows.

Since many character sets use character codes 128-255, this creates the obvious problem that you have to know which character set was intended if you want to correctly view documents created using those character sets. Some document formats, such as HTML, have mechanisms for encoding in the document itself which character set is being used, but there is no general or universal means of recording this information. You just have to know.
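To make the ambiguity concrete, here is a minimal sketch in Python (used only for illustration; it is not OmniMark code). It interprets the same single byte, 233, under three ISO 8859 character sets. The choice of these three sets is an assumption made for the example; the codec names are those of Python's standard library.

```python
# One byte, three interpretations, depending on the character set assumed.
raw = bytes([233])

interpretations = {
    "latin-1": raw.decode("latin-1"),       # Latin 1 (ISO 8859-1)
    "iso8859_5": raw.decode("iso8859_5"),   # Cyrillic
    "iso8859_7": raw.decode("iso8859_7"),   # Greek
}

for name, char in interpretations.items():
    # Show the Unicode character number each character set assigns to byte 233.
    print(f"{name}: byte 233 -> U+{ord(char):04X}")
```

Nothing in the byte itself says which of the three characters was meant.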

To solve this Tower of Babel in character sets we now have Unicode, which aims to incorporate all the writing systems of the world into a single character set.

Character numbers and character set encodings

Previous character sets diverged because of the need to fit a large number of characters into the small space provided by the 256 numbers that could be represented by a single byte of computer memory. Unicode abandons the 256 character number restriction. As originally designed, it allows for 65536 character numbers (0 to 65535), the number that can be represented by 2 bytes. (Actually, the number is higher, because Unicode was later extended to allow character numbers up to 1114111 for less frequently used characters, but the characters in common use fall within the basic 2-byte, 16-bit range.)

Unicode characters are identical to ASCII characters for character numbers 0 through 127. This is good, because it means that the millions of ASCII documents out in the world are also valid Unicode documents. The problem is that those ASCII documents are represented by computer files that use exactly one byte to represent each ASCII character. Unicode documents can have character numbers up to 65535, which requires two bytes. So how is a Unicode application to recognize an ASCII file as an ASCII file with one-byte characters instead of as a Unicode file with two-byte characters?

The answer to this problem lies in how character codes are represented as bytes or sequences of bytes in computer memory. In ASCII this is very simple. A character number is encoded as a single byte and the binary value of that byte corresponds to the character number. This system is so simple and so prevalent that we may have come to think of the character number and its byte representation as the same thing. They are not.

For a start, consider the string Mary. In ASCII, the character numbers for this string are: 77 97 114 121. In a computer file they would be represented by bytes with the corresponding values:

  77 97 114 121

In Unicode, the string Mary has exactly the same character numbers: 77 97 114 121. In a computer file, assuming that we use two bytes per character, they would be represented as bytes with the following values:

  0 77 0 97 0 114 0 121

These two byte sequences represent the same character numbers, 77 97 114 121, but only under two different interpretations. The convention that interprets the second set of bytes as 2-byte Unicode characters is called UCS-2. Obviously, UCS-2 is incompatible with ASCII, even though the character numbers are the same. An ASCII-based program won't interpret the UCS-2 encoding properly, nor will a UCS-2-based program interpret the ASCII encoding properly.
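The two byte sequences for Mary can be reproduced with a short Python sketch (Python is used only for illustration). One assumption to note: Python has no codec named "ucs-2", so the sketch uses "utf-16-be", which produces the same bytes as UCS-2 with the high byte first for the characters used here.

```python
text = "Mary"

# One byte per character: the byte value is the character number.
ascii_bytes = list(text.encode("ascii"))

# Two bytes per character, high byte first. "utf-16-be" stands in for
# UCS-2 here (an assumption of this sketch, see the note above).
ucs2_bytes = list(text.encode("utf-16-be"))

print(ascii_bytes)  # [77, 97, 114, 121]
print(ucs2_bytes)   # [0, 77, 0, 97, 0, 114, 0, 121]

# Reading the two-byte form with the one-byte convention yields eight
# characters, four of them NUL, instead of the four letters of "Mary".
misread = bytes(ucs2_bytes).decode("ascii")
print(len(misread))  # 8
```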

There is, however, an encoding scheme for Unicode characters that is compatible with ASCII. It is called UTF-8. Under UTF-8, the byte encoding of character numbers between 0 and 127 is the binary value of a single byte, just like ASCII. For character numbers between 128 and 65535, however, two or three bytes are used. The first byte of such a sequence has a value of 192 or more, and its high-order bits indicate the number of bytes that are to follow. Each of the following bytes has a value between 128 and 191. Together, the bytes of the sequence encode a single character number. Because every byte in a multi-byte sequence has a value of 128 or more, there will never be any confusion between single-byte characters between 0 and 127 and bytes that are part of multi-byte character representations.

For example, the character é which has a character number of 233 in both Latin 1 and Unicode, is represented by a single byte with a value 233 in conventional Latin 1 encoding, but is represented as two bytes with the values 195 and 169 in UTF-8.
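The arithmetic behind the two bytes 195 and 169 can be sketched as follows. This is an illustrative Python function, not part of OmniMark or its UTF-8 Encoding library; the name encode_utf8 is hypothetical, and the sketch covers only character numbers 0 to 65535.

```python
def encode_utf8(character_number: int) -> bytes:
    """Encode one character number (0 to 65535) as UTF-8 bytes by hand."""
    if character_number < 0x80:
        # 0-127: a single byte, identical to ASCII.
        return bytes([character_number])
    if character_number < 0x800:
        # 128-2047: lead byte 110xxxxx, then one continuation byte 10xxxxxx.
        return bytes([0xC0 | (character_number >> 6),
                      0x80 | (character_number & 0x3F)])
    if character_number < 0x10000:
        # 2048-65535: lead byte 1110xxxx, then two continuation bytes.
        return bytes([0xE0 | (character_number >> 12),
                      0x80 | ((character_number >> 6) & 0x3F),
                      0x80 | (character_number & 0x3F)])
    raise ValueError("character number outside the range covered here")


print(list(encode_utf8(233)))  # [195, 169]
print(list(encode_utf8(77)))   # [77]
```

The character number's bits are split into groups of six, and each group is placed in a byte whose high bits mark it as either a lead byte or a continuation byte.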

While an ASCII application will not recognize and understand the multi-byte characters in a Unicode file, if the Unicode file happens to contain only characters in the ASCII range (as it often will) then an ASCII-based application will work just fine. Similarly, a UTF-8-based application will work fine with all ASCII files, since they are also valid UTF-8 files.

OmniMark and character encodings

You have two important character-encoding issues to consider in OmniMark.

The first issue is that an OmniMark program is itself a text document. The OmniMark compiler is the program that reads and interprets this text file. What character encodings does the compiler support? What character encodings can I use to write my OmniMark programs?

The second issue is how OmniMark handles character encodings in input and how it encodes output. Can I use OmniMark to process UTF-8 encoded data?

The answer to both questions is that OmniMark does not depend on character encodings at all. Both the OmniMark compiler and the OmniMark language itself operate at the byte level. They make no automatic interpretation of bytes into character numbers.

As far as the compiler is concerned, the consequences of this are minor but useful. OmniMark names allow a limited set of characters (in fact, byte values) between 0 and 127, and all characters (byte values) between 128 and 255. This means that all the bytes in a UTF-8 multi-byte character are legitimate name bytes. This in turn means that you can write an OmniMark program in Unicode using UTF-8 encoding and everything will work fine. The compiler is not doing any recognition of these bytes as Unicode characters. It is dealing with them strictly as bytes. This means that a UCS-2 encoding of the same Unicode program would not work. (It is very easy to write an OmniMark program to translate from UCS-2 to UTF-8 and back again using the UTF-8 Encoding library, so this should not present a problem.)

As far as handling input (and output) data is concerned, things are very simple, but you may need to think about a few issues.

OmniMark works as a byte processor, not a character number processor. When you put the letter a in a pattern in an OmniMark program, it is in fact an instruction to match a byte with the value 97. That value is taken from the byte encoding of the OmniMark program itself. This means that if you write your program in a UTF-8 enabled editor and write a pattern that looks for a Unicode character, it will show up on your screen as a single character but will be represented in the program file as a sequence of bytes and, as a pattern, it will match that sequence of bytes. As long as the editor is UTF-8 enabled and the data file is UTF-8 encoded, everything works just as it did in the ASCII world. (Note, however, that including a multi-byte character in a character class will not work. The character class will consist of the individual byte values, not the single multi-byte character. There is no way to include a multi-byte character in a character class.)
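The character class pitfall can be illustrated by analogy with any byte-level pattern matcher. The sketch below uses Python's re module on byte strings; it is an analogy only and does not show OmniMark syntax or behavior.

```python
import re

e_acute = "\u00e9".encode("utf-8")   # é -> bytes 195 169
a_tilde = "\u00e3".encode("utf-8")   # ã -> bytes 195 163

# As a literal sequence, the two bytes of é match only é.
sequence = re.compile(re.escape(e_acute))
print(bool(sequence.search(e_acute)))  # True
print(bool(sequence.search(a_tilde)))  # False

# Inside a character class, the same two bytes become two separate
# alternatives: "byte 195 or byte 169". The class therefore matches the
# first byte of ã, which is not what was intended.
byte_class = re.compile(b"[" + e_acute + b"]")
print(byte_class.search(a_tilde).group())  # b'\xc3'
```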

The combination of ASCII editor and ASCII data works fine. So does the combination of UTF-8 editor and UTF-8 data, as well as UTF-8 editor and ASCII data. Issues arise, however, when you use an ASCII editor (such as the OmniMark Studio for Eclipse) to write a program to process or create UTF-8 encoded data. Since XML is a Unicode-based language, this is an issue for many programmers.

How do you represent a UTF-8 byte encoding of a Unicode character using an ASCII editor? How do you decode a UTF-8 byte sequence into its Unicode character code? What happens to Unicode characters in your XML document? When the parser sees a numeric character entity greater than 127 in an XML or SGML file, how does it encode the character number in the output?

The answer to the first two questions is to make use of the utf8.char pattern and the utf8.code-point and utf8.encoding functions in the UTF-8 Encoding library.

The handling of Unicode characters in an XML file is transparent as long as the XML file is encoded in UTF-8. Unicode characters will be passed to the OmniMark program as UTF-8 byte sequences, and output as such.

If you want to output characters in some other encoding, you will need to capture them in a data-content rule and use the UTF-8 Encoding library to decode the character numbers and output them in the encoding of your choice. The following example translates from UTF-8 encoded Latin 1 characters to conventional one-byte Latin 1 encoding. This would be appropriate in translating from XML to a Windows or UNIX text file:

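  import "omutf8.xmd" prefixed by utf8.

  ; Thrown when a character has no single-byte Latin 1 representation.
  declare catch too-big (value string c)

  data-content
     local integer character-code
     ; Scan the parsed character data one UTF-8 character at a time.
     repeat scan "%c"
     match utf8.char => c
        ; Decode the UTF-8 byte sequence into its character number.
        set character-code to utf8.code-point of c
        throw too-big (c)
           when character-code > 255
        ; Write the character number out as a byte.
        output "b" % character-code
     again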
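For readers more familiar with other languages, the same translation can be sketched in Python. This is only an analogy to the OmniMark rule above: the names TooBig and utf8_to_latin1 are hypothetical, and the sketch works on a complete byte string rather than on parsed XML content.

```python
class TooBig(Exception):
    """Hypothetical counterpart of the too-big catch in the OmniMark example."""


def utf8_to_latin1(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as one-byte-per-character Latin 1."""
    out = bytearray()
    for char in data.decode("utf-8"):
        character_code = ord(char)
        if character_code > 255:
            # No single-byte Latin 1 representation exists for this character.
            raise TooBig(char)
        out.append(character_code)
    return bytes(out)


print(list(utf8_to_latin1(bytes([77, 195, 169]))))  # [77, 233]
```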

Note that the parser does not guarantee that all XML characters will be output in UTF-8. It guarantees that XML characters will be output the same way they were input. It is up to you to determine the encoding of your input file. For programmers dealing with English language XML files, this will not usually be an issue, since ASCII and UTF-8 encodings are identical in the range used to write English. Programmers who use UTF-8 all the time will also have little difficulty. The real issue comes with XML files that use Latin 1 characters, but no characters with numbers over 255. These could be encoded in either single-byte encoding or UTF-8. It will be your responsibility to determine the encoding of your input data and to ensure you create the correct encoding of your output data.
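The following Python sketch (illustration only, not OmniMark code) shows why the encoding of such files has to be known rather than guessed. The same character, é, is encoded both ways and each form is then read with the wrong convention.

```python
single_byte = bytes([233])    # é in one-byte Latin 1
utf8 = bytes([195, 169])      # é in UTF-8

# UTF-8 bytes read as Latin 1 decode without error, but to the wrong text:
# two characters (numbers 195 and 169) instead of one.
misread = utf8.decode("latin-1")
print([ord(c) for c in misread])  # [195, 169]

# Single-byte Latin 1 read as UTF-8 is rejected, because byte 233 announces
# a multi-byte sequence that never arrives.
try:
    single_byte.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
print(rejected)  # True
```

The first mistake is the more dangerous of the two, because it produces no error at all.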

The with utf-8 modifier can be applied to the do xml-parse and do sgml-parse actions to change how numeric character entities are handled by the parser. The default is with utf-8 true for do xml-parse and with utf-8 false for do sgml-parse.

Note that the XML parser can handle XML files with ASCII encoding (character numbers 0 to 127 encoded as a single byte), Latin 1 encoding (character numbers 0 to 255 encoded as a single byte), and UTF-8 encoding (character numbers 0 to 65535 encoded as a variable number of bytes), but will not work with other encodings of Unicode characters.
