UTF-16 (OMFFUTF16)

This library provides one OmniMark external source function and one OmniMark external output function:

reader is an external source function that reads its argument, a value string source, and returns the text of that source converted from a UTF-16 encoding to a UTF-8 encoding. That is, the provided source is in UTF-16, but the program sees UTF-8.

Any malformed input data is read as the Unicode REPLACEMENT CHARACTER (U+FFFD). The only malformed case recognized is a surrogate pair of which only one half is found.

Input UTF-16 data is assumed by default to be big-endian, but leading and embedded Byte Order Marks (BOMs) in the data are recognized and acted upon. A leading BOM is removed from the input, but embedded ones are left in.
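The reading rules above can be illustrated with a short Python sketch. This is not the library's implementation and not OmniMark code. The function name utf16_to_utf8 is hypothetical, and the sketch assumes that "acted upon" means a byte-swapped BOM switches the byte order used for the data that follows. It also assumes an even number of input bytes, a case the description above does not cover.

```python
REPLACEMENT = 0xFFFD  # substituted for half of a surrogate pair found on its own


def utf16_to_utf8(data: bytes) -> bytes:
    """Convert UTF-16 bytes to UTF-8 following the rules described above."""
    big_endian = True      # default byte order when no BOM says otherwise
    code_points = []
    pending_high = None    # high surrogate waiting for its low partner

    # Assumption: a trailing odd byte, if any, is silently ignored.
    for index in range(0, len(data) - 1, 2):
        first, second = data[index], data[index + 1]
        unit = (first << 8) | second if big_endian else (second << 8) | first

        # Assumption: a byte-swapped BOM (read as 0xFFFE) flips the byte order.
        if unit == 0xFFFE:
            big_endian = not big_endian
            unit = 0xFEFF

        if pending_high is not None:
            if 0xDC00 <= unit <= 0xDFFF:
                # Complete surrogate pair: combine into one code point.
                code_points.append(
                    0x10000 + ((pending_high - 0xD800) << 10) + (unit - 0xDC00)
                )
                pending_high = None
                continue
            # The high surrogate had no partner.
            code_points.append(REPLACEMENT)
            pending_high = None

        if 0xD800 <= unit <= 0xDBFF:
            pending_high = unit
        elif 0xDC00 <= unit <= 0xDFFF:
            code_points.append(REPLACEMENT)  # low surrogate with no high half
        elif unit == 0xFEFF and index == 0:
            continue                         # leading BOM is removed
        else:
            code_points.append(unit)         # embedded BOMs are kept as U+FEFF

    if pending_high is not None:
        code_points.append(REPLACEMENT)      # input ended in mid-pair

    return "".join(map(chr, code_points)).encode("utf-8")
```

For example, under these assumptions both utf16_to_utf8(b"\xfe\xff\x00A") and utf16_to_utf8(b"\xff\xfeA\x00") yield the single byte "A".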

writer is an external output function that accepts UTF-8 encoded data and writes that data to a value string sink, its first argument, converted from a UTF-8 encoding to a UTF-16 encoding. That is, the program writes UTF-8, but the provided output receives UTF-16.

writer has two further switch-valued arguments, placed ahead of the output argument. true is used as a default value in both cases. The two arguments are:

bom: true if a Byte Order Mark (BOM) is to be written as the first character in the output. false if not.
big-endian: true if the output is to be written big-endian, false for little-endian. true is an appropriate default, especially in combination with a BOM, because big-endian is the Internet standard.

Any malformed output data is written as the Unicode REPLACEMENT CHARACTER (U+FFFD). The only malformed cases recognized are characters too large to be encodable as UTF-16 (i.e. larger than 0x10FFFF), and characters whose UTF-16 encodings would be the value of half of a surrogate pair.
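The effect of the bom and big-endian switches, and of the substitution rule, can be sketched in Python as follows. This is an illustration only, not the library's implementation. The function name encode_utf16 is hypothetical, and the sketch starts from already-decoded code points rather than UTF-8 bytes to stay short. It treats code points above 0x10FFFF, the largest value UTF-16 can represent, as unencodable.

```python
REPLACEMENT = 0xFFFD  # written in place of anything that cannot be encoded


def encode_utf16(code_points, bom=True, big_endian=True) -> bytes:
    """Encode a sequence of integer code points as UTF-16 bytes."""
    byte_order = "big" if big_endian else "little"

    # A BOM, if requested, is simply U+FEFF written first in the chosen order.
    units = [0xFEFF] if bom else []

    for cp in code_points:
        if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            # Half of a surrogate pair, or beyond the UTF-16 range.
            units.append(REPLACEMENT)
        elif cp > 0xFFFF:
            # Supplementary character: split into a surrogate pair.
            offset = cp - 0x10000
            units.append(0xD800 + (offset >> 10))
            units.append(0xDC00 + (offset & 0x3FF))
        else:
            units.append(cp)

    return b"".join(unit.to_bytes(2, byte_order) for unit in units)
```

With both defaults left at true, encode_utf16([0x41]) produces the bytes FE FF 00 41. With big_endian set to false it produces FF FE 41 00, which a BOM-aware reader can still interpret correctly.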

A good place to find information on the details of UTF-16 encoding is: <http://www.unicode.org/unicode/faq/utf_bom.html>

Usage Note

To use OMFFUTF16, you must import it into your program using a statement like this:

  import "omffutf16.xmd" prefixed by utf16.

(Please see the import topic for more on importing.)

Functions