The OMFFUTF32 library provides functions for converting a UTF-32 encoded stream to one encoded using UTF-8, and back. Details of the UTF-32 encoding of Unicode can be found in Sections 3.9 and 3.10 of the Unicode Standard. The UTF-32 standard specifies two byte orderings: big-endian (UTF-32BE) and little-endian (UTF-32LE).
The OMFFUTF32 library supports both of these orderings, as well as two others, which correspond to individual bytes being swapped within each half of a single value: UTF-32BE(2143) and UTF-32LE(3412).
Use of these two orderings could lead to outputs which do not conform to the UTF-32 specification.
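The four orderings can be illustrated with a short Python sketch. This is a model of the byte layouts only, not part of OMFFUTF32; the function name `encode_code_point` and the `ORDERINGS` table are hypothetical, and the reading of the labels 1234, 2143, 3412 and 4321 (digit 1 being the most significant byte) is an assumption chosen to agree with the byte-order-marks the library lists.

```python
# Illustrative model only; these names are not part of OMFFUTF32.
# Each tuple gives, for every output position, the index of the
# big-endian byte placed there (index 0 = most significant byte).
ORDERINGS = {
    "UTF-32BE": (0, 1, 2, 3),        # 1234
    "UTF-32BE(2143)": (1, 0, 3, 2),  # 2143: bytes swapped within each half
    "UTF-32LE(3412)": (2, 3, 0, 1),  # 3412: halves swapped
    "UTF-32LE": (3, 2, 1, 0),        # 4321
}


def encode_code_point(code_point, ordering):
    """Return the four bytes of one code point in the given ordering."""
    big_endian = code_point.to_bytes(4, "big")
    return bytes(big_endian[index] for index in ORDERINGS[ordering])


if __name__ == "__main__":
    # Show how U+FEFF (the byte-order-mark character) is laid out.
    for name in ORDERINGS:
        print(name, encode_code_point(0xFEFF, name).hex(" "))
```

Encoding U+FEFF in each ordering yields the four byte-order-mark sequences, which is why a byte-order-mark is enough to tell the orderings apart.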
The following example takes a UTF-32 encoded stream and converts it to UTF-8 before streaming it to the XML parser for further processing. It then converts the results to UTF-32LE for output.
    import "omffutf32.xmd" prefixed by utf32.

    process
       using output as utf32.writer in utf32.encoding-utf-32le into #main-output
       do xml-parse scan utf32.reader in utf32.encoding-utf-32be from #main-input
          output "%c"
       done

    ; ...
As the example demonstrates, the conversions performed by OMFFUTF32 are configured by passing a constant to the utf32.reader and utf32.writer functions. One constant is defined for each of the orderings supported:
    encoding-utf-32be: UTF-32BE,
    encoding-utf-32le: UTF-32LE,
    encoding-utf-32be-2143: UTF-32BE(2143), and
    encoding-utf-32le-3412: UTF-32LE(3412).
The utf32.reader function can determine the byte ordering from the input:
    if an encoding is specified to the utf32.reader function, it is used, otherwise
    if the input to the utf32.reader function begins with a byte-order-mark, the byte-order-mark is used to determine the byte ordering, otherwise
    the input is assumed to be UTF-32BE.
This conforms to Paragraph D101 of the UTF-32 specification. In addition, utf32.reader will always discard a byte-order-mark that appears at the beginning of its input if it is the byte-order-mark corresponding to the encoding being processed, regardless of how the encoding was selected.
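The selection and byte-order-mark handling described above can be sketched in Python. This is a hypothetical model, not the library's implementation: the name `select_ordering` is invented for illustration, the byte-order-mark values are those listed in the library's byte-order-marks shelf, and the fallback to UTF-32BE when nothing else decides the ordering is an assumption based on the Unicode Standard's default for unmarked UTF-32.

```python
# Illustrative model only; these names are not part of OMFFUTF32.
BYTE_ORDER_MARKS = {
    "UTF-32BE": bytes([0x00, 0x00, 0xFE, 0xFF]),
    "UTF-32BE(2143)": bytes([0x00, 0x00, 0xFF, 0xFE]),
    "UTF-32LE(3412)": bytes([0xFE, 0xFF, 0x00, 0x00]),
    "UTF-32LE": bytes([0xFF, 0xFE, 0x00, 0x00]),
}


def select_ordering(data, requested=None):
    """Return (ordering, payload) for a UTF-32 byte string.

    An explicitly requested ordering wins; otherwise a leading
    byte-order-mark decides; otherwise UTF-32BE is assumed
    (assumption: the Unicode default for unmarked UTF-32).
    """
    if requested is not None:
        ordering = requested
    else:
        ordering = "UTF-32BE"
        for name, mark in BYTE_ORDER_MARKS.items():
            if data.startswith(mark):
                ordering = name
                break
    # Discard a leading mark only when it matches the ordering in use,
    # however that ordering was chosen.
    mark = BYTE_ORDER_MARKS[ordering]
    payload = data[len(mark):] if data.startswith(mark) else data
    return ordering, payload
```

Note that a byte-order-mark belonging to a different ordering than the one requested is left in place in this sketch; how the library treats that case is not stated in this documentation.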
The OMFFUTF32 library exports a shelf of constants that corresponds to the byte-order-marks of all supported orderings:
    export constant string byte-order-marks initial {
       "%16r{00,00,FE,FF}" with key "UTF-32BE",       ; 1234
       "%16r{00,00,FF,FE}" with key "UTF-32BE(2143)", ; 2143
       "%16r{FE,FF,00,00}" with key "UTF-32LE(3412)", ; 3412
       "%16r{FF,FE,00,00}" with key "UTF-32LE"        ; 4321
    }
The first and last of these correspond to the byte-order-marks defined by the Unicode Consortium; the other two are the logical extensions to the two additional byte orderings supported by OMFFUTF32.
If at any time either of the functions utf32.reader or utf32.writer encounters invalid input, the catch utf32.invalid-input is thrown:
export catch invalid-input
For example, utf32.reader will throw utf32.invalid-input if the length of the input is not a multiple of four, since such input cannot represent a valid UTF-32 encoded stream. Similarly, utf32.writer will throw utf32.invalid-input if the input is not valid UTF-8.
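The two failure cases can be modelled in Python as follows. This is a sketch under stated assumptions, not the library's code: `InvalidInput` is a hypothetical stand-in for utf32.invalid-input, the function names are invented, and Python's built-in codecs stand in for the conversions, so the exact set of inputs rejected may differ from OMFFUTF32.

```python
# Illustrative model only; these names are not part of OMFFUTF32.
class InvalidInput(Exception):
    """Hypothetical stand-in for the utf32.invalid-input catch."""


def utf32be_to_utf8(data):
    """Model of the reader side: UTF-32BE bytes in, UTF-8 bytes out."""
    if len(data) % 4 != 0:
        # A UTF-32 stream is a sequence of whole four-byte values.
        raise InvalidInput("length is not a multiple of four")
    try:
        return data.decode("utf-32-be").encode("utf-8")
    except UnicodeDecodeError as error:
        raise InvalidInput(str(error)) from error


def utf8_to_utf32le(data):
    """Model of the writer side: UTF-8 bytes in, UTF-32LE bytes out."""
    try:
        return data.decode("utf-8").encode("utf-32-le")
    except UnicodeDecodeError as error:
        raise InvalidInput(str(error)) from error
```

A caller would wrap either conversion in a `try`/`except InvalidInput` block, in the same way that an OmniMark program would supply a catch for utf32.invalid-input.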