UTF-32 (OMFFUTF32)

The OMFFUTF32 library provides functions for converting a UTF-32 encoded stream to one encoded using UTF-8, and back. Details of the UTF-32 encoding of Unicode can be found in Sections 3.9 and 3.10 of the Unicode Standard. The UTF-32 standard specifies two byte orderings:

  • UTF-32BE, and
  • UTF-32LE.

The OMFFUTF32 library supports both of these orderings, as well as two others, in which the two bytes within each 16-bit half of a 32-bit value are swapped:

  • UTF-32BE(2143), and
  • UTF-32LE(3412).

Neither of these two additional orderings is defined by the Unicode Standard, so output written in either of them does not conform to the UTF-32 specification.
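
To make the digit labels concrete, the following sketch shows how one code point would be laid out under each of the four orderings. It is written in Python rather than OmniMark and is not part of OMFFUTF32. It assumes that each digit names a byte of the big-endian form, with 1 being the most significant byte; this reading is consistent with the byte-order-marks listed later in this section. The names ORDERINGS and encode_code_point are hypothetical.

```python
# Illustration only: Python, not OmniMark, and not part of OMFFUTF32.
# Assumption: digit k in an ordering label means "the k-th byte of the
# big-endian form", with 1 being the most significant byte.
ORDERINGS = {
    "UTF-32BE": (1, 2, 3, 4),
    "UTF-32BE(2143)": (2, 1, 4, 3),
    "UTF-32LE(3412)": (3, 4, 1, 2),
    "UTF-32LE": (4, 3, 2, 1),
}


def encode_code_point(code_point: int, ordering: str) -> bytes:
    """Serialize one code point as four bytes in the given ordering."""
    big_endian = code_point.to_bytes(4, "big")
    return bytes(big_endian[digit - 1] for digit in ORDERINGS[ordering])


# Show the four layouts of U+1F600 side by side.
for name in ORDERINGS:
    print(name, encode_code_point(0x1F600, name).hex())
```

Under this reading, UTF-32BE(2143) is UTF-32BE with the bytes of each half swapped, and UTF-32LE(3412) is UTF-32LE with the bytes of each half swapped.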

The following example takes a UTF-32 encoded stream and converts it to UTF-8 before streaming it to the XML parser for further processing. It then converts the results to UTF-32LE for output.

  import "omffutf32.xmd" prefixed by utf32.
  
  
  process
     using output as utf32.writer in utf32.encoding-utf-32le into #main-output
     do xml-parse scan utf32.reader in utf32.encoding-utf-32be from #main-input
        output "%c"
     done
  
  ; ...

As the example demonstrates, the conversions performed by OMFFUTF32 are configured by passing a constant to the utf32.reader and utf32.writer functions. One constant is defined for each supported ordering:

  • encoding-utf-32be: UTF-32BE,
  • encoding-utf-32le: UTF-32LE,
  • encoding-utf-32be-2143: UTF-32BE(2143), and
  • encoding-utf-32le-3412: UTF-32LE(3412).

The utf32.reader function can also determine the byte ordering from the input. The ordering is chosen as follows:

  • if an ordering is provided to the utf32.reader function, it is used, otherwise
  • if the input being read by the utf32.reader function begins with a byte-order-mark, the byte-order-mark is used to determine the byte ordering, otherwise
  • the input is assumed to be UTF-32BE.

This conforms to Paragraph D101 of the UTF-32 specification. In addition, utf32.reader will always discard a byte-order-mark that appears at the beginning of its input if it is the byte-order-mark corresponding to the encoding being processed, regardless of how the encoding was selected.
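
The selection rules above, together with the handling of a leading byte-order-mark, can be sketched as follows. This is a Python illustration of the documented behaviour, not the library's implementation. The names BYTE_ORDER_MARKS and select_ordering are hypothetical, and the sketch assumes that automatic detection considers the byte-order-marks of all four supported orderings, which the text does not state explicitly.

```python
# Illustration only: Python, not OmniMark, and not the library's implementation.
from typing import Optional, Tuple

# Byte-order-marks of the four supported orderings.
BYTE_ORDER_MARKS = {
    "UTF-32BE": bytes.fromhex("0000feff"),
    "UTF-32BE(2143)": bytes.fromhex("0000fffe"),
    "UTF-32LE(3412)": bytes.fromhex("feff0000"),
    "UTF-32LE": bytes.fromhex("fffe0000"),
}


def select_ordering(data: bytes, requested: Optional[str] = None) -> Tuple[str, bytes]:
    """Return the ordering to use and the input with a matching mark removed."""
    ordering = requested
    if ordering is None:
        # No explicit ordering: look for a leading byte-order-mark.
        # Assumption: all four supported marks are recognised here.
        for name, mark in BYTE_ORDER_MARKS.items():
            if data.startswith(mark):
                ordering = name
                break
        else:
            # No mark found: fall back to the documented default.
            ordering = "UTF-32BE"
    # A leading mark is dropped only if it belongs to the selected ordering.
    mark = BYTE_ORDER_MARKS[ordering]
    if data.startswith(mark):
        data = data[len(mark):]
    return ordering, data
```

In this sketch, a mark that does not match an explicitly requested ordering is left in place, since the text only promises to discard the mark corresponding to the encoding being processed.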

The OMFFUTF32 library exports a shelf of constants that corresponds to the byte-order-marks of all supported orderings:

  export constant string byte-order-marks initial { "%16r{00,00,FE,FF}" with key "UTF-32BE",        ; 1234
                                                    "%16r{00,00,FF,FE}" with key "UTF-32BE(2143)",  ; 2143
                                                    "%16r{FE,FF,00,00}" with key "UTF-32LE(3412)",  ; 3412
                                                    "%16r{FF,FE,00,00}" with key "UTF-32LE" }       ; 4321

The first and last of these correspond to the byte-order-marks defined by the Unicode Consortium; the other two are the logical extensions to the two additional byte orderings supported by OMFFUTF32.

If either utf32.reader or utf32.writer encounters invalid input at any time, it throws utf32.invalid-input:

  export catch invalid-input

For example, utf32.reader will throw utf32.invalid-input if the length of the input is not a multiple of four, since such input cannot be a valid UTF-32 encoded stream. Similarly, utf32.writer will throw utf32.invalid-input if its input is not valid UTF-8.
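
The two checks just described can be sketched in Python as follows. This is an illustration, not OMFFUTF32 itself: the class InvalidInput is a hypothetical stand-in for utf32.invalid-input, and utf32be_to_utf8 and utf8_to_utf32le are hypothetical names. The sketch relies on Python's built-in codecs, which may reject some inputs for reasons beyond the two named above.

```python
# Illustration only: Python, not OmniMark, and not the library's implementation.
class InvalidInput(ValueError):
    """Hypothetical stand-in for the utf32.invalid-input catch."""


def utf32be_to_utf8(data: bytes) -> bytes:
    """Convert UTF-32BE bytes to UTF-8, as a reader would."""
    # A UTF-32 stream is a sequence of four-byte values.
    if len(data) % 4 != 0:
        raise InvalidInput("length is not a multiple of four")
    try:
        return data.decode("utf-32-be").encode("utf-8")
    except UnicodeDecodeError as error:
        raise InvalidInput(str(error)) from error


def utf8_to_utf32le(data: bytes) -> bytes:
    """Convert UTF-8 bytes to UTF-32LE, as a writer would."""
    try:
        return data.decode("utf-8").encode("utf-32-le")
    except UnicodeDecodeError as error:
        raise InvalidInput(str(error)) from error
```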