UTF-8 (OMUTF8)

This library is used to process data files that contain UTF-8 encoded data.

The function utf8.char matches any single UTF-8 character in the data. The function utf8.single-byte-char matches only single-byte (ASCII) characters, whereas the function utf8.multi-byte-char matches only multi-byte characters, that is, characters encoded as two or more bytes.
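
The distinction between single-byte and multi-byte characters comes from the UTF-8 encoding itself: the first byte of each character determines how many bytes the character occupies. The following Python sketch illustrates that rule only. It is not part of OMUTF8, and the names sequence_length and split_chars are hypothetical.

```python
def sequence_length(lead: int) -> int:
    """Return the length in bytes of the UTF-8 sequence starting with `lead`."""
    if lead < 0x80:
        return 1              # single-byte (ASCII) character
    if 0xC2 <= lead <= 0xDF:
        return 2              # multi-byte: two bytes
    if 0xE0 <= lead <= 0xEF:
        return 3              # multi-byte: three bytes
    if 0xF0 <= lead <= 0xF4:
        return 4              # multi-byte: four bytes
    raise ValueError(f"invalid UTF-8 lead byte: {lead:#04x}")


def split_chars(data: bytes) -> list[bytes]:
    """Split UTF-8 data into one byte sequence per character."""
    chars = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunk = data[i:i + n]
        # Every byte after the lead byte must be a continuation byte (10xxxxxx).
        if len(chunk) != n or any((b & 0xC0) != 0x80 for b in chunk[1:]):
            raise ValueError(f"malformed UTF-8 sequence at offset {i}")
        chars.append(chunk)
        i += n
    return chars
```

In this sketch, a chunk of length 1 corresponds to what utf8.single-byte-char would match, and a longer chunk to what utf8.multi-byte-char would match. The check is deliberately minimal and does not reject every ill-formed sequence (overlong three- and four-byte forms, for example).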

The function utf8.code-point converts a UTF-8 character (that is, the sequence of bytes that represents a character in UTF-8) to its numeric character value (its Unicode code point), while the function utf8.encoding converts a numeric character value to UTF-8 (that is, to the sequence of bytes that represents that character value in UTF-8).
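
To make the two conversions concrete, here is a minimal Python sketch of the bit manipulation involved. It illustrates the UTF-8 encoding scheme, not the library's implementation. The Python functions code_point and encoding are hypothetical stand-ins named after the library functions, and they assume well-formed input.

```python
def code_point(char: bytes) -> int:
    """Convert the bytes of one UTF-8 character to its numeric value."""
    lead = char[0]
    if lead < 0x80:
        return lead                      # single byte: the value is the byte
    if lead >> 5 == 0b110:
        value = lead & 0x1F              # two-byte form: 5 payload bits
    elif lead >> 4 == 0b1110:
        value = lead & 0x0F              # three-byte form: 4 payload bits
    elif lead >> 3 == 0b11110:
        value = lead & 0x07              # four-byte form: 3 payload bits
    else:
        raise ValueError("invalid UTF-8 lead byte")
    for b in char[1:]:
        value = (value << 6) | (b & 0x3F)  # each continuation byte adds 6 bits
    return value


def encoding(cp: int) -> bytes:
    """Convert a numeric character value to its UTF-8 byte sequence."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("value is outside the Unicode range")
```

For instance, the bytes 195 and 169 carry the payload bits 00011 and 101001, which combine to the value 233, and encoding that value again yields the same two bytes.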

Example

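The following program shows how utf8.single-byte-char and utf8.multi-byte-char can be used to pattern-match UTF-8 encoded data, and how utf8.code-point can be used to convert the captured bytes to their numeric character value. ASCII characters are copied to the output unchanged. Each multi-byte character is written as a single byte if its value is 255 or less, and as a hexadecimal character reference otherwise.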

  import "omutf8.xmd" prefixed by utf8.
  
  process
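     ; "%195#%169#" is the two-byte UTF-8 encoding of the character with value 233.
     repeat scan "flamb%195#%169#"
     ; Runs of single-byte (ASCII) characters are copied through unchanged.
     match utf8.single-byte-char+ => c
        output c
  
     ; Each multi-byte character is converted to its numeric character value.
     match utf8.multi-byte-char => c
        local integer n initial { utf8.code-point of c }
  
        ; Values above 255 are written as hexadecimal character references.
        do when n > 255
           output "&#x" || "16rud" % n || ";"
  
        ; Smaller values are written directly as a byte value.
        else
           output "b" % n
        done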
     again
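
For readers who want to check the expected result outside OmniMark, the transformation performed by the program above can be approximated in Python. This is a sketch under assumptions: the function name to_single_byte_output is hypothetical, Python's built-in decoder stands in for the utf8 pattern functions, and the "b" format is assumed to emit the value as one byte.

```python
def to_single_byte_output(data: bytes) -> bytes:
    """Copy ASCII through, emit values up to 255 as one byte,
    and emit larger values as hexadecimal character references."""
    out = bytearray()
    for ch in data.decode("utf-8"):
        n = ord(ch)                      # numeric character value
        if n > 255:
            # Uppercase hexadecimal, mirroring the "16rud" format above.
            out += f"&#x{n:X};".encode("ascii")
        else:
            out.append(n)                # fits in a single byte
    return bytes(out)


result = to_single_byte_output(b"flamb\xc3\xa9")
```

With the input used in the program, result holds the five ASCII bytes of "flamb" followed by the single byte 233.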