UTF-8 (OMUTF8)

The UTF-8 encoding library is used to process data files that contain UTF-8-encoded text.

The function utf8.char matches a UTF-8 character in the data. The function utf8.single-byte-char matches only characters encoded in a single byte (the ASCII range), whereas the function utf8.multi-byte-char matches only characters encoded as a sequence of two or more bytes.

The function utf8.code-point converts a UTF-8 character (that is, the sequence of bytes that represents a character in UTF-8) to its numeric character value (its code point). The function utf8.encoding does the reverse: it converts a numeric character value to the sequence of bytes that represents that value in UTF-8.
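
As a rough sketch of the round trip between the two conversions: the program below takes the two bytes 195 and 169 (the UTF-8 encoding of U+00E9, decimal 233), converts them to a character value, prints that value in decimal, and then converts it back to UTF-8. The call form "utf8.encoding of n" is an assumption modelled on "utf8.code-point of c"; check the function's own reference page for its exact signature.

  import "omutf8.xmd" prefixed by utf8.

  process
     ; Bytes 195 and 169 are the UTF-8 encoding of U+00E9.
     local integer n initial { utf8.code-point of "%195#%169#" }

     ; Print the character value in decimal.
     output "d" % n || "%n"

     ; Assumed call form: convert the value back to its UTF-8 byte sequence.
     output utf8.encoding of n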

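Example: the following program copies single-byte characters to the output unchanged. Each multi-byte character is written as a single byte if its value is 255 or less, and as a hexadecimal character reference otherwise.
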
  import "omutf8.xmd" prefixed by utf8.
  
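  process
     ; The input is "flamb" followed by bytes 195 and 169, the UTF-8 encoding of U+00E9.
     repeat scan "flamb%195#%169#"
     ; Copy runs of single-byte (ASCII) characters through unchanged.
     match utf8.single-byte-char+ => c
        output c
  
     match utf8.multi-byte-char => c
        local integer n initial { utf8.code-point of c }
  
        ; Values above 255 do not fit in one byte: write a hexadecimal character reference.
        do when n > 255
           output "&#x" || "16rud" % n || ";"
  
        ; Otherwise write the value as a single byte.
        else
           output "b" % n
        done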
     again