Character classes

In a pattern, you can specify literal text (or expressions that resolve to literal text) or you can specify a character class. A character class is a set of characters. A character in the input data will match a character class if it matches any one of the characters in the character class.

For example, the OmniMark built-in character class digit includes the characters "0", "1", "2", "3", "4", "5", "6", "7", "8", and "9". Given the input data "123ABC", the following pattern will match "1":

  find digit

And the following pattern will match "123":

  find digit+
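
For readers who know regular expressions better than OmniMark patterns, the difference between digit and digit+ can be sketched in Python. This is only a rough analogue, not OmniMark code: it assumes Python's re module, and it assumes the regex class [0-9] is an acceptable stand-in for the digit class.

```python
import re

data = "123ABC"

# Analogue of "find digit": exactly one character from the class.
single = re.match(r"[0-9]", data).group()

# Analogue of "find digit+": one or more characters from the class.
run = re.match(r"[0-9]+", data).group()

print(single)  # 1
print(run)     # 123
```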

OmniMark provides the following predefined character classes:

letter -- matches a single letter character, uppercase or lowercase
uc -- matches a single uppercase letter
lc -- matches a single lowercase letter
digit -- matches a single digit (0-9)
space -- matches a single space character
blank -- matches a single space or tab character
white-space -- matches a single space, tab, or newline character
any-text -- matches any single character except for a newline
any -- matches any single character

Since the predefined character classes may not always meet your needs, OmniMark lets you define your own character classes. A programmer-defined character class is contained between square brackets. For example, the following pattern matches an arithmetic operator:

  find ["+-*/"]

This character class consists of any of the characters in the string "+-*/". If a class would otherwise have to list most characters, it is easier to name the characters to leave out: precede the string with the "except" operator \, and the class matches every character except those in the string. For example, the following pattern matches any character except the XML markup characters "<", "&", and ">":

  find [\"<&>"]

You can also specify a character class by adding characters to, or subtracting characters from, a built-in character class. To add characters, you join character classes and strings with the "or" operator |. For example, the following pattern matches any hexadecimal digit:

  find [digit | "AaBbCcDdEeFf"]

To subtract characters, you use the "except" operator "\". For example, the following pattern matches any octal digit:

  find [digit \ "89"]

You can also use the "or" operator to join two or more built-in character classes, as in this pattern that matches any alpha-numeric character:

  find [letter | digit]
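
One way to picture the "or" and "except" operators is as set union and set difference. The following Python sketch models character classes as sets of characters. It is an illustration rather than OmniMark code, and it assumes that letter and digit cover only the ASCII letters and digits.

```python
import string

# Assumed stand-ins for the built-in classes (ASCII only).
digit = set(string.digits)
letter = set(string.ascii_letters)

hex_digit = digit | set("AaBbCcDdEeFf")  # models [digit | "AaBbCcDdEeFf"]
octal_digit = digit - set("89")          # models [digit \ "89"]
alphanumeric = letter | digit            # models [letter | digit]

print(sorted(octal_digit))
print("b" in hex_digit, "g" in hex_digit)
```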

Note that while you can use the "or" operator as many times as you like, you can only use the except operator once in a character class. Thus this pattern is not valid:

  find [letter \ "xyz" | digit \ "7"]

You must rewrite it as follows:

  find [letter | digit \ "xyz7"]
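
To check that the rewritten class matches the same characters as the one the invalid pattern was trying to express, the two can be compared as sets. This Python sketch is an illustration under the assumption that letter and digit cover only the ASCII letters and digits. It is not OmniMark code.

```python
import string

# Assumed stand-ins for the built-in classes (ASCII only).
digit = set(string.digits)
letter = set(string.ascii_letters)

# What the invalid form [letter \ "xyz" | digit \ "7"] is trying to express.
intended = (letter - set("xyz")) | (digit - set("7"))

# The valid form [letter | digit \ "xyz7"]: one union, then one exclusion.
rewritten = (letter | digit) - set("xyz7")

print(intended == rewritten)  # True
```

The two sets coincide because none of "x", "y", or "z" is a digit and "7" is not a letter. If an excluded character also belonged to another class in the union, merging the exclusions this way would remove it from the whole class.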

You can also specify ranges of characters using the keyword to. For example, the following pattern matches any character in the range from the lowercase letter "a" to the lowercase letter "m":

  find ["a" to "m"]

You can combine ranges with the other components of a character class, or exclude them from those components, and the other components may themselves be ranges. For example, the following pattern matches any character between the lowercase letters "a" and "z" as well as the characters ".", ",", or "?"; it does not match the lowercase letters between "i" and "n" or the lowercase letter "t":

  find ["a" to "z" | ".,?" \ "i" to "n" | "t"]
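
The effect of combining and excluding ranges can be worked out by hand with sets. In this Python sketch, char_range is a hypothetical helper introduced only for illustration. The sketch assumes that a range includes both of its end points and is ordered by character code.

```python
def char_range(first, last):
    """Return the characters from first to last, inclusive, by code point."""
    return {chr(c) for c in range(ord(first), ord(last) + 1)}

# Everything to the left of the "except" operator.
included = char_range("a", "z") | set(".,?")

# Everything to the right of the "except" operator.
excluded = char_range("i", "n") | set("t")

char_class = included - excluded

print("".join(sorted(char_class)))  # ,.?abcdefghopqrsuvwxyz
```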

Take care when using character ranges, because the letters of the alphabet are not always contiguous in a character encoding. In the EBCDIC character encoding, for example, there are non-alphabetic characters between "A" and "Z".
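
The gaps in EBCDIC can be seen by listing every character whose code lies between the codes for "A" and "Z". This Python sketch uses the standard cp037 codec, which is one common EBCDIC code page. The choice of code page is an assumption made for illustration, and other EBCDIC variants differ in detail.

```python
# Byte values of "A" and "Z" in EBCDIC code page 037.
first = "A".encode("cp037")[0]
last = "Z".encode("cp037")[0]

# Decode every byte value in that range back to a character.
in_range = [bytes([b]).decode("cp037") for b in range(first, last + 1)]

# Anything that is not one of the 26 uppercase letters sits in a gap.
gaps = [ch for ch in in_range if not ("A" <= ch <= "Z")]

print(len(in_range), len(gaps))
print("}" in gaps, "\\" in gaps)
```

A range written as "A" to "Z" over such an encoding would therefore cover more than the 26 uppercase letters, including "}" and "\".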

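Don't confuse a character class with a pattern. A character class always matches a single character from a set, never a sequence of characters. If you want to match any number of characters up to the first colon, you can write either:

  find [\ ":"]*

or

  find any** lookahead ":"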

But if you need to match any number of characters up to a multi-character delimiter such as "</price>", you must write:

  find any** lookahead "</price>"

and not

  find [\ "</price>"]*

The latter will match any number of characters up to the first "<", "/", "p", "r", "i", "c", "e", or ">" character, not any number of characters up to the string "</price>".
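
The same pitfall exists in regular expressions, so it can be demonstrated in Python. This is a rough analogue and not OmniMark code: it assumes that a negated regex class stands in for the "except" class and that a lazy match with a lookahead stands in for any** lookahead. The input string is made up for the example.

```python
import re

data = "12.50 per piece</price>"

# Analogue of [\ "</price>"]*: a negated class built from the
# delimiter's individual characters.
by_class = re.match(r"[^</price>]*", data).group()

# Analogue of any** lookahead "</price>": match lazily up to the
# whole delimiter string.
by_lookahead = re.match(r".*?(?=</price>)", data, re.DOTALL).group()

print(repr(by_class))      # '12.50 '
print(repr(by_lookahead))  # '12.50 per piece'
```

The class-based version stops at the "p" of "per", because "p" is one of the excluded characters.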

Deprecated syntax

The word except is a deprecated synonym for the "except" operator \.

In previous versions of OmniMark, the keyword any was required before the "except" operator in creating an "any except" character class. Thus the character class [\ "aeiou"] would be written [any except "aeiou"]. The form [any \ "aeiou"] is still permitted and is identical in meaning to [\ "aeiou"].

Prerequisite Concepts
Pattern matching

OmniMark 8.2.0 Documentation Generated: March 13, 2008 at 3:25:49 pm