Character classes
In a pattern, you can specify literal text (or expressions that resolve to literal text) or you can specify a character class. A character class is a set of characters. A character in the input data will match a character class if it matches any one of the characters in the character class.
For example, the OmniMark built-in character class digit includes the characters "0", "1", "2", "3", "4", "5", "6", "7", "8", and "9". Given the input data "123ABC", the following pattern will match "1":
find digit
And the following pattern will match "123":
find digit+
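As a minimal sketch of how these two patterns behave in a complete program, the following submits the sample input to a find rule and brackets whatever the rule matches. The pattern variable name found-digits is an illustrative choice, not something OmniMark requires.

```omnimark
process
   ; scan the sample input with the find rule below
   submit "123ABC"

find digit+ => found-digits
   ; digit+ captures the whole run of digits in one match;
   ; with plain digit, the rule would fire once per digit instead
   output "[" || found-digits || "]"
```

Input that no find rule matches, here the letters "ABC", should pass through to the output unchanged.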
OmniMark provides the following predefined character classes:
letter -- matches a single letter character, uppercase or lowercase
uc -- matches a single uppercase letter
lc -- matches a single lowercase letter
digit -- matches a single digit (0-9)
space -- matches a single space character
blank -- matches a single space or tab character
white-space -- matches a single space, tab, or newline character
any-text -- matches any single character except for a newline
any -- matches any single character
Since the predefined character classes may not always meet your needs, OmniMark lets you define your own character classes. A programmer-defined character class is contained between square brackets. For example, the following pattern matches an arithmetic operator:
find ["+-*/"]
This character class consists of any of the characters in the string "+-*/". If your character class will contain many characters, you can include every character except those you specify by preceding the string of characters with the "except" operator "\". For example, the following pattern matches any character except the XML markup characters "<", "&", and ">":
find [\"<&>"]
You can also specify a character class by adding characters to or subtracting characters from a built-in character class. To add characters, you join character classes and strings with the "or" operator "|". For example, the following pattern matches any hexadecimal digit:
find [digit | "AaBbCcDdEeFf"]
To subtract characters, you use the "except" operator "\". For example, the following pattern matches any octal digit:
find [digit \ "89"]
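The subtraction case can be tried in a small program such as the sketch below. The sample input and the pattern variable name octal-run are assumptions made for illustration.

```omnimark
process
   submit "mode 755, value 089"

find [digit \ "89"]+ => octal-run
   ; only the digits 0 through 7 belong to this class, so a run
   ; of octal digits stops at the first "8" or "9"
   output "<" || octal-run || ">"
```

With this input, "755" should be bracketed as a single run, while in "089" only the leading "0" belongs to the class. Replacing the class with the hexadecimal one shown earlier exercises the "or" operator in the same way.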
You can also use the "or" operator to join two or more built-in character classes, as in this pattern that matches any alpha-numeric character:
find [letter | digit]
Note that while you can use the "or" operator as many times as you like, you can only use the except operator once in a character class. Thus this pattern is not valid:
find [letter \ "xyz" | digit \ "7"]
You must rewrite it as follows:
find [letter | digit \ "xyz7"]
You can also specify ranges of characters using to. For example, the following code fragment matches any character between the lowercase letters "a" and "m":
find ["a" to "m"]
You can combine ranges with other items in a character class, or exclude them from those items, and the other items can themselves be ranges. For example, the following pattern matches any character between the lowercase letters "a" and "z" as well as the characters ".", ",", or "?"; it does not match the lowercase letters between "i" and "n" or the lowercase letter "t":
find ["a" to "z" | ".,?" \ "i" to "n" | "t"]
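To see which characters a combined class like this accepts, one option is to bracket each matched run in a test program. This is only a sketch: the sample input and the pattern variable name kept are hypothetical.

```omnimark
process
   submit "sit, then wait."

find ["a" to "z" | ".,?" \ "i" to "n" | "t"]+ => kept
   ; everything listed before "\" is included in the class;
   ; everything listed after "\" is excluded from it
   output "(" || kept || ")"
```

Excluded letters such as "i", "n", and "t" should appear in the output outside the parentheses, because no rule matches them and they pass through unchanged.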
Take care when using character set ranges because the letters of the alphabet are not always contiguous in a character set. In the EBCDIC character encoding, for example, there are non-alphabetic characters between "A" and "Z".
Don't confuse a character class with a pattern. If you want to match any number of characters up to the first colon you can write either:
find [\ ":"]*
or
find any** lookahead ":"
But if you need to match any number of characters up to a multi-character delimiter such as "</price>", you must write:
find any** lookahead "</price>"
and not
find [\ "</price>"]*
The latter will match any number of characters up to the first "<", "/", "p", "r", "i", "c", "e", or ">" character, not any number of characters up to the string "</price>".
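The multi-character case might look like the following in a complete program. This is a sketch under assumed input; the pattern variable name amount is illustrative.

```omnimark
process
   submit "<price>19.95</price>"

find "<price>" any** => amount lookahead "</price>"
   ; any** consumes as little as possible before the delimiter;
   ; lookahead tests for "</price>" without consuming it
   output "amount: " || amount || "%n"
```

Because lookahead does not consume the delimiter, the closing tag remains in the input after the rule fires. Either match it with a further rule or drop the lookahead keyword so that the pattern consumes the tag itself.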
The word except is a deprecated synonym for the "except" operator "\".
In previous versions of OmniMark, the keyword any was required before the "except" operator in creating an "any except" character class. Thus the character class [\ "aeiou"] would be written [any except "aeiou"]. The form [any \ "aeiou"] is still permitted and is identical in meaning to [\ "aeiou"].
Prerequisite Concepts: Pattern matching