Pattern matching
OmniMark allows you to search for particular strings in input data using find rules. For example, the following find rule will fire if the string Hamlet: is encountered in the input:
find "Hamlet:" output "<b>Hamlet</b>: "
Using this method, however, you would have to write a separate find rule for each character name you wanted to enclose in HTML bold tags. For example:
find "Hamlet:" output "<b>Hamlet</b>: "
find "Horatio:" output "<b>Horatio</b>: "
find "Bernardo:" output "<b>Bernardo</b>: "
This approach does not scale well, since there is much duplication involved.
This is where OmniMark patterns come in. OmniMark has rich, built-in, pattern-matching capabilities which allow
you to match strings by way of a more abstract model of a string rather than matching a specific string. For
example:
find letter+ ":"
This find rule will match any string that contains one or more letters followed immediately by a colon.
Unfortunately, the pattern described in this find rule isn't specific enough to match only character names. It will match any string of letters that is followed by a colon, anywhere in the text, meaning that words in the middle of sentences will also be matched.
Words that appear in the middle of sentences rarely begin with an uppercased letter, while names usually do.
This allows us to add further details to our find rule:
find uc letter+ ":"
This find rule matches any string that begins with an uppercase letter (uc) followed by at least one other letter (letter+) and a colon (":").
If we were actually trying to mark up an ASCII copy of Hamlet, however, our find rule would only match character names that contain a single word, such as Hamlet, Ophelia, or Horatio. Only the second part of two-part names would be matched, so names such as Queen Gertrude and Lord Polonius would be incorrectly marked up.
In order to match these more complex names as well as the single-word names, we'll have to further refine our
find rule:
find uc letter+ (white-space+ uc letter+)? ":"
In this version of the find rule, the pattern can match a second word prior to the colon. The sub-pattern (white-space+ uc letter+)? can match one or more white-space characters followed by an uppercase letter and one or more letters. Because of the ? occurrence indicator, the sub-pattern is optional. All of this allows the find rule to match character names that consist of one or two words.
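As a minimal sketch of how this rule might sit in a complete program, the following captures the matched name in a pattern variable so that it can be written back out inside bold tags. The sample lines passed to submit and the pattern variable name speaker are assumptions made for illustration and do not come from the text above.

```omnimark
process
   ; hypothetical sample input; "%n" is a newline
   submit "Hamlet: Madam, how like you this play?%n"
   submit "Queen Gertrude: The lady doth protest too much, methinks.%n"

; The parentheses group the whole one- or two-word name so that
; "=> speaker" captures all of it rather than only the last element.
find (uc letter+ (white-space+ uc letter+)?) => speaker ":"
   output "<b>" || speaker || "</b>:"
```

Input that no find rule matches is passed through to the output unchanged, so only the character names should end up wrapped in bold tags.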
If you wanted to match a series of three digits, you could use the following pattern:
find digit{3}
If you wanted to match either a four-digit or a five-digit number, you could use the following pattern:
find digit {4 to 5}
To match a date that occurs in the yy/mm/dd format, the following pattern could be used:
find digit {2} "/" digit {2} "/" digit {2}
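The date pattern becomes more useful once each part is captured. The sketch below assumes the pattern variable names yy, mm, and dd, a made-up input sentence, and an arbitrary output format, all chosen only for illustration.

```omnimark
process
   ; hypothetical sample input
   submit "First performed on 99/12/31 and again on 00/01/15.%n"

; each "=>" captures the two digits that immediately precede it
find digit{2} => yy "/" digit{2} => mm "/" digit{2} => dd
   output "[year " || yy || ", month " || mm || ", day " || dd || "]"
```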
A Canadian postal code could be matched with the following pattern:
find letter digit letter " " digit letter digit
The letter and uc keywords that are used to create the patterns shown above are called character classes. OmniMark provides a variety of these built-in character classes:
letter matches a single letter character, uppercase or lowercase,
uc matches a single uppercased letter,
lc matches a single lowercased letter,
digit matches a single digit (0-9),
space matches a single space character,
blank matches a single space or tab character,
white-space matches a single space, tab, or newline character,
any-text matches any single character except for a newline, and
any matches any single character.
Any pattern can be modified through the use of occurrence indicators:
+ (one or more),
* (zero or more),
? (zero or one),
** (zero or more upto), and
++ (one or more upto).
So, as shown in the find rules above, for example, letter+ matches one or more letters, letter* matches zero or more letters, and uc? matches zero or one uppercased letter.
You must use the identity operator to match an item on a shelf, or its key:
find ~foo[2]
find ~foo{"bar"}
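For context, here is a hedged sketch of a program in which such rules could appear. The declaration of foo as a keyed stream shelf, its two values, and the sample input are all assumptions for illustration; the text above only shows the two find rules.

```omnimark
; hypothetical shelf: two items, the first of which has the key "bar"
global stream foo variable initial {"Hamlet" with key "bar", "Horatio"}

process
   submit "Hamlet and Horatio"

; matches the text held in the item whose key is "bar"
find ~foo{"bar"}
   output "[keyed item]"

; matches the text held in the second item on the shelf
find ~foo[2]
   output "[second item]"
```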
The expressions any* and any+ are voracious. They will gobble up all the remaining input regardless of any other pattern that follows them. To contain their appetite, you can use the "upto" forms of these occurrence indicators, any ** and any ++. These forms match only up to the next pattern.
Thus to match everything between the words start and end, you could write a pattern:
find "start" any ** => middle "end"
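A short sketch of this rule in use, assuming a made-up input string and an output format chosen only to make the captured text visible:

```omnimark
process
   ; hypothetical sample input
   submit "before start the play is the thing end after"

; "any **" stops at the first place where "end" can match,
; and the text it consumed is captured in "middle"
find "start" any ** => middle "end"
   output "[" || middle || "]"
```

Here both start and end are consumed by the rule, so only the bracketed middle text should replace them in the output.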
There are two restrictions on the "upto" occurrence indicators:
You can apply the upto occurrence indicators to any character class, built-in or user-defined, but in practice they are most commonly used with any. Other possible applications include using them with any-text, which will match any characters up to the specified delimiter, as long as it occurs on the same line.
To match up to a delimiter without consuming that delimiter, use lookahead:
find "start" any ** => middle lookahead "end"
When matching up to a delimiter, ask yourself if the end of the data is an alternative delimiter. For instance, if you are separating values which are delimited by the sequence \\ and you write the pattern:
find any ++ => stuff "\\"
you will miss the last item in the sequence, because it is not followed by \\. To grab the last item, change the pattern to specify the end of the data as an alternate delimiter:
find any ++ => stuff ("\\" | value-end)
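A minimal sketch of this rule splitting a delimited string, assuming a made-up three-value input and one value per output line:

```omnimark
process
   ; hypothetical sample input: three values separated by \\
   submit "alpha\\beta\\gamma"

; the last value has no trailing \\, so value-end serves as its delimiter
find any ++ => stuff ("\\" | value-end)
   output stuff || "%n"
```

With the value-end alternative in place, the final value should be reported along with the first two.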
It is important to understand how the any ++ operator works. Two examples illustrate its properties.
First, consider the data {} and the pattern "{" any ++ "}". The pattern will not match the data because any ++ must match at least one character before the delimiter.
Second, consider the data OXX and the pattern "O" any ++ "X". The pattern will not match this data either. Although there is one character, the first "X", followed by the delimiter, the second "X", the pattern does not match because the any ++ finds its delimiter, the first "X", before it has consumed any data (as in the first example). It never looks at the second "X".
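The two cases can be tried directly with do scan, which tests a pattern against a single string. This is only a sketch: the messages are invented for illustration, and the results noted in the comments are those predicted by the explanation above.

```omnimark
process
   ; first example: no character lies between the braces
   do scan "{}"
      match "{" any ++ "}"
         output "first: matched%n"
      else
         output "first: no match%n"      ; the branch predicted above
   done

   ; second example: "any ++" meets its delimiter before consuming anything
   do scan "OXX"
      match "O" any ++ "X"
         output "second: matched%n"
      else
         output "second: no match%n"     ; the branch predicted above
   done
```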
any ++ is useful in those situations where you are certain that there is data before the delimiter character, or where you do not want to match at all if there is no data before the next delimiter. In choosing between ++ and ** you should also be aware of the properties of the following pattern:
find any ** => data lookahead ("\\" | value-end)
This pattern attempts to match data up to a delimiter, without consuming the delimiter. Since the delimiter is not consumed (because of the lookahead) and because the pattern can match zero characters as long as they are followed by \\ (because ** matches zero or more characters), this rule will probably fire twice. The first time it will consume data up to the delimiter \\. The second time it will be at the delimiter and will fire again (unless a previous rule matches \\). It will not fire a third time because OmniMark does not permit two consecutive zero-length pattern matches.
Rewriting the code with ++ solves the problem:
find any ++ => data lookahead ("\\" | value-end)
Other possible solutions include rewriting the pattern to allow it to consume either a leading or trailing
delimiter:
find any ** => data "\\"
find "\\" any ** => data lookahead ("\\" | value-end)
You can define your own character classes. For example:
find ["+-*/"] output "found an arithmetic operator%n"
This find rule would fire if any one of the four arithmetic operators was encountered in the input data.
Compound character classes can be created using except or |:
find [any except "}"]
The find rule above would match any character except for a right brace.
This find rule would match any one of the arithmetic operators or a single digit:
find ["+-*/" | digit]
This one would match any of the arithmetic operators or any digit except zero (0):
find ["+-*/" | digit except "0"]
A backslash (\) can be used as a shorthand for except, so the previous example can be written:
find ["+-*/" | digit \ "0"]
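As a sketch of user-defined character classes at work, the following tags operators and numbers in a small expression. The input string, the pattern variable names found-op and found-number, and the tag names are assumptions made for illustration.

```omnimark
process
   ; hypothetical sample input
   submit "12 / 4 + 7"

; a single arithmetic operator
find ["+-*/"] => found-op
   output "<op>" || found-op || "</op>"

; a number with no leading zero: one non-zero digit, then any further digits
find ([digit except "0"] digit*) => found-number
   output "<num>" || found-number || "</num>"
```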
The occurrence indicators ? and * allow a pattern to succeed if it is matched zero (or more) times. In effect, this means that these patterns always match, since the zero in zero or more really means that the pattern succeeds even if it is not found in the data.
This is very useful behavior when there is an optional element in a pattern. For example, this pattern matches
a currency amount in dollars whether or not cents are specified:
find "$" digit+ ("." digit{2})?
The sub-pattern ("." digit{2})? will match a cents amount like .34 if it exists, but if it does not, the sub-pattern succeeds anyway. The sub-pattern always matches; sometimes it matches zero characters.
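A brief sketch of the currency rule with the whole amount captured. The sample sentence and the pattern variable name amount are assumptions for illustration.

```omnimark
process
   ; hypothetical sample input: one amount without cents, one with
   submit "Tickets cost $12 in advance or $15.50 at the door%n"

; the parentheses group the whole amount so that "=> amount" captures
; the dollar sign, the dollars, and the optional cents together
find ("$" digit+ ("." digit{2})?) => amount
   output "[" || amount || "]"
```

Both amounts should be captured: for the first, the optional cents sub-pattern matches zero characters and the rest of the pattern still succeeds.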
Because a pattern can succeed while matching zero characters, a rule can fire without consuming any data:
find ("$" digit+ "." digit{2})?
The entire pattern above has a zero-or-one occurrence indicator. While it will match a currency value if one exists, it will also match zero characters at any point in the input. This means that it will fire whenever no previous pattern fires, no matter where it is in the data.
Since no data has been consumed, the pattern matching context has not changed and the rule would then fire again and again. However, OmniMark does not let this happen. OmniMark does not allow two consecutive zero-length pattern matches.
Once any pattern has matched zero characters, all rules in the current scan are prevented from matching zero characters until at least one character has been consumed. You can remove this restriction using the null pattern modifier.
Copyright © Stilo International plc, 1988-2010.