Just as you can use string source and string sink to stream character data through a series of text filters, you can use markup source and markup sink to stream parsed markup data through a series of markup filters, with no intermediate buffering.
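The streaming, buffer-free character of such a filter chain can be sketched in Python using generators. This is only an analogy, not OmniMark; all names below are invented for illustration.

```python
# A rough analogy (Python, not OmniMark): streaming filters chained as
# generators, so characters flow through the chain one at a time with no
# intermediate buffering.

def compress_whitespace(chars):
    """Text filter: collapse runs of whitespace into single spaces."""
    pending_space = False
    for c in chars:
        if c.isspace():
            pending_space = True
        else:
            if pending_space:
                yield " "
                pending_space = False
            yield c

def upcase(chars):
    """A second text filter in the same chain."""
    for c in chars:
        yield c.upper()

# Filters compose end to end, like OmniMark string source filters:
result = "".join(upcase(compress_whitespace("hello   streaming\n world")))
print(result)  # HELLO STREAMING WORLD
```

Each generator pulls characters from the previous stage on demand, so no stage ever holds the whole stream in memory.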
The starting point of a chain of markup filters is always a markup parser. You can use do xml-parse, or an external parser such as do markup-parse xerces.xml. The beginning of the markup-processing chain is also the only place you should use any of these actions; once the markup is parsed, there is no need to convert it to plain text only to have it parsed again.
The purpose of the parsing step is to convert a string source to a markup source. Within the body of the parsing action, #content is a markup source that represents the result of the parse. That is the starting point of the markup-processing pipeline.
   define function handle-markup-source (value markup source s) elsewhere

   process
      do sgml-parse document scan #main-input
         handle-markup-source (#content)
      done
handle-markup-source () is a function that will process the markup source it takes as an argument. Alternatively, we can launch the markup processing by outputting #content into a markup sink function that will consume and process it:
   define markup-sink function handle-markup-as-sink () elsewhere

   process
      do sgml-parse document scan #main-input
         using output as handle-markup-as-sink ()
            output #content
      done
The end point of a markup-processing pipeline is typically a set of element and other markup rules. In order to activate the rules, apply do markup-parse to a markup source and trigger the rules using the %c format item:
   define string source function handle-markup-source (value markup source s) as
      do markup-parse s
         output "%c"
      done

   process
      do sgml-parse document scan #main-input
         output handle-markup-source (#content)
      done
Incidentally, this example is semantically equivalent to the following, much simpler program fragment:
   process
      do sgml-parse document scan #main-input
         output "%c"
      done
In this example the separation of markup processing from markup parsing may seem pointless. We shall see how it makes the processing pipeline more flexible in more complicated cases.
Let us use the same example task of converting input text to HTML that has been laid out in Linking chains of streaming filters using string source filters. The following filtering functions were used in that example:
   define string source function compress-whitespace (value string source s) as
      repeat scan s
      ...

   define string source function text2xml (value string source s) as
      submit s
      ...

   define string source function tidy-xml (value string source s) as
      do xml-parse scan s
      ...

   define string source function xml2html (value string source s) as
      do xml-parse scan s
      ...
The compress-whitespace () and text2xml () functions deal with the processing of plain text before it gets parsed, so we shall not change them. The functions tidy-xml () and xml2html (), on the other hand, clearly work on markup, so we shall modify them to operate on parsed markup:
   define markup source function tidy-markup (value markup source s) as
      do markup-parse scan s
      ...

   define string source function markup2html (value markup source s) as
      do markup-parse scan s
      ...
The reason for renaming the functions tidy-xml () and xml2html () to tidy-markup () and markup2html (), respectively, is to emphasize that they do not operate on the XML representation of a marked-up document any more: they now expect a parsed markup stream. Their input may come from a parsed XML document, but they would accept a parsed SGML document just the same.
text2xml () produces a string source, whereas tidy-markup () expects a markup source. Although a string source can be used wherever a markup source is required, we want tidy-markup () to be able to react to markup events in its input. The markup events in question can be inserted into the input by converting the string source to a markup source using, say, an XML parser:
   define markup source function xml2markup (value string source s) as
      do xml-parse scan s
         output #content
      done
Our new chain of streaming filters now looks like this:
   process
      output markup2html (tidy-markup (xml2markup (text2xml (compress-whitespace (#main-input)))))
Compared to the old pipeline, the new one may look longer and more complicated. The appearance is misleading, however: the xml2markup () function is very generic and could be reused in any other pipeline that involves XML parsing. Also, the functions tidy-markup () and markup2html () have been made more generic by the fact that they accept any parsed markup, not just markup in XML form.

In the old pipeline, tidy-xml () and xml2html () both had to parse their input represented as XML. Now the only parsing is performed by xml2markup (). Moreover, tidy-xml () had to reproduce most of the original XML representation, so that it could later be parsed by xml2html (). That may include more than producing start and end tags for elements: special characters must be properly escaped, and we may also want to preserve comments, processing instructions, and marked sections in HTML. The new function tidy-markup () can concentrate on its core responsibility, which is tidying the markup. The same burden would fall on any other XML filter chained after tidy-xml (), like add-styles-to-xml (): it would have to parse the input XML and reproduce XML output all over again.
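The advantage of filtering parsed events rather than re-serialized text can be sketched in Python. This is only an analogy, not OmniMark; the event model and every name below are invented for illustration.

```python
# Analogy (Python, not OmniMark): once markup is parsed into a stream of
# events, a filter can rewrite the events directly; only the final
# serializer has to produce tags, escaping, and so on.
import re

def xml2events(text):
    """Toy parser: '<a>x</a>' -> ('start','a'), ('text','x'), ('end','a').
    A real parser would also handle attributes, escaping, comments, etc."""
    for m in re.finditer(r"<(/?)([^>]+)>|([^<]+)", text):
        if m.group(3) is not None:
            yield ("text", m.group(3))
        elif m.group(1):
            yield ("end", m.group(2))
        else:
            yield ("start", m.group(2))

def drop_element(events, name):
    """Event filter: suppress a named element and its entire content,
    with no re-parsing and no re-escaping anywhere in the filter."""
    depth = 0
    for event in events:
        if event == ("start", name):
            depth += 1
        elif event == ("end", name):
            depth -= 1
        elif depth == 0:
            yield event

def events2text(events):
    """Serializer: only here, at the very end, are tags reproduced."""
    pieces = []
    for kind, value in events:
        if kind == "start":
            pieces.append("<%s>" % value)
        elif kind == "end":
            pieces.append("</%s>" % value)
        else:
            pieces.append(value)
    return "".join(pieces)

html = events2text(drop_element(xml2events("<p>keep<note>drop</note>ing</p>"), "note"))
print(html)  # <p>keeping</p>
```

Any number of event filters can be chained between the parser and the serializer, and none of them needs to know anything about XML syntax.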
The easiest way to start a markup filter like tidy-markup () is by applying do markup-parse to the input markup stream. This action will cause the markup rules to be fired by markup events in the stream.
In order to generate the output markup stream, markup rules have two built-in variables at their disposal: #current-markup-event and #content. To demonstrate their use, let us assume that tidy-markup () is required to make the following modifications to its input:
Leave verbatim elements and their content completely unmodified. In the rest of the input:

- Suppress annotation elements, including their entire content.
- Replace span elements by their content; in other words, remove the tags for span elements.
The specified markup filter might be implemented in the following way:
   define markup source function tidy-markup (value markup source s) as
      do markup-parse s
         output "%c"
      done

   element "verbatim"
      signal throw #markup-start #current-markup-event
      output #content
      signal throw #markup-end #current-markup-event

   element "annotation"
      put #suppress #content

   element "span"
      output "%c"

   element #implied
      signal throw #markup-start #current-markup-event
      output "%c"
      signal throw #markup-end #current-markup-event
The span rule and the implied rule in this example invoke %c to delegate the processing of the element content to other markup rules. This is no different from how a text-producing rule handles markup. The rules for verbatim and annotation, on the other hand, use #content rather than %c. The difference between the two is that #content represents the unprocessed content of the current element, just as it appears in the input stream, while %c represents the same content processed by other markup rules. The line output #content produces the unmodified element content, while output "%c" delegates the processing to other markup rules. Finally, the line put #suppress #content in the rule handling annotation elements consumes the entire element content without firing any markup rules and suppresses it.
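The three behaviours can be sketched in Python. This is only an analogy, not OmniMark: each "rule" is handed the element's raw content (standing in for #content) and its content as processed by the other rules (standing in for "%c"), and a rule that uses neither suppresses the element. The tree model and all names are invented for illustration.

```python
# Analogy (Python, not OmniMark): raw() plays the role of #content,
# processed() plays the role of "%c", and returning "" suppresses
# the element entirely, like put #suppress #content.

def serialize(node):
    """Reproduce a subtree verbatim, tags and all (the raw, unprocessed form)."""
    name, children = node
    inner = "".join(c if isinstance(c, str) else serialize(c) for c in children)
    return "<%s>%s</%s>" % (name, inner, name)

def render(node, rules):
    """Dispatch a node to its rule, mirroring how %c fires markup rules."""
    name, children = node
    raw = lambda: "".join(
        c if isinstance(c, str) else serialize(c) for c in children)        # ~ #content
    processed = lambda: "".join(
        c if isinstance(c, str) else render(c, rules) for c in children)    # ~ "%c"
    return rules.get(name, default_rule)(name, raw, processed)

def default_rule(name, raw, processed):
    # ~ element #implied: reproduce the tags, process the content
    return "<%s>%s</%s>" % (name, processed(), name)

rules = {
    "verbatim":   lambda name, raw, processed:
        "<%s>%s</%s>" % (name, raw(), name),                 # content untouched
    "annotation": lambda name, raw, processed: "",           # ~ put #suppress #content
    "span":       lambda name, raw, processed: processed(),  # drop tags, keep content
}

doc = ("p", ["a", ("span", ["b"]), ("annotation", ["x"]),
             ("verbatim", [("span", ["c"])])])
print(render(doc, rules))  # <p>ab<verbatim><span>c</span></verbatim></p>
```

Note that the span inside verbatim keeps its tags: the verbatim rule used the raw content, so the span rule never fired for it.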
The lines beginning with signal throw reproduce the markup events standing for element tags in the original XML. Both the start and the end tag are represented by the same element event, #current-markup-event in the example. The beginning of the element region is signalled with the #markup-start keyword, and its end with #markup-end.