Just as you can use string source and string sink to stream character data through a series of text filters, you can use markup source and markup sink to stream parsed markup data through a series of markup filters, with no intermediate buffering.
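The streaming, buffer-free character of such a filter chain can be sketched in Python using generators. This is only an analogy, not OmniMark; all names below are invented for illustration.

```python
# A rough analogy (Python, not OmniMark): streaming filters chained as
# generators, so characters flow through the chain one at a time with no
# intermediate buffering.

def compress_whitespace(chars):
    """Text filter: collapse runs of whitespace into single spaces."""
    pending_space = False
    for c in chars:
        if c.isspace():
            pending_space = True
        else:
            if pending_space:
                yield " "
                pending_space = False
            yield c

def upcase(chars):
    """A second text filter in the same chain."""
    for c in chars:
        yield c.upper()

# Filters compose end to end, like OmniMark string source filters:
result = "".join(upcase(compress_whitespace("hello   streaming\n world")))
print(result)  # HELLO STREAMING WORLD
```

Each generator pulls characters from the previous stage on demand, so no stage ever holds the whole stream in memory.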
The starting point of a chain of markup filters is always a markup parser. You can use do xml-parse, or an external parser such as do markup-parse xerces.xml. The beginning of the markup-processing chain is also the only place you should use any of these actions; once the markup is parsed, there is no need to convert it to plain text only to have it parsed again.
The purpose of the parsing step is to convert a string source to a markup source. Within the body of the parsing action, #content is a markup source that represents the result of the parse. That is the starting point of the markup-processing pipeline.
   define function handle-markup-source (value markup source s) elsewhere

   process
      do sgml-parse document scan #main-input
         handle-markup-source (#content)
      done
handle-markup-source () is a function that will process the markup source it takes as an argument. Alternatively, we can launch the markup processing by outputting #content into a markup sink function that will consume and process it:
   define markup-sink function handle-markup-as-sink () elsewhere

   process
      do sgml-parse document scan #main-input
         using output as handle-markup-as-sink ()
            output #content
      done
The end point of a markup-processing pipeline is typically a set of element and other markup rules. In order to activate the rules, apply do markup-parse to a markup source and trigger the rules using the %c format item:
   define string source function handle-markup-source (value markup source s) as
      do markup-parse s
         output "%c"
      done

   process
      do sgml-parse document scan #main-input
         output handle-markup-source (#content)
      done
Incidentally, this example is semantically equivalent to the following, much simpler program fragment:
   process
      do sgml-parse document scan #main-input
         output "%c"
      done
In this example the separation of markup processing from markup parsing may seem pointless. We shall see how it makes the processing pipeline more flexible in more complicated cases.
Let us use the same example task of converting input text to HTML that has been laid out in Linking chains of streaming filters using string source filters. The following filtering functions were used in that example:
   define string source function compress-whitespace (value string source s) as
      repeat scan s
      ...

   define string source function text2xml (value string source s) as
      submit s
      ...

   define string source function tidy-xml (value string source s) as
      do xml-parse scan s
      ...

   define string source function xml2html (value string source s) as
      do xml-parse scan s
      ...
The compress-whitespace () and text2xml () functions deal with the processing of plain text before it gets parsed, so we shall not change them. The functions tidy-xml () and xml2html (), on the other hand, clearly work on markup, so we shall modify them to operate on parsed markup:
   define markup source function tidy-markup (value markup source s) as
      do markup-parse scan s
      ...

   define string source function markup2html (value markup source s) as
      do markup-parse scan s
      ...
The reason for renaming the functions tidy-xml () and xml2html () to tidy-markup () and markup2html (), respectively, is to emphasize that they do not operate on the XML representation of a marked-up document any more: they now expect a parsed markup stream. Their input may come from a parsed XML document, but they would accept a parsed SGML document just the same.
text2xml () produces a string source, whereas tidy-markup () expects a markup source. Although a string source can be used wherever a markup source is required, we want tidy-markup () to be able to react to markup events in its input. The markup events in question can be inserted into the input by converting the string source to a markup source using, say, an XML parser:
   define markup source function xml2markup (value string source s) as
      do xml-parse scan s
         output #content
      done
Our new chain of streaming filters now looks like this:
   process
      output markup2html (tidy-markup (xml2markup (text2xml (compress-whitespace (#main-input)))))
Compared to the old pipeline, the new one may look longer and more complicated. The appearance is misleading, however: the xml2markup () function is very generic and could be reused in any other pipeline that involves XML parsing. Also, the functions tidy-markup () and markup2html () have been made more generic by the fact that they accept any parsed markup, not just markup in XML form.

In the old pipeline, tidy-xml () and xml2html () both had to parse their input represented as XML. Now the only parsing is performed by xml2markup (). Moreover, tidy-xml () had to reproduce most of the original XML representation, so that it could later be parsed by xml2html (). That may include more than producing start and end tags for elements: special characters must be properly escaped, and we may also want to preserve comments, processing instructions, and marked sections in HTML. The new function tidy-markup () can concentrate on its core responsibility, which is tidying the markup. The same burden would fall on any other XML filter chained after tidy-xml (), like add-styles-to-xml (): it would have to parse the input XML and reproduce XML output all over again.
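The advantage of filtering parsed events rather than re-serialized text can be sketched in Python. This is only an analogy, not OmniMark; the event model and every name below are invented for illustration.

```python
# Analogy (Python, not OmniMark): once markup is parsed into a stream of
# events, a filter can rewrite the events directly; only the final
# serializer has to produce tags, escaping, and so on.
import re

def xml2events(text):
    """Toy parser: '<a>x</a>' -> ('start','a'), ('text','x'), ('end','a').
    A real parser would also handle attributes, escaping, comments, etc."""
    for m in re.finditer(r"<(/?)([^>]+)>|([^<]+)", text):
        if m.group(3) is not None:
            yield ("text", m.group(3))
        elif m.group(1):
            yield ("end", m.group(2))
        else:
            yield ("start", m.group(2))

def drop_element(events, name):
    """Event filter: suppress a named element and its entire content,
    with no re-parsing and no re-escaping anywhere in the filter."""
    depth = 0
    for event in events:
        if event == ("start", name):
            depth += 1
        elif event == ("end", name):
            depth -= 1
        elif depth == 0:
            yield event

def events2text(events):
    """Serializer: only here, at the very end, are tags reproduced."""
    pieces = []
    for kind, value in events:
        if kind == "start":
            pieces.append("<%s>" % value)
        elif kind == "end":
            pieces.append("</%s>" % value)
        else:
            pieces.append(value)
    return "".join(pieces)

html = events2text(drop_element(xml2events("<p>keep<note>drop</note>ing</p>"), "note"))
print(html)  # <p>keeping</p>
```

Any number of event filters can be chained between the parser and the serializer, and none of them needs to know anything about XML syntax.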
The easiest way to start a markup filter like tidy-markup () is by applying do markup-parse to the input markup stream. This action will cause the markup rules to be fired by markup events in the stream.
In order to generate the output markup stream, markup rules have two built-in variables at their disposal: #current-markup-event and #content. To demonstrate their use, let us assume that tidy-markup () is required to make the following modifications to its input:
Leave verbatim elements and their content completely unmodified. In the rest of the input:

- Suppress annotation elements, including their entire content.
- Replace span elements by their content; in other words, remove the tags for span elements.
The specified markup filter might be implemented in the following way:
   define markup source function tidy-markup (value markup source s) as
      do markup-parse s
         output "%c"
      done

   element "verbatim"
      signal throw #markup-start #current-markup-event
      output #content
      signal throw #markup-end #current-markup-event

   element "annotation"
      put #suppress #content

   element "span"
      output "%c"

   element #implied
      signal throw #markup-start #current-markup-event
      output "%c"
      signal throw #markup-end #current-markup-event
The span rule and the implied rule in this example invoke %c to delegate the processing of the element content to other markup rules. This is no different from how a text-producing rule handles markup. The rules for verbatim and annotation, on the other hand, use #content rather than %c. The difference between the two is that #content represents the unprocessed content of the current element, just as it appears in the input stream, while %c represents the same content processed by other markup rules. The line output #content produces the unmodified element content, while output "%c" delegates the processing to other markup rules. Finally, the line put #suppress #content in the rule handling annotation elements consumes the entire element content without firing any markup rules and suppresses it.
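The three behaviours can be sketched in Python. This is only an analogy, not OmniMark: each "rule" is handed the element's raw content (standing in for #content) and its content as processed by the other rules (standing in for "%c"), and a rule that uses neither suppresses the element. The tree model and all names are invented for illustration.

```python
# Analogy (Python, not OmniMark): raw() plays the role of #content,
# processed() plays the role of "%c", and returning "" suppresses
# the element entirely, like put #suppress #content.

def serialize(node):
    """Reproduce a subtree verbatim, tags and all (the raw, unprocessed form)."""
    name, children = node
    inner = "".join(c if isinstance(c, str) else serialize(c) for c in children)
    return "<%s>%s</%s>" % (name, inner, name)

def render(node, rules):
    """Dispatch a node to its rule, mirroring how %c fires markup rules."""
    name, children = node
    raw = lambda: "".join(
        c if isinstance(c, str) else serialize(c) for c in children)        # ~ #content
    processed = lambda: "".join(
        c if isinstance(c, str) else render(c, rules) for c in children)    # ~ "%c"
    return rules.get(name, default_rule)(name, raw, processed)

def default_rule(name, raw, processed):
    # ~ element #implied: reproduce the tags, process the content
    return "<%s>%s</%s>" % (name, processed(), name)

rules = {
    "verbatim":   lambda name, raw, processed:
        "<%s>%s</%s>" % (name, raw(), name),                 # content untouched
    "annotation": lambda name, raw, processed: "",           # ~ put #suppress #content
    "span":       lambda name, raw, processed: processed(),  # drop tags, keep content
}

doc = ("p", ["a", ("span", ["b"]), ("annotation", ["x"]),
             ("verbatim", [("span", ["c"])])])
print(render(doc, rules))  # <p>ab<verbatim><span>c</span></verbatim></p>
```

Note that the span inside verbatim keeps its tags: the verbatim rule used the raw content, so the span rule never fired for it.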
The lines beginning with signal throw reproduce the markup events standing for element tags in the original XML. Both the start and the end tag are represented by the same element event, #current-markup-event in the example. The beginning of the element region is signalled with the #markup-start keyword, and its end with #markup-end.