Parsing generated markup

The input to a markup parser need not come directly from an XML or SGML file. You can process a data source by first scanning the data and then parsing the output of the scanning process. This is particularly valuable when you are translating data into XML or SGML and want to use the parser to verify the structure of the output document. It is also a useful way to handle certain processing tasks: first normalize the data to XML or SGML, then process the normalized form.

One way to process a document by scanning and then parsing would be to stream the scanning output to a buffer or a file and then parse that buffer or file. However, this approach is resource-intensive and can be slow. You can avoid buffering the intermediate form by feeding the output of the scanning process directly to the parser: the output of the scanning process becomes the input source of the parsing operation, and the two run as coroutines.

  define string source function
     make-xml
  as
     submit #main-input
  
  process
     do xml-parse scan make-xml
        output "%c"
     done

This is a fairly general technique that can be applied to a variety of input formats. The only difference would be in the find rules run by the submit action, which are responsible for generating the intermediate well-formed XML. One example of such rules will be presented below.

If you are using this technique to validate an XML or SGML document you are creating from other data, or if you wish to preserve the intermediate XML, just add another stream to the output scope of the submit action to capture the data it emits:

  define string source function
     make-xml
  as
     using output as #current-output & file "output.xml"
        submit #main-input

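The matching process block might then look something like this:

  process
     do xml-parse scan make-xml
        suppress
     done
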
In the code above, the output of the find rules goes both to the file output.xml and to the parser. Note that output "%c" in the do xml-parse block has been replaced with suppress, which suppresses all output from the parser and the markup rules.

Example: CSV

Take a simple example: input containing comma-separated values (CSV). The find rules that convert this format to XML might look like this:

  ; ignore empty lines
  find (value-start | line-start) (line-end | value-end)
  
  find value-start | line-start
     output "<row><value>"
  
  find line-end | value-end
     output "</value></row>"
  
  ; a comma separates two values
  find ","
     output "</value><value>"
  
  ; a quoted value: the rule consumes the whole quoted segment, so commas
  ; inside it do not split the value; a doubled quote ("") stands for a
  ; literal quote, so when one follows, output it and let the rule match
  ; again starting at it
  find '"' any ** => val '"' (lookahead '"' => quote)?
     output val
     output quote
        when quote is specified

You must also wrap the output of the submit action in a single root element, as XML requires:

  define string source function
     make-xml
  as
     output "<csv>"
     submit #main-input
     output "</csv>"

Now you have a nearly complete program for processing CSV files; the only missing ingredient is a set of element rules.
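
For example, element rules that simply write each value on its own line might look like this (a minimal sketch; any element rules appropriate to your processing would do):

  element "csv"
     output "%c"

  element "row"
     output "%c"

  element "value"
     output "%c%n"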

From XML to markup

The find rules above have a problem that may not be immediately apparent: they do not escape the special XML characters. If a less-than sign or an ampersand appears in the CSV input, the intermediate XML will be malformed. You can fix this problem with a few more find rules, as sketched below. Alternatively, you can avoid it altogether by taking the XML parser out of the loop and replacing the intermediate XML stream with a markup stream.
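
Such escaping rules might look like the following sketch; it covers only unquoted data, so the quoted-value rule would also have to pass the text it captures through the same translation:

  find "&"
     output "&amp;"

  find "<"
     output "&lt;"

The rest of this section takes the second approach: the string source function becomes a markup source function, and the intermediate form is parsed with do markup-parse: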

  define markup source function
     make-markup
  as
     submit #main-input
  
  process
     do markup-parse make-markup
        output "%c"
     done

Note that the make-markup function does not emit the <csv> wrapper any more, because do markup-parse does not require a root element. What it does require is a well-formed stream of markup events, which the find rules will have to emit instead of the well-formed XML elements. Each markup element event will be bound to a particular element-declaration. You can obtain the necessary element declarations from the declared-elements of a DTD, but in this case they are easier to construct manually:

  constant element-declaration row-declaration initial
     { create-element-declaration "row"
       attributes { }
       content element-content-model }

  constant element-declaration value-declaration initial
     { create-element-declaration "value"
       attributes { }
       content cdata-content-model }

All that remains is to replace the output actions with signal throw actions, so that the find rules emit a stream of markup events instead of text. Take care to emit a #markup-end for each #markup-start you emit, and to pass it the very same event.

  global markup-element-event row-element
  global markup-element-event value-element
  
  ; ignore empty lines
  find (value-start | line-start) (line-end | value-end)
  
  find value-start | line-start
     set row-element to create-element-event row-declaration attributes { }
     set value-element to create-element-event value-declaration attributes { }
     signal throw #markup-start row-element
     signal throw #markup-start value-element
  
  find line-end | value-end
     signal throw #markup-end value-element
     signal throw #markup-end row-element
  
  find ","
     signal throw #markup-end value-element
     set value-element to create-element-event value-declaration attributes { }
     signal throw #markup-start value-element
  
  ; quoted values are handled exactly as before; here the captured text
  ; needs no escaping, because it is emitted as character data rather
  ; than parsed as XML
  find '"' any ** => val '"' (lookahead '"' => quote)?
     output val
     output quote
        when quote is specified