The input to a markup parser need not come directly from an XML or SGML file. You can process a data source by first scanning the data and then parsing the output of the scanning process. This is particularly valuable when you are translating data into XML or SGML and want to use the parser to verify the structure of the output document. It is also a useful way to do certain kinds of processing by first normalizing the data to XML or SGML format and then processing the normalized data.
One way to process a document by scanning and then parsing would be to stream scanning output to a buffer or a
file and then to parse the file or buffer. However, this approach is resource intensive and can be slow. You can
avoid buffering the intermediate form by feeding the output of the scanning process directly to the parser. The
output of the scanning process becomes the input source of the parsing operation, and the scanner and the parser run as two coroutines.
   define string source function make-xml as
      submit #main-input

   process
      do xml-parse scan make-xml
         output "%c"
      done
This is a fairly general technique that can be applied to a variety of input formats. The only difference would
be in the find rules run by the submit action, which are responsible for generating the intermediate well-formed
XML. One example of such rules will be presented below.
If you are using this technique to validate an XML or SGML document you are creating from other data, or if you
wish to preserve the intermediate XML, just add another stream to the output scope of the submit action
to capture the data it emits:
   define string source function make-xml as
      using output as #current-output & file "output.xml"
         submit #main-input

   process
      do xml-parse scan make-xml
         suppress
      done
In the code above, the output of the find rules goes both to the file output.xml and to the parser. Note that
output "%c" in the do xml-parse block has been replaced with suppress, which discards all output from the parser
and the markup rules.
Take a simple example of input containing comma-separated values, in CSV format. The find rules that convert
this format to XML might look like this:
   ; ignore empty lines
   find (value-start | line-start) (line-end | value-end)

   find value-start | line-start
      output "<row><value>"

   find line-end | value-end
      output "</value></row>"

   find ","
      output "</value><value>"

   find '"' any ** => val '"' (lookahead '"' => quote)?
      output val
      output quote
         when quote is specified
You must also wrap the output of the submit action in the single root element that XML requires:
   define string source function make-xml as
      output "<csv>"
      submit #main-input
      output "</csv>"
Now you have a nearly complete program for processing CSV files; the only missing ingredient is a set of element
rules.
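As an illustration only, the missing element rules could be as small as the sketch below. It assumes the csv, row, and value element names produced by the find rules above. The output layout (one row per line, each value in square brackets) is a hypothetical choice and not part of the original program.

```omnimark
; hypothetical element rules for the intermediate XML
element "csv"
   output "%c"

element "row"
   ; end each row with a newline
   output "%c%n"

element "value"
   ; bracket each value so that field boundaries stay visible
   output "[%c]"
```

Each rule uses "%c" to continue parsing the element's content, so the text of every value reaches the output through the value rule.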
The find rules above have a problem that may not be immediately apparent: they don't escape the
special XML characters. If any less-than character or ampersand is present in the CSV input, the intermediate
XML will be malformed. You can fix this problem with some more find rules. Alternatively, you can avoid the
problem by taking the XML parser out of the loop and replacing the intermediate XML stream with a markup stream.
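For the first option, a minimal sketch of the extra find rules might look like the following. It assumes that only the less-than character and the ampersand need escaping in element content, and it covers only text that is not captured by the quoted-value rule. That rule writes its capture with output val, so it would need its own escaping.

```omnimark
; replace the XML special characters with predefined entity references
find "<"
   output "&lt;"

find "&"
   output "&amp;"
```

The second option, replacing the intermediate XML stream with a markup stream, starts from a function like this one: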
   define markup source function make-markup as
      submit #main-input

   process
      do markup-parse make-markup
         output "%c"
      done
Note that the make-markup function no longer emits the <csv> wrapper, because do markup-parse does not require a
root element. What it does require is a well-formed stream of markup events, which the find rules must now emit
in place of the well-formed XML elements. Each markup element event is bound to a particular
element-declaration. You can obtain the necessary element declarations from the declared-elements of a DTD, but
in this case they are easier to construct manually:
   constant element-declaration row-declaration
      initial { create-element-declaration "row"
                   attributes { }
                   content element-content-model }

   constant element-declaration value-declaration
      initial { create-element-declaration "value"
                   attributes { }
                   content cdata-content-model }
All that remains is to replace the output actions with signal throw actions, so that the rules emit a stream of
markup events instead of text. Take care to emit a #markup-end for every #markup-start, and to use the very same
event for both.
   global markup-element-event row-element
   global markup-element-event value-element

   ; ignore empty lines
   find (value-start | line-start) (line-end | value-end)

   find value-start | line-start
      set row-element to create-element-event row-declaration attributes { }
      set value-element to create-element-event value-declaration attributes { }
      signal throw #markup-start row-element
      signal throw #markup-start value-element

   find line-end | value-end
      signal throw #markup-end value-element
      signal throw #markup-end row-element

   find ","
      signal throw #markup-end value-element
      set value-element to create-element-event value-declaration attributes { }
      signal throw #markup-start value-element

   find '"' any ** => val '"' (lookahead '"' => quote)?
      output val
      output quote
         when quote is specified