Validating markup |
Markup processing can be conceptually divided into separate steps: parsing (with optional validation),
filtering, and generating the final output. Some applications may have other distinct steps, such as analysis,
aggregation, and reporting. The parsing of markup is typically performed by the OmniMark actions do sgml-parse
and do xml-parse. An example using the well-formed XML parser might look like:
process
   do xml-parse scan #main-input
      output "%c"
   done
If validation is desired or required, then the validating XML parser, the Xerces parser, or the SGML parser (as appropriate) can be used instead of the non-validating XML parser. Validating parsers validate the input document as they parse it. The remaining processing steps are usually accomplished by markup rules.
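For instance, a minimal sketch of a validating parse (assuming the document keyword selects the validating XML parser, mirroring the do sgml-parse document scan form used later in this article):

process
   do xml-parse document scan #main-input
      output "%c"
   done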
Validation can be performed separately from parsing, using a schema library such as OMRELAXNG. Separation of parsing and validation steps makes the processing pipeline more flexible: the parser is not required to perform all possible validations, validators are not required to perform parsing, and the user is free to combine any of the available parsers and validators as needed.
Legacy OmniMark programs typically process the markup coming from the parser immediately, so the body of the
do sgml-parse contains a single output "%c" action to fire the markup rules. We can
accomplish the same thing by applying do markup-parse to our markup source:
process
   do xml-parse scan #main-input
      do markup-parse #content
         output "%c"
      done
   done
In order to validate the markup coming from the parser, we can pass #content to a markup validator. For
instance, we can validate using OMRELAXNG.
import "omrelaxng.xmd" prefixed by relaxng.

process
   do xml-parse scan #main-input
      using output as relaxng.validator against relaxng.compile-schema file "my-schema.rng"
         output #content
   done
The function relaxng.compile-schema used above reads a given textual representation of a RELAX NG
schema and returns its compiled representation as an instance of the relaxng.relaxng-schema-type opaque
type. The compiled schema is then passed to the markup sink function relaxng.validator, which validates
all markup written into it against the schema.
The same compiled schema can be used to validate multiple document instances:
import "omrelaxng.xmd" prefixed by relaxng.

process
   local relaxng.relaxng-schema-type my-compiled-schema
      initial {relaxng.compile-schema file "my-schema.rng"}

   repeat over #args as input-file
      do xml-parse scan file input-file
         using output as relaxng.validator against my-compiled-schema
            output #content
      done
   again
In the examples above, #content has only been validated, without any further processing. To accomplish
both, we need to send the parsed markup in two directions. This can be done by adding a markup sink
function that does the processing, and using the & operator to split the stream in two directions:
define markup sink function
   markup-processor into value string sink destination
as
   using output as destination
   using group "process markup"
   do markup-parse #current-input
      output "%c"
   done

process
   do xml-parse scan #main-input
      using output as relaxng.validator against relaxng.compile-schema file "my-schema.rng"
                    & markup-processor into #current-output
         output #content
   done
Keep in mind that #content is a markup source, not a DOM tree of the document. This means that
the markup is streamed, as the parser creates it, to be both validated and processed concurrently. We can send
the markup to an arbitrary number of destinations. For instance, we can validate the markup against two
different schemas and process it in two different ways, and the markup will stream to all four concurrently from
the parser:
do xml-parse scan #main-input
   using output as relaxng.validator against relaxng.compile-schema file "my-schema-1.rng"
                 & relaxng.validator against relaxng.compile-schema file "my-schema-2.rng"
                 & markup-processor into #current-output
                 & another-markup-processor
      output #content
done
Apart from adding more destinations to widen the processing pipeline, we can also extend the pipeline by
breaking it into multiple steps. For example, we could direct the markup into a preparatory filtering phase
before the main processing:
define markup sink function
   prepare-markup into value markup sink destination
as
   using output as destination
   using group "prepare markup"
   do markup-parse #current-input
      output "%c"
   done

process
   do xml-parse scan #main-input
      using output as relaxng.validator against relaxng.compile-schema file "my-schema-1.rng"
                    & prepare-markup into markup-processor into #current-output
         output #content
   done
The easiest way, however, to augment a legacy OmniMark program with schema validation is by using the function
relaxng.validated. This function takes the markup source created by a parser as an argument, and
produces another markup source that can be processed further. Here's an example of its use:
process
   do xml-parse scan #main-input
      do markup-parse relaxng.validated #content against relaxng.compile-schema file "my-schema.rng"
         output "%c"
      done
   done
relaxng.validated makes the markup-processor function defined earlier unnecessary: the
markup source produced by validated can be processed directly by
do markup-parse. Another difference between the functions validator and
validated is that the latter inserts all validation errors into the markup it produces, so in the
previous example a markup-error rule would fire for them. The validator function, on the
other hand, reports all validation errors to OmniMark's log stream. This behavior can be modified by specifying
a different markup sink destination for the validation errors:
define markup sink function
   error-processor into value string sink destination
as
   using output as destination
   do markup-parse #current-input
      output "%c"
   done

process
   do xml-parse scan #main-input
      using output as relaxng.validator against relaxng.compile-schema file "my-schema.rng"
                         report-errors-to error-processor into #current-output
                    & markup-processor into #current-output
         output #content
   done

markup-error
   log
   output "<!--%n"
       || " Validation error: "
       || #message || "%n"
       || "-->"
We've already said that a markup parser converts a string source to a markup source. Another way
of looking at a markup parser is as a converter from a concrete representation of a markup stream (SGML or XML,
with or without a DTD) to its abstract representation. The abstract markup stream is simply a sequence of data
characters interspersed with abstract markup events. The markup events are abstract because their original
textual representation is abstracted away and only their meaning is kept. As a consequence, a markup schema
designed for validating XML can equally be applied to validating SGML:
do sgml-parse document scan #main-input
   do markup-parse relaxng.validated #content against relaxng.compile-schema file "my-schema.rng"
      output "%c"
   done
done
One problem with this example is that SGML is case-insensitive by default, while the RELAX NG specification
treats all names as case-sensitive. There are different solutions to this mismatch: one can either modify the
schema to use all-uppercase names, or use an SGML declaration to specify that SGML names should be case-sensitive.
Since these solutions are somewhat intrusive, both validator and validated have an optional
argument, case-insensitive, that can be used to specify that case should be ignored during
validation:
do sgml-parse document scan #main-input
   do markup-parse relaxng.validated #content
         against relaxng.compile-schema file "my-schema.rng"
         case-insensitive true
      output "%c"
   done
done
External text entity events coming from a markup parser require special attention. If the markup parser is to
proceed with parsing, it has to be supplied with replacement text for the external text entity in question. In
other words, an external-text-entity rule must be run, even if only the default
one. To prevent mixups, OmniMark also requires that exactly one external-text-entity rule be
run for each entity reference. For this reason, it is good practice to split external text entity events out of
the markup stream meant for other processing as soon as the stream is produced by the markup parser, and to divert
them into a different markup sink responsible for resolving and expanding the entities.
The function split-external-text-entities in the OMMARKUPUTILITIES library can help with this task. It
takes two markup sinks as arguments, one responsible for resolving external text entities and the other for
handling the rest of the markup stream. The result of this function is a markup sink which can be directly fed all
markup produced by the parser:
define markup sink function
   entity-resolver
as
   using group "resolve entities"
   do markup-parse #current-input
      output "%c"
   done

process
   do sgml-parse document scan #main-input
      using output as split-external-text-entities
                         (entity-resolver,
                          relaxng.validator against relaxng.compile-schema file "my-schema.rng"
                             case-insensitive true
                             report-errors-to error-processor into #current-output
                          & markup-processor into #current-output)
         output #content
   done

group "resolve entities"

external-text-entity #implied
   output file "%eq"
If you are using the function validated, the external text entity events will be ignored by the schema
and reproduced in the returned markup source. If you apply do markup-parse to the result of
validated, external-text-entity rules will fire as usual.
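As a sketch combining the pieces above (the schema file name is illustrative), the result of validated can be parsed directly, with a default external-text-entity rule supplying the replacement text:

do sgml-parse document scan #main-input
   do markup-parse relaxng.validated #content
         against relaxng.compile-schema file "my-schema.rng"
         case-insensitive true
      output "%c"
   done
done

external-text-entity #implied
   output file "%eq"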
Copyright © Stilo International plc, 1988-2010.