Validating markup

Markup processing can be conceptually divided into separate steps: parsing (with optional validation), filtering, and generating the final output. Some applications may have other distinct steps, such as analysis, aggregation, and reporting. The parsing of markup is typically performed by the OmniMark actions do sgml-parse and do xml-parse. An example using the well-formed XML parser might look like this:

  process
     do xml-parse scan #main-input
        output "%c"
     done
        

If validation is desired or required, then the validating XML parser, the Xerces parser, or the SGML parser (as appropriate) can be used instead of the non-validating XML parser. Validating parsers validate the input document as they parse it. The remaining processing steps are usually accomplished by markup rules.
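
For example, a minimal sketch that relies on the SGML parser, which validates the document against its DTD as it parses (assuming the instance begins with the appropriate document type declaration):

  process
     do sgml-parse document scan #main-input
        output "%c"
     done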

Validating the parsed markup

Validation can be performed separately from parsing, using a schema library such as OMRELAXNG. Separation of parsing and validation steps makes the processing pipeline more flexible: the parser is not required to perform all possible validations, validators are not required to perform parsing, and the user is free to combine any of the available parsers and validators as needed.

Legacy OmniMark programs typically process the markup coming from the parser immediately, so the body of the do sgml-parse block contains a single output "%c" action to fire the markup rules. We can accomplish the same thing by applying do markup-parse to our markup source:

  process
     do xml-parse scan #main-input
        do markup-parse #content
           output "%c"
        done
     done
        

To validate the markup coming from the parser, we can pass #content to a markup validator. For instance, we can validate it using OMRELAXNG:

  import "omrelaxng.xmd" prefixed by relaxng.
  
  process
     do xml-parse scan #main-input
        using output as relaxng.validator against relaxng.compile-schema file "my-schema.rng"
           output #content
     done
        

The function relaxng.compile-schema used above reads a given textual representation of a RELAX NG schema and returns its compiled representation as an instance of the relaxng.relaxng-schema-type opaque type. The compiled schema is then passed to the markup sink function relaxng.validator, which validates all markup written into it against the schema.

The same compiled schema can be used to validate multiple document instances:

  import "omrelaxng.xmd" prefixed by relaxng.
  
  process
     local relaxng.relaxng-schema-type my-compiled-schema initial {relaxng.compile-schema file "my-schema.rng"}
  
     repeat over #args as input-file
        do xml-parse scan file input-file
           using output as relaxng.validator against my-compiled-schema
              output #content
        done
     again
        

Directing the markup stream

In the examples above, #content has only been validated, without any further processing. To accomplish both validation and processing, we need to send the parsed markup in two directions. This can be done by adding a markup sink function that does the processing, and using the & operator to split the stream in two:

  define markup sink function
     markup-processor into value string sink destination
  as
     using output as destination
        using group "process markup"
        do markup-parse #current-input
           output "%c"
        done
  
  process
     do xml-parse scan #main-input
        using output as relaxng.validator against relaxng.compile-schema file "my-schema.rng"
                        & markup-processor into #current-output
           output #content
     done
        

Keep in mind that #content is a markup source, not a DOM tree of the document. This means that the markup is streamed, as the parser creates it, to be both validated and processed concurrently. We can send the markup to an arbitrary number of destinations. For instance, we can validate the markup against two different schemas and process it in two different ways, and the markup will stream from the parser to all four destinations concurrently:

     do xml-parse scan #main-input
        using output as relaxng.validator against relaxng.compile-schema file "my-schema-1.rng"
                        & relaxng.validator against relaxng.compile-schema file "my-schema-2.rng"
                        & markup-processor into #current-output
                        & another-markup-processor
           output #content
     done
        

Apart from adding more destinations to widen the processing pipeline, we can also extend the pipeline by breaking it into multiple steps. For example, we could direct the markup into a preparatory filtering phase before the main processing:

  define markup sink function
     prepare-markup into value markup sink destination
  as
     using group "prepare markup"
     do markup-parse #current-input
        output "%c"
     done
  
  process
     do xml-parse scan #main-input
        using output as relaxng.validator against relaxng.compile-schema file "my-schema-1.rng"
                        & prepare-markup into markup-processor into #current-output
           output #content
     done
        

validated and handling of validation errors

The easiest way to augment a legacy OmniMark program with schema validation, however, is to use the function relaxng.validated. This function takes the markup source created by a parser as an argument and produces another markup source that can be processed further. Here's an example of its use:

  process
     do xml-parse scan #main-input
        do markup-parse relaxng.validated #content against relaxng.compile-schema file "my-schema.rng"
           output "%c"
        done
     done
        

relaxng.validated makes the markup-processor function defined earlier unnecessary: the markup source produced by validated can be processed by do markup-parse directly. Another difference between validator and validated is that the latter inserts all validation errors into the markup it produces, so in the previous example a markup-error rule would fire for them. The validator function, on the other hand, reports all validation errors to OmniMark's log stream. This behavior can be modified by specifying a different markup sink as the destination for validation errors:

  define markup sink function
     error-processor into value string sink destination
  as
     using output as destination
     do markup-parse #current-input
        output "%c"
     done
  
  process
     do xml-parse scan #main-input
        using output as relaxng.validator against relaxng.compile-schema file "my-schema.rng"
                                report-errors-to error-processor into #current-output
                      & markup-processor into #current-output
           output #content
     done
  
  markup-error
     log
     output "<!--%n"
        || "   Validation error: "
        || #message || "%n"
        || "-->"
        

Validating SGML against a schema

We've already said that a markup parser converts a string source to a markup source. Another way of looking at a markup parser is as a converter from a concrete representation of a markup stream (SGML or XML, with or without a DTD) to its abstract representation. The abstract markup stream is simply a sequence of data characters interspersed with abstract markup events. The markup events are abstract because their original textual representation is abstracted away and only their meaning is kept. As a consequence, a markup schema designed for validating XML can equally be applied to validating SGML:

     do sgml-parse document scan #main-input
        do markup-parse relaxng.validated #content against relaxng.compile-schema file "my-schema.rng"
           output "%c"
        done
     done
        

One problem with this example is that SGML is case-insensitive by default, while the RELAX NG specification treats all names as case-sensitive. There are different solutions to this mismatch: one can either modify the schema to use all-uppercase names, or use an SGML declaration to specify that SGML names should be case-sensitive. Since these solutions are somewhat intrusive, both validator and validated have an optional argument, case-insensitive, which can be used to specify that case should be ignored during validation:

     do sgml-parse document scan #main-input
        do markup-parse relaxng.validated #content
                        against relaxng.compile-schema file "my-schema.rng"
                        case-insensitive true
           output "%c"
        done
     done
        

Handling external text entities

External text entity events coming from a markup parser require special attention. If the markup parser is to proceed with parsing, it has to be supplied with replacement text for the external text entity in question. In other words, an external-text-entity rule must be run, even if only the default one. To prevent mix-ups, OmniMark also requires that exactly one external-text-entity rule be run for each entity reference. For this reason, it is good practice to split external text entity events out of the markup stream meant for other processing as soon as the stream is produced by the markup parser, and to divert them into a different markup sink responsible for resolving and expanding the entities.
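
In its simplest form, the resolving markup sink contains little more than a rule like the following (the same pattern appears in the larger example below), which supplies the replacement text by outputting the file the entity refers to:

  external-text-entity #implied
     output file "%eq"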

The function split-external-text-entities in the OMMARKUPUTILITIES library can help with this task. It takes two markup sinks as arguments: one responsible for resolving external text entities, and the other for handling the rest of the markup stream. The result of this function is a markup sink that can be directly fed all the markup produced by the parser:

  define markup sink function
     entity-resolver
  as
     using group "resolve entities"
     do markup-parse #current-input
        output "%c"
     done
  
  process
     do sgml-parse document scan #main-input
        using output as split-external-text-entities (entity-resolver,
                                                      relaxng.validator against relaxng.compile-schema file "my-schema.rng"
                                                              case-insensitive true
                                                              report-errors-to error-processor into #current-output
                                                      & markup-processor into #current-output)
           output #content
     done
  
  group "resolve entities"
  external-text-entity #implied
     output file "%eq"
        

If you are using the function validated, the external text entity events will be ignored by the schema and reproduced in the returned markup source. If you apply do markup-parse to the result of validated, external-text-entity rules will fire as usual.
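
For instance, here is a minimal sketch (reusing the schema file and the case-insensitive argument from the earlier examples) that validates an SGML document with relaxng.validated while resolving external text entities through an ordinary external-text-entity rule:

  import "omrelaxng.xmd" prefixed by relaxng.

  process
     do sgml-parse document scan #main-input
        do markup-parse relaxng.validated #content
                        against relaxng.compile-schema file "my-schema.rng"
                        case-insensitive true
           output "%c"
        done
     done

  external-text-entity #implied
     output file "%eq"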