Markup processing control

Markup processing tasks vary widely in complexity. OmniMark has a number of features that let the programmer control how markup is processed. This topic explores some of those features, from the simplest element rules to more complex examples using groups and markup sink functions.

OmniMark is a rule-based language, and the main ingredient of many OmniMark programs is element rules. The body of every element rule in your program must take care of processing the element's content. The simplest way to accomplish this is to delegate the processing of content to other markup rules using %c:

  element "simple"
     output "%c"

The above rule will reproduce the output and effect of all rules that its content will fire. Wrapping a part of the input in element <simple> will make no difference to the output.
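
As a minimal sketch of this behavior, the following complete program parses a literal string. The <doc> element, its rule, and the sample input are assumptions made for illustration only:

  process
     ; Parse a small well-formed XML instance given as a string literal.
     do xml-parse scan "<doc><simple>Hello</simple>, World!</doc>"
        output "%c"
     done
  
  ; Hypothetical document element: passes its content through unchanged.
  element "doc"
     output "%c"
  
  element "simple"
     output "%c"

Under these assumptions the program should print Hello, World!, which is what it would also print if the <simple> tags were removed from the input.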

Alternatively, the rule could perform some actions before and after the processing of its content. For example, it could output something:

  element "parenthesized"
     output "("
     output "%c"
     output ")"

The effect of having element <parenthesized> in the input is to wrap the result of processing the element content with parentheses. You can also redirect the content processing output to another destination, or discard it completely:

  element "redirect"
     using output as file (attribute "filename")
        output "%c"
  
  element "discard"
     suppress
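
The destination does not have to be a file. As a sketch, the hypothetical rule below (the element name measured and the stream name captured are illustrative assumptions) collects the content processing output in a buffer so that it can inspect the result before emitting it:

  element "measured"
     local stream captured
  
     ; Collect everything the content rules output into a buffer.
     open captured as buffer
     using output as captured
        output "%c"
     close captured
  
     ; Emit the collected text, followed by its length.
     output captured || " [" || "d" % (length of captured) || " characters]"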

None of the rules above alters the result of content processing; they merely add to it or change its destination. The following rule subjects the output of %c to another round of processing:

  element "indent"
     repeat scan "%c"
     match any-text+ => line
        output "  " || line
     match "%n"
        output "%n"
     again

The indent rule indents each line produced from its content. To accomplish this, it alters the result of %c but not the way this result gets produced. This is good: the rule fulfills its purpose with a localized code change. If you tried to accomplish the same effect in a single pass, you would have to modify every place where a line could be emitted within an <indent> element.
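
The same post-processing pattern applies to other text transformations. As a sketch, the hypothetical rule below (the element name compact is an assumption) collapses every run of white space produced by its content into a single space, again without touching the rules that produce the text:

  element "compact"
     repeat scan "%c"
     ; Replace any run of spaces, tabs, and line ends with one space.
     match white-space+
        output " "
     ; Copy every other character through unchanged.
     match any => ch
        output ch
     again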

Using #content

The previous rule is an example of post-processing of content. It invokes other rules to process its content using %c, and then scans through their output. An alternative approach is to pre-process the content before invoking other rules by using #content instead of %c. Here are a few examples:

  element "redirect-content"
     using output as alternative-content-processor ()
        output #content
  
  element "distribute-content"
     using output as alternative-content-processor () & relaxng.validator against my-schema
        output #content
  
  element "really-discard"
     put #suppress #content
  
  element "half-marked-up"
     do markup-parse up-translate-content (#content)
        output "%c"
     done

The first rule above, redirect-content, does not invoke any markup rules itself. Instead it sends its entire content off to alternative-content-processor, a markup sink function which may be imported from another module, to process it in any way it pleases.

The rule distribute-content is similar, but it sends its content to two destinations in parallel: to alternative-content-processor to be processed, and to the relaxng.validator function to be validated at the same time.

The really-discard rule is similar to the rule discard you have seen earlier, but where the latter discarded the output of content processing, really-discard discards the content processing itself. By directing its #content to #suppress, this rule avoids invoking any rules that would process its markup.
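
The difference becomes visible when a content rule has a side effect. As a sketch, assume a hypothetical <item> element whose rule counts its instances (the element name and the counter item-count are illustrative only):

  global integer item-count initial { 0 }
  
  element "item"
     ; Side effect: count every <item> whose rule fires.
     increment item-count
     output "%c"

Following the description above, an <item> inside <discard> still fires this rule and increments item-count, even though its output is thrown away. An <item> inside <really-discard> never fires the rule, so the counter is left unchanged.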

Finally, the rule half-marked-up pre-processes its content through the function up-translate-content. For example, if the content of element <half-marked-up> was

  This is one paragraph.
  
  This is <em>another</em> paragraph,
  as you can tell by the blank line preceding it.

up-translate-content could convert this input to appear as

  <para>This is one paragraph.</para>
  
  <para>This is <em>another</em> paragraph,
  as you can tell by the blank line preceding it.</para>

After this pre-processing step, the rule half-marked-up applies do markup-parse and invokes regular content processing with %c. Notice that both the original element <em> and the newly introduced element <para> can be processed by the regular element rules, as if they were both present in the content from the beginning. The function up-translate-content could be defined in a different module as follows:

  export markup source function
     up-translate-content (value markup source m)
  as
     do xml-parse scan "<up-translated>"
                    || wrap-implicit-paragraphs (split-data-content (m, #current-output))
                    || "</up-translated>"
        output "%c"
     done
  
  element "up-translated"
     output #content

This function in turn relies on two others: split-data-content, which separates the plain text from the markup events and sends the latter directly to the output of up-translate-content, and wrap-implicit-paragraphs, which inserts XML tags into the plain text.

  define string source function
     split-data-content (value markup source m,
                         value markup sink   events)
  as
     repeat
        output m take any*
        exit
  
     catch #markup-start event
        signal to events rethrow
     catch #markup-point event
        signal to events rethrow
     catch #markup-end event
        signal to events rethrow
     again
  
  
  define string source function
     wrap-implicit-paragraphs (value string source s)
  as
     repeat scan s
     match lookahead any-text
        output "<para>" || s take (any ** lookahead ("%n%n" | value-end)) || "</para>"
     match "%n"
        output "%n"
     again

Using groups

Dividing the processing of your content into multiple steps is usually the best way to improve your program, as it is less intrusive and lets you reuse the common processing code. Still, sometimes neither post-processing nor pre-processing of content is enough and you need to alter the very way content is processed. The easiest way to achieve this is with groups.

If you have an element whose content is completely different from the rest of your input, you will probably want to process it using a completely different set of rules from the regular one. To do this, simply put your %c into a using group scope:

  element "foreign"
     using group "process foreign elements"
        output "%c"
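
The rules of the group itself are not shown in this topic. As a sketch, assuming the group is declared under the same quoted name and that the foreign content contains a hypothetical <title> element, they could look like this:

  ; Rules following this declaration belong to the named group.
  group "process foreign elements"
  
  ; Hypothetical rule for a <title> element found in foreign content.
  element "title"
     output "[foreign title: %c]"
  
  ; Any other element in foreign content is passed through.
  element #implied
     output "%c"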

If, on the other hand, the content model of your element is not completely unique, you may want to use both the common rules and the special ones:

  element "half-foreign"
     using group "process foreign elements" & #group
        output "%c"

Keep in mind that for every element instance in your content, only a single element rule can fire: either a rule from your group "process foreign elements" or one of the common rules. That means, for example, that you cannot have an unguarded element #implied rule in both groups. But what if you actually want both rules to fire, because they both perform useful actions? One solution is to merge the body of the common rule into the other rule. If you would rather avoid the code duplication, you can apply the technique used by the distribute-content rule and send your content to be processed by both groups. You just need to define two markup sink functions that invoke the proper rules:

  define markup sink function
     common-content-processor (value string sink destination)
  as
     do markup-parse #current-input
        put destination "%c"
     done
  
  define markup sink function
     foreign-content-processor (value string sink destination)
  as
     using group "process foreign elements"
     do markup-parse #current-input
        put destination "%c"
     done
  
  element "distribute-half-foreign"
     using output as foreign-content-processor (#current-output)
                   & common-content-processor (#current-output)
        output #content

Now that the content is processed by two groups of rules independently, each group is allowed to have an element #implied rule, and they can (and must) both fire for each element in the content.

Ordering the outputs

The reason #current-output is passed as an argument to the two content-processor functions is to let them output into it. There will be a problem, however, if both of them do so for the same part of the content, because the two outputs will then be merged together. For example, if neither group contains any data-content or translate rule, the content of input <distribute-half-foreign>Hello, World!</distribute-half-foreign> would be duplicated and the output would be Hello, World!Hello, World!.

If you do need both outputs, instead of merging them as they come you may want to order them properly in your output by temporarily buffering one and outputting it after the other:

  element "distribute-half-foreign"
     local stream common-output
  
     open common-output as buffer
     using output as foreign-content-processor (#current-output)
                   & common-content-processor (common-output)
        output #content
     close common-output
     output common-output

Alternatively, instead of storing the output of content processing you can use a markup-buffer to store your content before processing it. This lets you control both the order of your outputs and the order of processing:

  import "ommarkuputilities.xmd" unprefixed
  
  element "distribute-half-foreign"
     local markup-buffer my-content
  
     using output as foreign-content-processor (#current-output) & markup sink my-content
        output #content
     using output as common-content-processor (#current-output)
        output my-content

Since the two sets of rules no longer process the content in parallel, there is no need to combine the two content processors with the & operator. You can write this rule to the same effect without relying on the markup sink functions to wrap the rule invocations:

  import "ommarkuputilities.xmd" unprefixed
  
  element "distribute-half-foreign"
     local markup-buffer my-content
  
     using output as my-content
        output #content
  
     using group "process foreign elements"
     do markup-parse my-content
        output "%c"
     done
  
     do markup-parse my-content
        output "%c"
     done