Guide to OmniMark 9

Linking Chains of Streaming Filters

Previous versions of OmniMark made it easy to write streaming filters that applied several filtering rules in parallel, but it was not possible to apply several different filters in sequence without buffering the content between them. This tended to favor one-pass filters, even where a one-pass approach was not the most natural algorithm. With OmniMark 8, creating a set of independent filters and streaming data through them sequentially is as easy as writing a single filter with multiple rules. This allows you to choose the most natural algorithm for each content engineering challenge you encounter.

To enable unlimited streaming, OmniMark introduces sink and source types. Here is a function of type string source, which means that the function returns a source of string data. It also takes an argument of type string source, meaning that it expects to be passed a source of string data. The purpose of the function is to remove excess white space from string data:

  define string source function 
     compress-whitespace  value string source my-string-source
  as
     repeat scan my-string-source
     match blank* "%n" blank*
       output "%n"
     match blank+
       output "%_"
     match [\ white-space]+ => chars
       output chars
     again

This function can be called in any context that expects a data source, such as a submit statement. It can accept any source as an argument, such as #main-input.

  process
    submit compress-whitespace #main-input

This program streams input from #main-input, through the compress-whitespace function, to the submit, where it can be processed by find rules. The find rules receive a stream of data from which all excess white space has been removed by the compress-whitespace function. Data flows through the program in a completely streaming fashion, with no buffering of data.
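
As a minimal sketch of the find-rule side of such a program, the following adds one hypothetical find rule to the process rule. The rule, its "Figure" pattern, and its output text are assumptions made for illustration and are not part of the original example; the function definition is repeated only so that the sketch is complete. Because compress-whitespace has already reduced every run of blanks to a single space, the pattern needs to allow for only one space between the word and the number.

```omnimark
; Filter from the text above, repeated so that the sketch is complete.
define string source function
   compress-whitespace value string source my-string-source
as
   repeat scan my-string-source
   match blank* "%n" blank*
      output "%n"
   match blank+
      output "%_"
   match [\ white-space]+ => chars
      output chars
   again

process
   submit compress-whitespace #main-input

; Hypothetical rule (an assumption for illustration). It sees only
; filtered data, so a single "%_" covers any run of blanks in the input.
find "Figure%_" digit+ => figure-number
   output "[figure " || figure-number || "]"
```

Input that no find rule matches passes through to the current output unchanged, so the rest of the filtered text still reaches the output.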

It would have been possible to write a very similar function in earlier versions of OmniMark, creating an input function that took a stream argument.

  define input function 
     compress-whitespace  value stream my-string
  as
    repeat scan my-string
    match blank* "%n" blank*
       output "%n"
    match blank+
       output "%_"
    match [\ white-space]+ => chars
       output chars
    again

That function could have been called in the same way. But while the input function would have a streaming relationship with the submit statement, the my-string argument would not be streamed to the function. In executing such a function, OmniMark would read in the whole of #main-input, buffer it in memory, and pass it as a string argument to the function. When the function is written with the new string source type, data is streamed incrementally from #main-input to the function and from the function to the submit statement. No buffering takes place.

This means that you can now connect any number of streaming filters in a chain. Suppose that you want to process an unstructured document to create an XML representation and then create an HTML output. You could do this with a traditional OmniMark context-translate program, but that would limit you to one find rule pass and one markup rule pass over the data. With OmniMark 8's unlimited streaming, you can connect as many text filters or markup parsers together as you want. In this case, the most natural algorithm might be:

  1. Filter the input text to remove excess white space. This makes it easier to write the next filter, by simplifying white-space handling (function compress-whitespace):
      define string source function 
         compress-whitespace value string source text
      as
         repeat scan text
         ...
    
  2. Filter the output of compress-whitespace to wrap XML tags around the elements of the input data in the simplest possible fashion (function text2xml).
      define string source function 
         text2xml value string source text
      as
         submit text
         ...
    
  3. Parse the output of text2xml to tidy up the XML, removing unneeded elements and adding structure and ID attributes (function tidy-xml).
      define string source function 
         tidy-xml  value string source markup
      as
         do xml-parse scan markup
         ...
    
  4. Parse the output of tidy-xml to create HTML (function xml2html).
      define string source function 
         xml2html value string source markup
      as
         do xml-parse scan markup
         ...
    

You would then invoke those functions as a chain of streaming filters with a simple output statement:

  process
    output xml2html tidy-xml text2xml compress-whitespace #main-input

The flow of data here is from right to left (as the program is written). Each function, starting with compress-whitespace on the right, takes a string source as its input and returns a string source to the function on its left.
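
The right-to-left flow can be tried out on a smaller scale with a complete two-stage chain. The sketch below is illustrative only: tabs-to-spaces and wrap-lines are hypothetical filters invented for this example, and they stand in for the elided functions above rather than reproducing them.

```omnimark
; Hypothetical filter: replace each tab with a single space.
define string source function
   tabs-to-spaces value string source text
as
   repeat scan text
   match "%t"
      output "%_"
   match [any except "%t"]+ => chars
      output chars
   again

; Hypothetical filter: wrap each non-empty line in <line> tags.
define string source function
   wrap-lines value string source text
as
   repeat scan text
   match any-text+ => line-text
      output "<line>" || line-text || "</line>"
   match "%n"
      output "%n"
   again

process
   ; Data flows right to left: #main-input, tabs-to-spaces, wrap-lines.
   output wrap-lines tabs-to-spaces #main-input
```

Because each stage is an independent function, a stage can be dropped from or added to the output statement without changing the other functions.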

Another way to structure this program would be to write the xml2html function as a sink rather than a source of data. This means that the function becomes a destination to which data is sent, and processes that data before sending it on to another sink. Here is the xml2html function written as a sink function:

  define string sink function 
     xml2html value string sink destination
  as
     using output as destination
     do xml-parse scan #current-input
        output "%c"
     done

This function can be used anywhere a sink (data destination) is expected, such as a using output as statement, and can accept any sink expression as an argument, such as #main-output:

  process
     using output as xml2html #main-output 
       output tidy-xml text2xml compress-whitespace #main-input

Here again, data is streamed through the chain of string source functions to the current output scope. That scope is the string sink function xml2html, which in turn streams its result to #main-output. Once again, the data is never buffered. The output data streams from left to right (as the program is written), from the xml2html function to the main output.
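
The same arrangement can be sketched with a plain-text sink that does not involve a markup parser. The quote-lines function below is a hypothetical example written for illustration; it follows the pattern of the xml2html sink function, scanning #current-input and writing to the sink it was given.

```omnimark
; Hypothetical sink function: prefix each non-empty line it receives
; with "> " and pass the result on to the sink given as its argument.
define string sink function
   quote-lines value string sink destination
as
   using output as destination
   repeat scan #current-input
   match any-text+ => line-text
      output "> " || line-text
   match "%n"
      output "%n"
   again

process
   ; Everything written to the current output goes through quote-lines
   ; before it reaches #main-output.
   using output as quote-lines #main-output
      output #main-input
```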

Since the current output scope of an OmniMark program can include more than one sink, you can define multiple string sink functions and stream data to them simultaneously. In the following example, the original source is converted to XML, then that XML is streamed directly to a file, to an HTML output function, and to an XSL/FO output function, creating three different output formats simultaneously:

  process
     using output as xml2html file #args[1] 
               & xml2fo file #args[2]
               & file #args[3]
      output tidy-xml text2xml compress-whitespace #main-input
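
In its simplest form, writing to several sinks at once needs no functions at all. The following sketch copies the input to the main output and to a file at the same time; the file name copy.txt is an assumption chosen for illustration.

```omnimark
process
   ; Both destinations receive the same data as it is written.
   ; "copy.txt" is a hypothetical file name.
   using output as #main-output & file "copy.txt"
      output #main-input
```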


OmniMark 9.1.0 Documentation Generated: September 2, 2010 at 1:35:14 pm

Copyright © Stilo International plc, 1988-2010.