About OmniMark
OmniMark is a streaming programming language. It is designed to make it easy for you to write programs using the streaming programming model.
The streaming programming model is an approach to programming that concentrates on describing the process to be applied to a piece of data, and on processing data directly as it streams from one location to another. In the streaming model, the use of data structures to model input data is eliminated, and the use of data structures to model output is greatly reduced. For instance, here is an OmniMark program that takes a document and converts all references to monetary amounts from the English style ("$29.95") to the French style ("29,95$"):
   process
      submit "The doggy in the window costs $24.95."

   find "$" digit+ => dollars "." digit{2} => cents
      output dollars || "," || cents || "$"
This program outputs:
The doggy in the window costs 24,95$.
Here's how this program works:
process starts a process rule. An OmniMark program is a collection of rules. A process rule fires when the program is run. It is the equivalent of function "main" in other languages.
submit creates an OmniMark source. In this case, the content of the source is the literal string "The doggy in the window costs $24.95."
submit also initiates scanning of the source it creates. Scanning is a process in which data is moved from a source to a destination and has a process applied to it as it moves.
find defines a find rule. A find rule is a filter for data that is being scanned. The find rule specifies a pattern to be matched in the data and the process to be applied when the data is matched. When a source is scanned by find rules, data that is not trapped streams through to the current output scope. Data that is matched by a pattern is consumed and does not stream through to output. Output generated by the rule is merged with the data streaming to the current output scope.
The pattern used in this find rule is designed to match English style dollar values. Leaving out the pattern variable assignments, which we'll discuss in a moment, it looks like this:
"$" digit+ "." digit{2}
The pattern reads as follows: Match a literal dollar sign ("$"). Then match one or more digits (the keyword digit with a plus sign after it, meaning "one or more"). Then match a literal period (".") followed by exactly 2 digits (digit{2}).
In order for the program to create the proper output, it needs to capture the digits that represent the dollars and the cents portions of the matched data. This is done by assigning the matched data to pattern variables, using the pattern variable assignment operator =>. This is the pattern with the pattern variables in place:
find "$" digit+ => dollars "." digit{2} => cents
When the scanning process encounters a piece of data that matches this pattern, it will fire the find rule. The data matched by digit+ will be assigned to dollars and the data matched by digit{2} will be assigned to cents.
Next, the actions associated with the find rule are executed. The output statement emits the dollars and cents values with the "," and "$" characters in the appropriate places. The output goes to the current output scope. Since the unmatched data is also going to this scope, the output of the rule is merged into the source data as it flows to its destination.
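For readers who do not know OmniMark, the pattern and its two pattern variables can be approximated in Python. This is only an analogy, not OmniMark code: the names PRICE and to_french are made up for illustration, and re.sub works on a whole string held in memory, so it mirrors what the find rule matches and emits but not the streaming behaviour.

```python
import re

# Rough analogue of: find "$" digit+ => dollars "." digit{2} => cents
# The named groups stand in for the pattern variables (hypothetical names).
PRICE = re.compile(r"\$(?P<dollars>\d+)\.(?P<cents>\d{2})")


def to_french(text: str) -> str:
    """Replace each matched amount; unmatched text passes through unchanged."""
    return PRICE.sub(lambda m: f"{m['dollars']},{m['cents']}$", text)


print(to_french("The doggy in the window costs $24.95."))
# prints: The doggy in the window costs 24,95$.
```

As with the find rule, only the matched text is consumed and rewritten. Everything else reaches the output untouched.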
There is a lot of detail in this explanation. To get a better idea of how this program works, paste the program into the OmniMark Studio for Eclipse, create an appropriate input file, and trace through the program.
To process data other than literal strings, you need to be able to create a scanning source from external data sources. You also want to be able to send output somewhere other than the screen. In this revised version of the program the input comes from a file named on the command line and the output goes to another file named on the command line.
   process
      using output as file #args[1]
         submit file #args[2]

   find "$" digit+ => dollars "." digit{2} => cents
      output dollars || "," || cents || "$"
Note that only the process rule has changed. The find rule that does the actual work of processing the data remains the same no matter where the data comes from or where it goes. Here's how this new process rule works:
using output as makes the file named by the first command-line argument (#args[1]) part of the current output scope. This makes it the target of all output statements that are executed in that output scope.
submit is now prefixed by the using output as statement, which means that all output generated as a result of the submit will go to that output scope. Its source is now the file named by the second command-line argument (#args[2]).
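The separation between the rule that does the work and the code that attaches sources and destinations can be sketched in Python as well. This is an illustrative analogy only: convert_stream and main are hypothetical names, and processing line by line assumes that a currency amount never spans a line break.

```python
import io
import re
from typing import TextIO

PRICE = re.compile(r"\$(\d+)\.(\d{2})")


def convert_stream(source: TextIO, destination: TextIO) -> None:
    """Rewrite amounts as text flows from source to destination, one line at a time."""
    for line in source:
        destination.write(PRICE.sub(r"\1,\2$", line))


def main(output_path: str, input_path: str) -> None:
    """Mirror the argument order of the OmniMark program: output file first, then input."""
    with open(input_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        convert_stream(src, dst)


# The same function works on any text stream, here an in-memory one.
demo_out = io.StringIO()
convert_stream(io.StringIO("A toy for $24.95\nand $3.50 more\n"), demo_out)
print(demo_out.getvalue(), end="")
```

Only main knows about files. convert_stream, like the find rule, is unaware of where its data comes from or where it goes.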
Beyond the details of the program, notice the streaming model at work:
Firstly, notice that the input data is not buffered. No data structure is created to represent it. The process of replacing the English form with the French form is carried out as the data flows from source to destination. The output is not buffered either. This program will run with equal success on a 2 kilobyte file or a 2 gigabyte file.
Secondly, notice how the program describes the process it performs. A reasonable description of the function of this program would be: "It finds the English format for expressing currency and replaces it with the French format. The input comes from one file and the output goes to another." And when we look at the code, we see that the process rule describes the path the data takes from input file to output file, and the find rule says find the English currency format and output the French currency format.
Thirdly, notice the abstraction involved in dealing with sources and destinations of information. The find rule does not specify what data it is acting on: it is the current input data, whatever source that may flow from. The output statement does not say where the output goes to; it goes to the current output scope, whatever that may be attached to. This means that the same scanning techniques can be applied to any piece of data from program variables, to files, to network data streams, in exactly the same manner. Scanning is a fully general data processing technique, independent of the source or destination of the data to be scanned.
Fourthly, notice how much work is done for you by the scanning mechanism. There is no data movement code in this program. There is no need to maintain pointers or offsets into the data. There is no memory management to worry about. There is no need to explicitly buffer input and output. There is not even any need to worry about the opening and closing of files. All these things are done for you, in a highly robust and optimized way.
The same streaming techniques apply to XML parsing. Here is a simple XML document:
   <person>
      <name>Mary</name>
      <bio>
         <p>Mary had a little lamb</p>
         <p>Its fleece was white as snow</p>
      </bio>
   </person>
Here is a program that processes this XML document to produce HTML output:
   process
      do xml-parse scan file "input.xml"
         output "<HTML>%c</HTML>"
      done

   element "person"
      output "<BODY>%c</BODY>"

   element "name"
      output "<H1>%c</H1>"

   element "bio"
      output "%c"

   element "p"
      output "<p>%c</p>"
You should step through this program in the OmniMark Studio for Eclipse to observe how it works. The output of the program is:
   <HTML><BODY>
   <H1>Mary</H1>
   <p>Mary had a little lamb</p>
   <p>Its fleece was white as snow</p>
   </BODY></HTML>
The process rule plays the same role in this program as in the previous one. It establishes an input source and an output destination and it starts the scanning process. The difference here is that it is the parser that scans the data, not find rules. When the parser finds element markup in the source it is scanning, it fires an element rule. Just as with find rules, the unmatched data -- the "data content" in XML terms -- streams through to the current output. Thus each element rule can output into the current output stream just the way a find rule does.
Since the program is creating HTML, its element rules output HTML markup:
   element "person"
      output "<BODY>%c</BODY>"
This rule outputs the start and end tags for the HTML BODY element. In the final output, however, there will be a good deal of markup and data between "<BODY>" and "</BODY>". Because XML data is hierarchical in nature, element rules fire hierarchically as well. The "person" element rule is suspended at the point the string "%c" occurs in the output statement. All the contents of the "person" element are then parsed, with the appropriate rules being fired. This results in the other markup and data being sent to output. Once this is done, the "person" element rule resumes and "</BODY>" is output.
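The suspend-and-resume behaviour around "%c" can be imitated in Python with an event-driven parser. The sketch below is an analogy built on the standard xml.sax module, not a description of how OmniMark is implemented. The RULES table, RuleHandler, and convert are hypothetical names, and the sketch assumes every element in the input has an entry in RULES.

```python
import io
import xml.sax

# Hypothetical rule table: element name -> output template.
# "%c" marks where the element's content is emitted.
RULES = {
    "person": "<BODY>%c</BODY>",
    "name": "<H1>%c</H1>",
    "bio": "%c",
    "p": "<p>%c</p>",
}


class RuleHandler(xml.sax.ContentHandler):
    """Emit the text before "%c" at the start tag and the text after it at the end tag."""

    def __init__(self, out):
        super().__init__()
        self.out = out

    def startElement(self, name, attrs):
        before, _, _ = RULES[name].partition("%c")
        self.out.write(before)  # the "rule" is now suspended at %c

    def characters(self, content):
        self.out.write(content)  # data content streams straight through

    def endElement(self, name):
        _, _, after = RULES[name].partition("%c")
        self.out.write(after)  # the "rule" resumes after %c


def convert(xml_text: str) -> str:
    out = io.StringIO()
    out.write("<HTML>")
    xml.sax.parseString(xml_text.encode("utf-8"), RuleHandler(out))
    out.write("</HTML>")
    return out.getvalue()


print(convert("<person><name>Mary</name><bio><p>Mary had a little lamb</p></bio></person>"))
# prints: <HTML><BODY><H1>Mary</H1><p>Mary had a little lamb</p></BODY></HTML>
```

Splitting each template at "%c" makes the nesting visible: whatever the parser reports between an element's start and end events lands between the two halves of that element's output.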
The streaming model also makes it easy to process hierarchical data without the assistance of a parser. To demonstrate this, the following program processes the same XML document using find rules. Once again, you should step through this program in the OmniMark Studio for Eclipse to see how it works:
   declare catch end-tag

   process
      output "<HTML>"
      submit file #args[1]
      output "</HTML>"

   find "<person>"
      output "<BODY>"
      submit #current-input
    catch end-tag
      output "</BODY>"

   find "<name>"
      output "<H1>"
      submit #current-input
    catch end-tag
      output "</H1>"

   find "<bio>"
      submit #current-input
    catch end-tag

   find "<p>"
      output "<p>"
      submit #current-input
    catch end-tag
      output "</p>"

   find "</" any** ">"
      throw end-tag
In each find rule for a start tag, the scanning of the current input is handed off to another scanning process. In the single find rule that handles all end tags (find "</" any** ">") the word throw is used to collapse the current process and return execution to the rule that started it. Execution resumes at the catch statement.
Notice how these find rules parallel the element rules from the previous program. The commands submit #current-input and catch end-tag replace the "%c" and do the same thing: they build and collapse the hierarchy of rules that corresponds to the hierarchy in the data stream.
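The nested-scan idea can be approximated in Python with recursion over a shared iterator and an exception that plays the part of the end-tag catch. This is a loose analogy rather than a translation: TOKEN, OPEN, EndTag, submit, and convert are names invented for the sketch, and the tokens are produced from an in-memory string for brevity.

```python
import re

# Split the input into tags and runs of text.
TOKEN = re.compile(r"<[^>]*>|[^<]+")

# Hypothetical rule table: start tag -> (output before content, output after content).
OPEN = {
    "<person>": ("<BODY>", "</BODY>"),
    "<name>": ("<H1>", "</H1>"),
    "<bio>": ("", ""),
    "<p>": ("<p>", "</p>"),
}


class EndTag(Exception):
    """Stands in for the end-tag catch name."""


def submit(tokens, out):
    """Scan the shared token iterator until it runs out or an end tag is thrown."""
    for tok in tokens:
        if tok.startswith("</"):
            raise EndTag  # collapse the current scan
        if tok in OPEN:
            before, after = OPEN[tok]
            out.append(before)
            try:
                submit(tokens, out)  # nested scan of the same input
            except EndTag:
                pass  # execution resumes here, like "catch end-tag"
            out.append(after)
        else:
            out.append(tok)  # unmatched data streams through


def convert(text: str) -> str:
    out = ["<HTML>"]
    submit((m.group() for m in TOKEN.finditer(text)), out)
    out.append("</HTML>")
    return "".join(out)


print(convert("<person><name>Mary</name><bio><p>Mary had a little lamb</p></bio></person>"))
# prints: <HTML><BODY><H1>Mary</H1><p>Mary had a little lamb</p></BODY></HTML>
```

Because every call to submit reads from the same iterator, each nested scan picks up exactly where its caller stopped, and the call stack ends up mirroring the element hierarchy of the input. An end tag with no matching start tag would propagate EndTag out of convert. The sketch does not handle that case.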
This illustrates how the streaming model eliminates the need to buffer input and output, and how it models the hierarchical structure found in most data, whether XML encoded or not.
To learn about specific OmniMark syntax, just follow the links in the code samples, or consult the index to this documentation.
Visit the Stilo website for information on OmniMark training courses near you.
The OmniMark Users Group Mail list (OMUG-L) provides an opportunity for the OmniMark community to discuss issues related to OmniMark programming and related technologies. To subscribe, visit the OmniMark website.
Copyright © Stilo International plc, 1988-2008.