control structure
do xml-parse
Syntax
do xml-parse document
   (with id-checking Boolean-expression)?
   (with utf-8 Boolean-expression)?
   (creating xml-dtds key keyname)?
   scan input source
   action+
done

do xml-parse
   scan input source
   action+
done

do xml-parse instance
   (with document-element element-name)?
   (with (xml-dtds key key | current xml-dtd))?
   (with id-checking Boolean-expression)?
   scan input source
   action+
done
You can parse an XML document using OmniMark's integrated XML parser.
You can invoke the XML parser with do xml-parse. To invoke the SGML parser, use do sgml-parse. To use a third-party parser, use do markup-parse.
The do xml-parse statement prepares the parser to parse a document. Actual parsing begins with a call to the parse continuation operator "%c" from within the do xml-parse block.
To prepare the parser for parsing, you must do the following:
To configure the parser for well-formed parsing, use the following syntax. (In most of the examples that follow, file #args[1] will be used as the source of the document to be parsed. You can use any valid OmniMark source.)
do xml-parse
   scan file #args[1]
   output "%c"
done
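The fragment above can be placed in a complete program roughly as follows. This is a minimal sketch, assuming the program is run with the document's file name as its first command-line argument and that copying the text content with the markup stripped is the desired result. The catch-all element rule is an illustrative addition, not part of the original example.

```omnimark
; Minimal sketch: well-formed parsing of the file named on the command line.
process
   do xml-parse
      scan file #args[1]
      output "%c"    ; "%c" starts the parse; element rules fire from here
   done

; Illustrative catch-all rule (an assumption of this sketch), so that
; every element the parser encounters has a rule to fire. It outputs
; the element's parsed content and discards the tags.
element #implied
   output "%c"
```

Replacing the catch-all rule with rules for specific element names is where the actual processing of the document would normally go.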
Earlier versions of OmniMark required the keyword instance following do xml-parse when configuring the parser for well-formed parsing. The use of the keyword instance is now optional.
You may include a DTD in the instance if you wish. If you do, the parser will read and use entity definitions from the DTD but will not validate against the structural information in the DTD.
To have the parser validate an XML document against its DTD, you specify that you are giving the parser a complete XML document, including DTD, using the document keyword:
do xml-parse document
   scan file #args[1]
   output "%c"
done
The parser will validate the document against the DTD. You must supply the DTD as part of the source. If the DTD is specified as an external text entity using SYSTEM or PUBLIC, you may need to write an external-text-entity rule to locate the DTD and provide it to the parser.
Suppose you have 20 instances to process, all of which use the same DTD. It is wasteful to parse the same DTD 20 times. To avoid doing this, you can pre-compile the DTD and place it on the built-in shelf xml-dtds:
do xml-parse document
   creating xml-dtds {"my-dtd"}
   scan file "my-dtd.dtd"
   suppress
done
You can then process each instance in turn. The following code assumes you have placed the file names of the instances on a shelf called "my-instances":
repeat over my-instances
   do xml-parse instance
      with xml-dtds {"my-dtd"}
      scan file my-instances
      output "%c"
   done
again
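Putting the two steps together, the pre-compile-then-reuse pattern might look like the sketch below. It is an illustration under stated assumptions, not a definitive program: the shelf declaration, the file names chapter1.xml and chapter2.xml, and the catch-all element rule are all hypothetical additions, and the DTD key syntax simply follows the examples above.

```omnimark
; Hypothetical shelf holding the file names of the instances to process.
global stream my-instances variable initial-size 0

process
   ; Hypothetical instance file names, added for illustration only.
   set new my-instances to "chapter1.xml"
   set new my-instances to "chapter2.xml"

   ; Compile the DTD once and store it on the built-in xml-dtds shelf.
   do xml-parse document
      creating xml-dtds {"my-dtd"}
      scan file "my-dtd.dtd"
      suppress
   done

   ; Parse each instance against the stored DTD without re-reading it.
   repeat over my-instances
      do xml-parse instance
         with xml-dtds {"my-dtd"}
         scan file my-instances
         output "%c"
      done
   again

; Illustrative catch-all rule (an assumption of this sketch).
element #implied
   output "%c"
```

The point of the design is that the cost of parsing the DTD is paid once, before the loop, rather than once per instance.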
If you start an XML parse in the scope of an existing XML parse and you want to use the DTD of the current parse for the nested parse, you can specify that the nested parse use the current DTD:
do xml-parse instance
   with current xml-dtd
   scan file my-instances
   output "%c"
done
In some cases you may wish to parse a partial instance, that is, a piece of data comprising an element from a DTD which is not the doctype element of that DTD. In this case, you can specify the element to be used as the effective doctype for parsing the data using the document-element keyword:
do xml-parse instance
   with document-element "lamb"
   with xml-dtds {"my-dtd"}
   scan file "partinst.xml"
   output "%c"
done
By default, OmniMark checks all XML IDREF attributes to make sure they reference a valid ID. This checking may not be appropriate in processing a partial instance. It also takes time. You can turn this checking on and off using with id-checking followed by a Boolean expression. The following code will parse the specified document without checking IDREFs:
do xml-parse document
   with id-checking false
   scan file "my-xml.xml"
   output "%c"
done
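Because with id-checking takes a Boolean expression rather than only a literal, the choice can be held in a variable. The following is a minimal sketch of that idea; the switch name check-ids and the catch-all element rule are hypothetical names introduced for illustration.

```omnimark
; Hypothetical switch controlling whether IDREFs are checked.
global switch check-ids initial {false}

process
   do xml-parse document
      with id-checking check-ids   ; any Boolean expression is allowed here
      scan file "my-xml.xml"
      output "%c"
   done

; Illustrative catch-all rule (an assumption of this sketch).
element #implied
   output "%c"
```

Holding the setting in one place makes it easier to enable checking for complete documents and disable it for partial instances.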
The XML standard specifies that XML documents use the Unicode character set. However, there are many different encodings of the Unicode character set. OmniMark lets you process documents in any of these encodings. See Character set encoding for details.
One character encoding issue that arises in markup processing is the question of which character set encoding the parser is to use when resolving numeric character entities. When doing well-formed parsing, OmniMark uses UTF-8 encoding to represent numeric character entities. When doing validating parsing, you can select whether to use UTF-8 or Latin-1 encoding for numeric character entities. The default is Latin-1. This has the following consequences:
To select UTF-8 encoding for numeric character entities when doing validating parsing, use with utf-8 true:
process
   do xml-parse document
      with utf-8 true
      scan file "myfile.sgm"
      output "%c"
   done