do xml-parse

Syntax

  do xml-parse document (with id-checking Boolean-expression)? (with utf-8 Boolean-expression)? (creating xml-dtds key keyname)?
     scan input source
     action+
  done

  do xml-parse
     scan input source
     action+
  done

  do xml-parse instance (with document-element element-name)?
     (with (xml-dtds key key | current xml-dtd))?
     (with id-checking Boolean-expression)?
     scan input source
     action+
  done

Purpose

You can parse an XML document using OmniMark's integrated XML parser.

You can invoke the XML parser with do xml-parse. To invoke the SGML parser, use do sgml-parse. To use a third-party parser, use do markup-parse. The do xml-parse statement only prepares the parser to parse a document. Actual parsing begins with a call to the parse continuation operator "%c" from within the do xml-parse block. To prepare the parser for parsing, you must do the following:

specify the type of XML document to be processed - well-formed or valid
specify the DTD to be used, if required
specify the source of the XML data

Well-formed parsing

To configure the parser for well-formed parsing, use the following syntax. (In several of the examples that follow, file #args[1] is used as the source of the document to be parsed. You can use any valid OmniMark source.)

  do xml-parse
     scan file #args[1]
     output "%c"
  done
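
The "%c" in the block above hands control to the markup rules, so in practice the program also contains element rules. A minimal sketch of a complete program, assuming a catch-all element #implied rule that simply passes content through, might look like this:

```
; Sketch only: the catch-all rule is an assumption for illustration.
process
   do xml-parse
      scan file #args[1]
      output "%c"
   done

; Fires for any element that has no rule of its own and
; continues parsing, so only the text content is output.
element #implied
   output "%c"
```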

Earlier versions of OmniMark required the keyword instance following do xml-parse when configuring the parser for well-formed parsing. The use of the keyword instance is now optional.

You may include a DTD in the instance if you wish. If you do, the parser will read and use entity definitions from the DTD but will not validate against the structural information in the DTD.

Validating parsing

To have the parser validate an XML document against its DTD, you specify that you are giving the parser a complete XML document, including DTD, using the document keyword:

  do xml-parse document
     scan file #args[1]
     output "%c"
  done

The parser will validate the document against the DTD. You must supply the DTD as part of the source. If the DTD is specified as an external text entity using SYSTEM or PUBLIC you may need to write an external-text-entity rule to locate the DTD and provide it to the parser.
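
Such a rule could be sketched as follows. It assumes that the DTD is stored in a local file called my-dtd.dtd (a hypothetical name) and that the #dtd entity name selects the external DTD subset. Check both assumptions against your own setup before relying on it:

```
; Hypothetical: supply the external DTD subset from a local file.
external-text-entity #dtd
   output file "my-dtd.dtd"
```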

Validating parsing of multiple documents

Suppose you have 20 instances to process, all of which use the same DTD. It is wasteful to parse the same DTD 20 times. To avoid doing this, you can pre-compile the DTD and place it on the built-in shelf xml-dtds:

  do xml-parse document
     creating xml-dtds {"my-dtd"}
     scan file "my-dtd.dtd"
     suppress
  done

You can then process each instance in turn. The following code assumes you have placed the file names of the instances on a shelf called "my-instances":

  repeat over my-instances
     do xml-parse instance
        with xml-dtds {"my-dtd"}
        scan file my-instances
        output "%c"
     done
  again
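
Put together, the two steps might look like the following sketch. As an assumption made only for this illustration, it takes the instance file names from the command line (#args) rather than from a separate my-instances shelf, and it uses a catch-all element rule:

```
; Sketch only: file names and the catch-all rule are assumptions.
process
   ; Compile the DTD once and store it under the key "my-dtd".
   do xml-parse document
      creating xml-dtds {"my-dtd"}
      scan file "my-dtd.dtd"
      suppress
   done

   ; Parse each instance named on the command line against
   ; the compiled DTD.
   repeat over #args
      do xml-parse instance
         with xml-dtds {"my-dtd"}
         scan file #args
         output "%c"
      done
   again

element #implied
   output "%c"
```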

If you start an XML parse in the scope of an existing XML parse and you want to use the DTD of the current parse for the nested parse, you can specify that the nested parse use the current DTD:

  do xml-parse instance
     with current xml-dtd
     scan file my-instances
     output "%c"
  done

Parsing a partial instance

In some cases you may wish to parse a partial instance, that is, a piece of data comprising an element from a DTD which is not the doctype element of that DTD. In this case, you can specify the element to be used as the effective doctype for parsing the data using the document-element keyword:

  do xml-parse instance
     with document-element "lamb"
     with xml-dtds {"my-dtd"}
     scan file "partinst.xml"
     output "%c"
  done

Controlling ID/IDREF checking

By default, OmniMark checks all XML IDREF attributes to make sure they reference a valid ID. This checking may not be appropriate in processing a partial instance. It also takes time. You can turn this checking on and off using with id-checking followed by a Boolean expression. The following code will parse the specified document without checking IDREFs:

  do xml-parse document
     with id-checking false
     scan file "my-xml.xml"
     output "%c"
  done
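
Because the modifier takes a Boolean expression rather than a fixed keyword, the choice can be made at run time. A minimal sketch, assuming a global switch named check-ids (a hypothetical name introduced here):

```
; Hypothetical switch; ID/IDREF checking is skipped unless it is true.
global switch check-ids initial {false}

process
   do xml-parse document
      with id-checking check-ids
      scan file "my-xml.xml"
      output "%c"
   done

element #implied
   output "%c"
```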

Handling documents with different character set encodings

The XML standard specifies that XML documents use the Unicode character set. However, there are many different encodings of the Unicode character set. OmniMark lets you process documents in any of these encodings. See Character set encoding for details.

One character encoding issue that arises in markup processing is the question of which character set encoding the parser is to use when resolving numeric character entities. When doing well-formed parsing, OmniMark uses UTF-8 encoding to represent numeric character entities. When doing validating parsing, you can select whether to use UTF-8 or Latin-1 encoding for numeric character entities. The default is Latin-1. This has the following consequences:

If your document does not contain character entities greater than 127, you are not affected unless you are using a character encoding that does not correspond with 7-bit ASCII for characters 0 to 127. If you are using such an encoding, you should convert your document to UTF-8 for processing and convert it back afterwards.
If your document is encoded in Latin-1, contains character entities greater than 127, and you are using well-formed parsing, you must convert the document to UTF-8 for processing and convert it back afterwards. Note that if the document does not contain actual characters greater than 127, conversion on input is not necessary as UTF-8 and Latin-1 encodings are identical below 128.
If your document is encoded in UTF-8 and you are using well-formed parsing, you are not affected.
If your document is encoded in UTF-8 and you are using validating parsing, you must tell the parser to interpret numeric character entities using UTF-8. You do this with the modifier with utf-8 true:
```
  process
      do xml-parse document
          with utf-8 true
          scan file "myfile.sgm"
          output "%c"
      done
```
If your document is encoded in Latin-1, contains character entities greater than 127 but less than 256, and you are using validating parsing, you are not affected. Numeric character entities will be output in Latin-1 encoding.
If your document is encoded in Latin-1, contains character entities greater than 255, and you are using validating parsing, you must convert the document to UTF-8 for processing and convert it back afterwards.
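
To make the difference concrete, here is how two numeric character entities are represented in each encoding. The byte values (in hexadecimal) are the standard Latin-1 and UTF-8 encodings of those characters, not output captured from an OmniMark run:

```
&#233;   (e with acute accent)   Latin-1: E9       UTF-8: C3 A9
&#8364;  (euro sign)             Latin-1: (none)   UTF-8: E2 82 AC
```

The first entity falls between 128 and 255, so it can be output in either encoding. The second has no Latin-1 representation, which is why such documents must be processed as UTF-8.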

OmniMark 7.1.2 Documentation Generated: June 28, 2005 at 5:45:11 pm