Guide to OmniMark 8

control structure  

do xml-parse

 
 

Syntax

do xml-parse document
   (with id-checking Boolean-expression)?
   (with utf-8 Boolean-expression)?
   (creating xml-dtds key keyname)?
   scan input source 
   action+
done

do xml-parse 
   scan input source 
   action+
done

do xml-parse instance
   (with document-element element-name)?
   (with (xml-dtds key key | current xml-dtds))?
   (with id-checking Boolean-expression)?
   scan input source
   action+
done


Purpose

You can parse an XML document using OmniMark's integrated XML parser.

Use do xml-parse to invoke the XML parser. (To invoke the SGML parser, use do sgml-parse; to use a third-party parser, use do markup-parse.) The do xml-parse statement only prepares the parser to parse a document. Actual parsing begins when the parse continuation operator "%c" is used within the do xml-parse block. To prepare the parser for parsing, you must do the following:

  1. specify the type of XML document to be processed - well-formed or valid
  2. specify the DTD to be used, if required
  3. specify the source of the XML data

Well-formed parsing

To configure the parser for well-formed parsing, use the following syntax. (Most of the examples that follow use file #args[1] as the source of the document to be parsed, but you can use any valid OmniMark source.)

  do xml-parse
     scan file #args[1]
     output "%c"
  done
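
The fragment above needs element rules to do anything useful. As a minimal sketch of a complete program, assuming you simply want the text content of every element passed through, it can be paired with a single catch-all element rule. The "%c" in the do xml-parse block starts the parse, and the "%c" in the element rule continues it through each element's content. The choice of #implied here is illustrative; a real program would normally have rules for specific element names:

  process
     do xml-parse
        scan file #args[1]
        output "%c"
     done

  ; catch-all rule: fires for any element that has no rule of its own
  element #implied
     output "%c"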

Earlier versions of OmniMark required the keyword instance following do xml-parse when configuring the parser for well-formed parsing. The use of the keyword instance is now optional.

You may include a DTD in the instance if you wish. If you do, the parser will read and use entity definitions from the DTD but will not validate against the structural information in the DTD.

Validating parsing

To have the parser validate an XML document against its DTD, you specify that you are giving the parser a complete XML document, including DTD, using the document keyword:

  do xml-parse document
     scan file #args[1]
     output "%c"
  done

The parser will validate the document against the DTD, so you must supply the DTD as part of the source. If the DTD is specified as an external text entity using SYSTEM or PUBLIC, you may need to write an external-text-entity rule to locate the DTD and provide it to the parser.
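
As a rough sketch of such a rule, assuming the DTD lives in a local file (the name "my-dtd.dtd" is hypothetical) and that the rule is selected with #dtd, the following supplies the file's contents to the parser when it asks for the external DTD. Check the external-text-entity documentation for the exact selectors available in your version:

  ; supply the external DTD from a local file (file name is an assumption)
  external-text-entity #dtd
     output file "my-dtd.dtd"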

Validating parsing of multiple documents

Suppose you have 20 instances to process, all of which use the same DTD. It is wasteful to parse the same DTD 20 times. To avoid doing this, you can pre-compile the DTD and place it on the built-in shelf xml-dtds:

  do xml-parse document
     creating xml-dtds {"my-dtd"}
     scan file "my-dtd.dtd"
     suppress
  done

You can then process each instance in turn. The following code assumes you have placed the file names of the instances on a shelf called "my-instances":

  repeat over my-instances
     do xml-parse instance
        with xml-dtds {"my-dtd"}
        scan file my-instances
        output "%c"
     done
  again
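
Put together, the two steps might look like the following sketch. It assumes the instance file names are passed on the command line, so it iterates over #args rather than a separate my-instances shelf, and it uses a catch-all element rule purely for illustration:

  process
     ; compile the DTD once and store it on xml-dtds
     do xml-parse document
        creating xml-dtds {"my-dtd"}
        scan file "my-dtd.dtd"
        suppress
     done

     ; parse each instance against the stored DTD
     repeat over #args
        do xml-parse instance
           with xml-dtds {"my-dtd"}
           scan file #args
           output "%c"
        done
     again

  ; catch-all rule: pass the content of every element through
  element #implied
     output "%c"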

If you start an XML parse in the scope of an existing XML parse and you want to use the DTD of the current parse for the nested parse, you can specify that the nested parse use the current DTD:

  do xml-parse instance 
     with current xml-dtds
     scan file my-instances 
     output "%c"
  done

Parsing a partial instance

In some cases you may wish to parse a partial instance, that is, a piece of data consisting of an element from a DTD that is not the doctype element of that DTD. In this case, you can specify the element to be used as the effective doctype for parsing the data using the document-element keyword:

  do xml-parse instance
     with document-element "lamb"
     with xml-dtds {"my-dtd"}
     scan file "partinst.xml"
     output "%c"
  done

Controlling ID/IDREF checking

By default, OmniMark checks all XML IDREF attributes to make sure they reference a valid ID. This checking may not be appropriate in processing a partial instance. It also takes time. You can turn this checking on and off using with id-checking followed by a Boolean expression. The following code will parse the specified document without checking IDREFs:

  do xml-parse document
     with id-checking false
     scan file "my-xml.xml"
     output "%c"
  done
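
Because id-checking takes a Boolean expression rather than a fixed keyword, the decision can be made at run time. The sketch below assumes a global switch named check-ids (a hypothetical name) that defaults to false:

  ; hypothetical switch controlling whether IDREFs are checked
  global switch check-ids initial {false}

  process
     do xml-parse document
        with id-checking check-ids
        scan file "my-xml.xml"
        output "%c"
     done

  element #implied
     output "%c"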

Handling documents with different character set encodings

The XML standard specifies that XML documents use the Unicode character set. However, there are many different encodings of the Unicode character set. OmniMark lets you process documents in any of these encodings. See Character set encoding for details.

One character encoding issue that arises in markup processing is the question of which character set encoding the parser is to use when resolving numeric character entities. When doing well-formed parsing, OmniMark uses UTF-8 encoding to represent numeric character entities. When doing validating parsing, you can select whether to use UTF-8 or Latin-1 encoding for numeric character entities. The default is Latin-1. This has the following consequences:

Related Syntax
   #current-output
   creating
   document-end
   document-start
   external-text-entity
   find-end
   find-start
   suppress
   xml-dtds
 
Related Concepts
   Co-routines, managing
   Input
   XML DTDs: creating
   XML/SGML parsing: built-in shelves
 
 


OmniMark 8.2.0 Documentation Generated: March 13, 2008 at 3:33:48 pm

Copyright © Stilo International plc, 1988-2008.