xerces-xml

function

Library: Xerces (OMXERCES legacy)
Include: omxerces.xin

Declaration

define external markup source function 
   xerces-xml    schemas value integer       schema-validation-mode optional initial {xerces-auto-schema-validation}
              namespaces value integer       namespace-processing-mode optional initial {xerces-xmlns}
                    scan value string source data

Purpose

Invoking the Xerces-Based XML Parser -- The Basics

The Xerces-Based XML Parser is invoked in much the same way as OmniMark's built-in SGML and XML parsers. The only difference is that you specify "markup-parse xerces-xml" instead of "xml-parse document" or a similar form.

markup-parse tells OmniMark to invoke an external markup parser, and xerces-xml tells OmniMark which external markup parser it is.

The scan argument serves the same purpose as with OmniMark's built-in parsers: it provides input to the parser. It can be a string, a file, the invocation of an external string source function, or the invocation of an input-function.
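
As a minimal sketch of the scan argument, the following program parses a literal string rather than a file. The document text and the catch-all element rule are illustrative assumptions, not something the library supplies.

    include "omxerces.xin"

    process
       ; The input here is a string literal; 'scan file "my.xml"' would read a file instead.
       do markup-parse xerces-xml scan "<doc><p>Hello</p></doc>"
          output "%c"
       done

    ; Catch-all element rule: pass each element's content through unchanged.
    element #implied
       output "%c"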

The omxerces.xin include file defines the xerces-xml markup parser function and the values that can be passed to it on invocation. The xerces-xml function returns the Xerces-based markup parser to OmniMark, and OmniMark then uses it within the do markup-parse action.

The Xerces-based markup parser invokes element and other OmniMark rules in the same manner as OmniMark's built-in parsers. The information available in those rules differs from that provided by OmniMark's built-in parsers in some respects, as described below.

Invoking the Xerces-Based XML Parser -- Further Options

The Xerces-Based XML Parser takes two optional arguments, schemas and namespaces, which control the W3C schema and XML namespace processing done by the Xerces markup parser.

omxerces.xin defines named values for use with the schemas argument:

xerces-no-schemas says do no schema processing.
xerces-no-schema-validation says do schema processing but don't do any schema validation.
xerces-auto-schema-validation says do schema processing, but only do schema validation if any internal/external DTD subset is found in the parsed document. This is the default value for the schemas argument.
xerces-schema-validation says always do schema processing and schema validation.
xerces-full-schema-validation says always do schema processing and schema validation. Additionally it says do full schema constraint checking. (The Apache documentation says: "Full schema constraint checking, including checking which may be time-consuming or memory intensive. Currently, particle unique attribution constraint checking and particle derivation restriction checking are controlled by this option.") If xerces-auto-schema-validation or xerces-schema-validation is specified, partial constraint checking is done.

omxerces.xin also defines named values for use with the namespaces argument:

xerces-no-namespaces says that the markup parser should not do any namespace processing.
xerces-namespaces says that the markup parser should do namespace processing, without returning the xmlns attributes to OmniMark.
xerces-xmlns says that the markup parser should do namespace processing. This is the default value for the namespaces argument.

The namespaces argument doesn't affect OmniMark namespace processing, which is done independently of namespace processing done by the markup parser. The easiest way of distinguishing the two is to observe that the markup parser is responsible for namespace validation, and OmniMark is responsible for making use of the namespace information.

If xerces-namespaces is specified, OmniMark namespace processing is suppressed, because xerces-namespaces tells the Xerces-based markup parser not to return the xmlns attributes that OmniMark namespace processing relies on. As a consequence, you should generally specify xerces-no-namespaces if you want no namespace validation (even if you want OmniMark namespace processing), and specify xerces-xmlns, or allow it to default, if you want namespace validation.

Example

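    include "omxerces.xin"

    process
       do markup-parse xerces-xml
             schemas xerces-no-schemas          ; the Xerces parser will do no schema processing
             namespaces xerces-no-namespaces    ; the Xerces parser will do no namespace processing
             scan file "my.xml"
          output "%c"                           ; run the parse; markup rules fire from here
       done

    element #implied
       output "%c"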

What You Get from the Xerces-Based XML Parser

The following describes the information available to an OmniMark program from version 1.0 of the Xerces-based XML markup parser. It covers both the new things that the Xerces-based markup parser does for OmniMark programs and its limitations compared with the built-in OmniMark markup parsers.

W3C Schemas

The Xerces-Based XML Parser processes and validates W3C Schemas, and what's returned to the OmniMark program is based on how the document is interpreted by any schema used. The most noticeable effect of using a schema is in:

the warnings and errors reported,
the interpretation of entities,
default attribute values, and
the recognition of ignorable whitespace (see below).

Schemas are read into an OmniMark program as external text entities. You can use the external-text-entity rule to control how schemas are found and pre-processed by an OmniMark program.

No information from a schema or from a DTD is available to the OmniMark program, even though it's used by the markup parser in interpreting the document.

External Text Entities

For version 1.0 of the Xerces-based XML markup parser, external-text-entity rules are given only the system identifier and the public identifier (if any) of the entity. The entity's name and other identifying information are not available. (A name is available, but it's an invented name like "#SAXENTITY1", not the entity's true name.)

W3C schemas are read by OmniMark as external entities and are passed to external-text-entity rules in the same manner as other external entities. If you want to distinguish between schemas and real entities, you can examine the system identifier, which is a file name in many cases and may be enough to tell the two apart.
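
The following is a hedged sketch of an external-text-entity rule that resolves entities, schemas included, by system identifier. It assumes that the system identifier is a readable file name. The ".xsd" suffix test is only an illustrative guess at how schemas might be recognized; the library does not guarantee either.

    external-text-entity #implied
       do when entity is system
          ; Assumption: schema files can be recognized by an ".xsd" suffix.
          do when "%eq" matches any** (".xsd" =|)
             put #error "reading schema: %eq%n"
          done
          ; Supply the file named by the system identifier as the entity's text.
          output file "%eq"
       else
          ; Only a public identifier was given; this sketch supplies no replacement text.
          put #error "cannot resolve an entity without a system identifier%n"
       done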

Element Information

Elements have names and attributes in the usual manner. However, they are all seen by OmniMark as having an ANY content model, and all attributes are seen as being of CDATA type and "specified". EMPTY elements are not identified ("is empty" is always false) either based on the elements' declarations or on whether they use a "/>"-ending start tag.

You can't usefully repeat over an attribute's values: because attributes are seen as CDATA, they always have just one item.

Ignorable Whitespace

OmniMark sees what XML defines as "ignorable whitespace" as the contents of a marked-section ignore. The marked-section ignore rule can be used to capture ignorable whitespace. If there's no marked-section ignore rule, then ignorable whitespace is ignored.
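
A minimal sketch of capturing ignorable whitespace, assuming the rule is written as marked-section ignore in OmniMark source and that "%c" yields the section's content inside it:

    marked-section ignore
       ; Pass the ignorable whitespace through instead of dropping it.
       output "%c"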

Processing Instructions

Processing instructions are returned to the processing-instruction rule in the normal manner with one exception: the XML declaration, which is encoded as a processing instruction starting with <?xml, is used by the markup parser and is not returned to the OmniMark program.
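
As an illustration, a processing-instruction rule along these lines would copy each processing instruction to the output. The pattern variable name pi-text is hypothetical. The XML declaration never reaches this rule.

    processing-instruction any* => pi-text
       ; Re-emit the processing instruction with XML delimiters.
       output "<?" || pi-text || "?>"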

Errors and Warnings

Errors and warnings from the Xerces-based markup parser are returned to OmniMark as errors and warnings in the same manner as for its built-in markup parsers.

The only difference is that there is just one numeric exception code for errors and one for warnings, so the numeric exception code cannot be used to distinguish between different kinds of errors.

The Xerces-based markup parser does not normally stop when it encounters what the Xerces XML parser considers a fatal error: it keeps going. This is normally appropriate, because the Xerces XML parser can recover from most errors. However, there are cases in which it cannot recover, and in these cases the same error may be reported more than once.

To prevent run-away reporting of errors, the Xerces-based markup parser terminates if it encounters 5 fatal errors in a row, without other information intervening.
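
Because the numeric codes don't distinguish kinds of errors, a program that needs to react to them has to work from the message text. A minimal sketch, assuming #message is available in the markup-error rule as it is with the built-in parsers; the counter name problem-count is hypothetical:

    global integer problem-count initial {0}

    markup-error
       ; Count every reported problem and log its text.
       increment problem-count
       put #error "markup problem: " || #message || "%n"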

What Rules Are Fired

The following lists all the OmniMark markup parser rules that are of use with the Xerces-based markup parser:

Other Library Functions

xerces-library-version
xerces-xml