Marked sections

XML documents may contain CDATA marked sections. SGML documents may contain CDATA, RCDATA, IGNORE, and INCLUDE marked sections. OmniMark provides markup rules for handling all these types of marked sections.

CDATA and RCDATA marked sections

CDATA and RCDATA marked sections serve to protect text from being misinterpreted as markup (start tags, end tags, entity references or declarations). These marked sections affect how the data is parsed by the SGML parser, but they do not usually affect the way that OmniMark processes the resulting data content.

The marked-section cdata and marked-section rcdata rules can be used to identify content that was wrapped in a CDATA or RCDATA marked section.

It is very important to understand that the presence or absence of marked-section rules does not affect how marked sections are treated by the SGML parser. They only determine how the SGML parser presents the resulting text to OmniMark.

Much of what holds for IGNORE marked sections also holds for CDATA and RCDATA marked sections. The major difference is that the "default" processing for CDATA and RCDATA marked sections is to treat their text content as data content rather than to discard it.

  • If an OmniMark program contains no marked-section cdata or marked-section rcdata rules, then OmniMark treats the text resulting from these marked sections as if the text resulted from ordinary data content. In other words, OmniMark does not detect the boundaries between the text originating from inside the marked section and the text originating from outside the marked section.
  • Only one marked-section cdata rule may be selected for a CDATA marked section. That is, either there must be only one marked-section cdata rule or, if there is more than one such rule, each must have a condition. Similarly, only one marked-section rcdata rule may be selected for an RCDATA marked section. It is an error for more than one marked-section cdata or marked-section rcdata rule to be selected for a CDATA or an RCDATA marked section.
  • The %c operator captures the text of a CDATA or RCDATA marked section. Either %c or suppress must be used exactly once in a marked-section cdata or marked-section rcdata rule. All modifiers supported by %c can be used on a %c operator in a marked-section cdata or marked-section rcdata rule.
  • The text of a CDATA or RCDATA marked section consists of all the characters between the "[" following the keyword CDATA or RCDATA and the "]]>", not including the surrounding delimiters, but including any record ends or white space within the marked section.
  • All SGML comments in the header of a CDATA or RCDATA marked section are processed prior to the processing of the marked section.
  • Only marked sections in the document instance are available for processing by an OmniMark program. Marked sections in the DTD are always ignored, whether or not there is any marked-section rule in the OmniMark program.
  • The setting of the sgml-out action determines what happens to record ends in the text of a CDATA or RCDATA marked section.
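
The definition of the text of a CDATA or RCDATA marked section given above can be illustrated outside OmniMark. The following is a minimal Python sketch, not OmniMark code and not a description of how the SGML parser is implemented. The function name marked_section_texts and the sample string are hypothetical, and the sketch assumes that the header contains only the keyword (no comments or parameter entity references).

```python
import re

# Matches <![CDATA[ ... ]]> and <![RCDATA[ ... ]]>.
# Assumption: the header holds only the keyword, optionally padded with white space.
MARKED_SECTION = re.compile(r"<!\[\s*(CDATA|RCDATA)\s*\[(.*?)\]\]>", re.DOTALL)


def marked_section_texts(document):
    """Return (keyword, text) pairs, one per CDATA or RCDATA marked section.

    The text excludes the surrounding delimiters but keeps record ends
    and white space, following the definition in the list above.
    """
    return [(m.group(1), m.group(2)) for m in MARKED_SECTION.finditer(document)]


sample = "<p>Before <![CDATA[ a < b &amp;\n still data ]]> after</p>"
sections = marked_section_texts(sample)
print(sections)
```

Note that the "<" and "&amp;" inside the marked section are returned untouched, which reflects the purpose of a CDATA marked section: protecting text from being read as markup.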

IGNORE marked sections

IGNORE marked sections appear to an OmniMark program in the same way as SGML comments do, except that they are processed using a marked-section ignore rule rather than an sgml-comment rule.

OmniMark programmers should note that, in keeping with the provisions of clause 10.4.1 of the SGML standard (ISO 8879:1986), all pairs of "<![" and "]]>" within an IGNORE marked section are matched and treated as text. This means that any marked sections nested within an IGNORE marked section, including their opening and closing delimiters, are treated as part of the text of the IGNORE marked section.

The text of an IGNORE marked section consists of all the characters between the DSO delimiter following the status keyword specification, and the marked section end (that is, between the "[" following the keyword IGNORE and the "]]>"). The text does not include the surrounding delimiters, but does include any record ends or white space within the marked section.
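
The matching of nested delimiter pairs can be sketched in a few lines. The following Python fragment is an illustration only, not OmniMark code. The function name ignore_section_text is hypothetical, and the sketch assumes the section opens with exactly "<![IGNORE[" (no white space or comments in the header).

```python
def ignore_section_text(document, start=0):
    """Return the text of the first IGNORE marked section at or after `start`.

    Nested "<![" ... "]]>" pairs are counted so that the section ends at
    the "]]>" matching its own opening, and the nested delimiters stay
    part of the returned text.
    """
    opener = "<![IGNORE["  # assumption: header written without white space
    begin = document.index(opener, start) + len(opener)
    depth = 1  # the IGNORE section itself is open
    pos = begin
    while pos < len(document):
        if document.startswith("<![", pos):
            depth += 1
            pos += 3
        elif document.startswith("]]>", pos):
            depth -= 1
            if depth == 0:
                return document[begin:pos]
            pos += 3
        else:
            pos += 1
    raise ValueError("unterminated IGNORE marked section")


sample = "a<![IGNORE[ x <![CDATA[ y ]]> z\n]]>b"
text = ignore_section_text(sample)
print(repr(text))
```

In this sample the nested CDATA marked section, delimiters included, is part of the IGNORE text, and the record end before the final "]]>" is kept.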

Any SGML comment in the header of an IGNORE marked section is processed prior to the processing of the IGNORE marked section.

Only marked sections in the document instance are available for processing by an OmniMark program. Marked sections in the DTD are always ignored, whether or not there is any marked-section rule in the OmniMark program.

The setting of the sgml-out action determines what happens to record ends in the text of an IGNORE marked section.

The presence of marked-section ignore rules affects how translate rules match text in and around an IGNORE marked section.

INCLUDE marked sections

SGML comments and ignore, cdata, and rcdata marked sections are all processed similarly. However, include marked sections require quite a different approach. Instead of having one rule to process an include marked section, OmniMark provides two: one for processing the start of a marked section and one for the end. This split is necessary because, unlike other types of marked sections, an include marked section can start in the context of one element and end in another, and so can overlap the hierarchical structure that ties the components of a parsed SGML document together.

This kind of overlapping cannot happen with ignore, cdata, and rcdata marked sections because they inhibit the recognition of other markup, including start and end tags, within their text. An important consequence of this is that the whole of the text of an ignore, cdata, or rcdata marked section is processed with one set of output streams (as used by the output action and as available using the #current-output stream set) and inherits the stream destinations and stream modifiers from the element or data-content rule that processes the surrounding content.

The contents of an include marked section can be part of one or more elements, and the element and data-content rules for those elements may each specify different output destinations and stream modifiers. To avoid the complexity and confusion that could result from trying to "merge" the specifications of the rules for include marked sections with those of the applicable element and data-content rules, include marked section rules apply only to the start and end of an include marked section. They have no direct influence on the processing of the marked section's content. The two rules are the marked-section include-start and marked-section include-end rules.

The OmniMark program can influence the processing of the content of an include marked section by setting global variables and testing them in element and data-content rules, so that those rules can detect when they occur in an include marked section.

This is an example of an INCLUDE marked section overlapping the element structure of a document:

  <title>Part of the title.
  <![INCLUDE[More of the title.
  <p>The first paragraph.
  <p>Part of the second paragraph.
  ]]> More of the second paragraph.
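
The flag technique described earlier (set a global variable at the start of the include marked section, clear it at the end, and test it while processing content) can be modeled on this example. The following Python sketch is only an analogy and is not OmniMark code: the event list is a hand-written, hypothetical stand-in for what a parser might report for the fragment above, and the names events, in_include, and process are invented for illustration.

```python
# Hypothetical event stream for the fragment above (omitted end tags are
# shown as explicit "end" events).
events = [
    ("start", "title"), ("data", "Part of the title. "),
    ("include-start", None), ("data", "More of the title."),
    ("end", "title"),
    ("start", "p"), ("data", "The first paragraph."), ("end", "p"),
    ("start", "p"), ("data", "Part of the second paragraph. "),
    ("include-end", None), ("data", "More of the second paragraph."),
    ("end", "p"),
]

in_include = False  # global flag, toggled only by the two marked-section handlers


def process(stream):
    """Label each piece of data as 'included' or 'plain' using the global flag."""
    global in_include
    in_include = False
    labelled = []
    for kind, value in stream:
        if kind == "include-start":
            in_include = True    # analogous to a marked-section include-start rule
        elif kind == "include-end":
            in_include = False   # analogous to a marked-section include-end rule
        elif kind == "data":
            # analogous to a data-content rule testing the flag
            labelled.append(("included" if in_include else "plain", value))
    return labelled


result = process(events)
for label, value in result:
    print(label, repr(value))
```

The flag stays set across the end of the title and the start of both paragraphs, which is why a single rule bound to one element could not cover the whole include marked section.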