Entities, internal

The OmniMark XML and SGML parsers automatically resolve internal entities. With general text entities, the only kind supported in XML, you cannot tell if a piece of text is the result of an entity being resolved, or if it occurred naturally in the document. With CDATA and SDATA character entities, supported by SGML, it is possible to tell that a piece of text is an entity replacement.

If you are creating XML output from an XML input document, the parser will resolve entities in the input document. If you need equivalent entity references in the output document, you can create them by matching the replacement text of the entities and outputting the appropriate entity references. You do this with translate rules:

  translate "<"
     output "&lt;"
  
  translate ">"
     output "&gt;"
  
  translate "&"
     output "&amp;"

Matching CDATA and SDATA replacement text

When processing SGML, you can detect if a piece of text is the expansion of a CDATA or SDATA entity using the cdata or sdata pattern modifiers:

  translate cdata "OmniMark Technologies"
     output "&om;"
  
  translate sdata "streaming programming model"
     output "&str;"

The first rule will match the text "OmniMark Technologies" only if it is the expansion of a CDATA entity. The second rule will match "streaming programming model" only if it is the expansion of an SDATA entity.

You can match text that is the replacement of a CDATA or SDATA entity (but not the replacement of a general text entity) using the entity pattern modifier:

  translate "Company: " entity "OmniMark Technologies"

This pattern matches if "OmniMark Technologies" is the replacement of a CDATA or SDATA entity. Note that the pattern does not match if "OmniMark Technologies" is the expansion of a general text entity.

You can also match text only if it is not the replacement of a CDATA or SDATA entity using the pattern modifiers non-cdata and non-sdata respectively:

  translate "Company: " non-cdata "OmniMark Technologies"

This pattern will match if the phrase "OmniMark Technologies" is plain data content or the replacement of an SDATA entity, but not if it is the replacement of a CDATA entity.
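
By analogy, a non-sdata version of the same rule might be sketched as follows (the output action is only illustrative):

  translate "Company: " non-sdata "OmniMark Technologies"
     output "Company: OmniMark Technologies"

Here the phrase would be expected to match if it is plain data content or the replacement of a CDATA entity, but not if it is the replacement of an SDATA entity.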

You can match text only if it does not contain any SDATA or CDATA entity replacement text using the pattern modifier pcdata:

  translate pcdata "OmniMark Technologies"

Note that cdata, sdata, and entity work by first finding an entity replacement string of the specified type and then testing to see if the text matches the specified pattern. They do not work by matching the text and then looking to see if it is an entity replacement. Thus the pattern following cdata, sdata, or entity is scanned as a source in its own right. This means:

A pattern prefixed by cdata, sdata, or entity must match the complete replacement text of a single entity. (It is, in effect, a matches test on the replacement text of the entity.)
You can easily write a rule to capture all entity replacements of a particular type. For instance, to capture all CDATA entities you can write: translate cdata any*.
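
As a small sketch of the second point, the rule below captures the replacement text of every CDATA entity in a pattern variable and marks it in the output. The bracketed label is purely illustrative:

  ; the pattern after cdata is tested against the whole replacement text
  translate cdata any* => text
     output "[CDATA entity: " || text || "]"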

pcdata, non-cdata, and non-sdata patterns treat text that is the replacement of an excluded entity type as inherently unmatchable, even by any. This means that you can write a pattern like the following:

  translate non-sdata any* sdata any*

This pattern will match all the text up to the first SDATA entity replacement, followed by the entire replacement text of that SDATA entity.
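
To make the two parts of such a match visible, each can be captured in its own pattern variable. The variable names lead-text and sdata-text and the brace markers are hypothetical choices for this sketch:

  translate non-sdata any* => lead-text sdata any* => sdata-text
     ; lead-text may be empty if the data begins with an SDATA entity
     output lead-text || "{" || sdata-text || "}"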

Matching CDATA and SDATA entities by name

You can also match entity replacement text based on the name of the entity:

  translate cdata named "om"

The above rule will match any text that is the replacement of a CDATA entity named "om".

As with the matching of replacement text, the pattern following named is the equivalent of a matches test on the name of the entity. You can use the pattern to limit which entity names are matched:

  translate entity named letter => name

This rule succeeds for any internal CDATA or SDATA entity whose name consists of a single letter.

You can capture the name of the entity with a pattern variable:

  translate entity named any* => name
     output "&" || name || ";"

This rule matches all CDATA and SDATA entities by name and outputs the equivalent entity.

You can capture both the name and the text to pattern variables like this:

  translate (sdata named any* => name) => text
     output "The value of entity '" || name
         || "' is '" || text || "'."

As in the case for matching the replacement text of an internal CDATA or SDATA entity, the pattern that follows the keyword named must match the whole of an entity's name.

Matching on both name and value

You can match the replacement text of a CDATA or SDATA entity based on both its name and its value. Remember that the entity pattern modifiers work by first identifying an entity replacement string and then testing to see if it meets the specified criteria. Therefore, when you test both the name and the value, the entity must meet both criteria for the pattern as a whole to match. The name test is prefixed by named and the value test is prefixed by valued:

  translate cdata named "om" valued "OmniMark Technologies"

This gives you another way to capture both the name and value of an entity:

  translate cdata named any* => name valued any* => text
     output "The value of entity '" || name
         || "' is '" || text || "'."

Prerequisite Concepts

Entities
