You can use the rtf markup parser function to parse an RTF document. Like an XML document, an RTF document consists of markup and data content. The rtf markup parser function maps the markup structures of RTF to OmniMark's markup rules according to the rules detailed below.
The following program illustrates the operation of the rtf markup parser function by implementing a crude RTF-to-XML converter:
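    import "omrtf.xmd" unprefixed

    process
       ; Parse the RTF file named by the first command-line argument.
       do markup-parse rtf scan file #args[1]
          output "%c"
       done

    ; Write every element as an XML tag, copying its attributes.
    element #implied
       output "<%q"
       repeat over attributes as a
          output " " || key of attribute a || '="%v(a)"'
       again
       output ">%c</%q>%n"

    ; Escape the characters that are markup in XML.
    translate "<"
       output "&lt;"

    translate ">"
       output "&gt;"

    translate "&"
       output "&amp;"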
The only markup rules fired by the rtf markup parser function are element, data-content, and translate.
RTF commands are translated into markup events, according to the following rules.
If an RTF destination command immediately follows an opening brace ({), it starts an element: the RTF command is the element name, and the content of the element is the content of that RTF group. If the RTF command has a numeric value specified, this value is provided as the value of the element's value attribute. If the RTF group in question is marked ignorable (that is, using the \* designator), an attribute named ignorable is provided, with a value of yes. Both the value and ignorable attributes are of type cdata. The element type is any.
Any other RTF command is treated as an empty element (as if written with a closing />), with its value, if any, provided by the value attribute. The attribute types are as above, and the element type is empty.
The \u (Unicode character) command is treated specially. It is treated as an empty element, with the Unicode character number as its value attribute. In addition, it has an alt attribute, which contains the alternative text provided immediately following the \u command. The alt attribute is of type cdata.
Any other RTF group is provided as an element named group_, with the content of the group as its content. The only possible attribute for this element is the ignorable attribute described above, provided if the { is followed by \*. The _ in the name of this element and of the following ones ensures that the element name cannot conflict with any current or future RTF command name.
A special-character command, such as \_ or \~, is provided as an element with the name special_, with the name of the character as the value of the value attribute.
A \ followed by a new line, which is an alternative to the \par command, is provided as a par_ element.
Data in an RTF document can be hexadecimal or binary (with the \bin command). Hexadecimal data that uses the \'xx RTF command is returned as data content. However, there are RTF commands whose content is implicitly hexadecimal. In the latter case, the hexadecimal data is made available as is; the RTF parser has no special knowledge of these commands.
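As an illustration of how these rules can be used, the following is a minimal sketch of a plain-text extractor. It assumes that header destinations such as fonttbl arrive as elements named after their command, as described above. The choice of which destinations to skip is an assumption made for this example, not something the library prescribes, and the sketch does not handle every RTF construct (for instance, ignorable groups and \u elements are simply passed through).

```omnimark
import "omrtf.xmd" unprefixed

process
   do markup-parse rtf scan file #args[1]
      output "%c"
   done

; Assumed for this sketch: these destinations hold no body text,
; so their content is discarded.
element ("fonttbl" | "colortbl" | "stylesheet" | "info" | "pict")
   suppress

; Both \par and a backslash followed by a new line end a paragraph.
element ("par" | "par_")
   output "%n%c"

; Every other element contributes only its content.
element #implied
   output "%c"
```

Because a rule that names an element takes precedence over the #implied rule, the catch-all rule only sees elements that the sketch does not name explicitly.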
The OMRTF library is based on version 1.7 of the RTF specification, according to which the following RTF command names are destinations:
aftncn aftnsep aftnsepc annotation atnauthor atndate atnicn atnid atnparent atnref atntime atrfend atrfstart author background bkmkend bkmkstart buptim category colortbl comment company creatim datafield do doccomm docvar dptxbxtext falt fchars ffdeftext ffentrymcr ffexitmcr ffformat ffhelptext ffl ffname ffstattext field file filetbl fldinst fldrslt fldtype fname fontemb fontfile fonttbl footer footerf footerl footerr footnote formfield ftncn ftnsep ftnsepc g generator gridtbl header headerf headerl headerr htmltag info keycode keywords lchars levelnumbers lfolevel list listlevel listname listoverride listoverridetable listpicture listtable listtext manager mhtmltag nesttableprops nextfile nonesttables objalias objclass objdata object objname objsect objtime oldcprops oldpprops oldsprops oldtprops operator panose pgp pgptbl picprop pict pn pnseclvl pntext pntxta pntxtb printim private pwd pxe result revtbl revtim rsidtbl rtf rxe shp shpinst shppict stylesheet subject tc template title txe ud upr urtf userprops xe
To use OMRTF, you must import it into your program using an import declaration such as:
import "omrtf.xmd" prefixed by rtf.
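With a prefixed import, names exported by the library are qualified with the chosen prefix. The following sketch assumes that the markup parser function is therefore referred to as rtf.rtf; this qualified name is inferred from the import declaration above rather than taken from the library's reference.

```omnimark
import "omrtf.xmd" prefixed by rtf.

process
   ; Assumption: the exported function rtf is reached as rtf.rtf
   ; once the module is imported with the prefix "rtf.".
   do markup-parse rtf.rtf scan file #args[1]
      output "%c"
   done

; Pass the content of every element through unchanged.
element #implied
   output "%c"
```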