Rich Text Format (RTF) (OMRTF)
You can use the rtf markup parser function to parse an RTF document. Like an XML document, an RTF
document consists of markup and data content. The rtf markup parser function maps the markup
structures of RTF to OmniMark's markup rules according to the rules detailed below.
The following program illustrates the operation of the rtf markup parser function by implementing a
crude RTF to XML converter:
import "omrtf.xmd" unprefixed

process
   do markup-parse rtf scan file #args[1]
      output "%c"
   done

; Write every element as an XML tag, copying its attributes.
element #implied
   output "<%q"
   repeat over attributes as a
      output " " || key of attribute a || '="%v(a)"'
   again
   output ">%c</%q>%n"

; Escape the characters that are markup in XML.
translate "<"
   output "&lt;"

translate ">"
   output "&gt;"

translate "&"
   output "&amp;"
How RTF structures are mapped to markup rules
The only markup rules fired by the rtf markup parser function are:
RTF commands are translated into markup events, as follows:
- When an RTF command designated by the RTF 1.7 Specification as a destination command (see list below)
appears at the start of an RTF group (that is, immediately following {), it starts an element:
the RTF command is the element name, and the content is the content of that RTF group. If the RTF command
has a numeric value specified, this value is provided as the value of the element's value
attribute. If the RTF group in question is marked ignorable (that is, using the \* designator),
an attribute named ignorable is provided, with a value of yes. Both the value and ignorable attributes are of type cdata. The element type is any.
- When an RTF command is not specified as a destination, but appears at the start of a group with an
ignorable designator, then it is treated as if it were a destination, in the same manner as discussed
above.
- An RTF command not specified or treated as a destination is an empty element (as if its tag were
ended with />), with its value, if any, provided by the value attribute. The
attribute types are as above, and the element type is empty.
- The \u (Unicode character) command is treated specially. It is treated as an empty
element, with the Unicode character number as its value attribute. In addition, it has an
alt attribute, which contains the alternative text provided immediately following the \u command. The alt attribute is of type cdata.
- An RTF group not recognized as grouping the content of an RTF destination is provided as an element
named group_, with the content of the group as its content. The only possible
attribute for this element is the ignorable one described above, provided if the { is
followed by \*. The _ in the name of this element and the following ones ensures that
the element name cannot conflict with the name of any RTF command.
- An RTF command whose name is a special character rather than a name, such as \_ or
\~, is provided as an element with the name special_, and with the name
as the value of the value attribute.
- An RTF command that consists of \ followed by a new line, which is an alternative to the
\par command, is provided as a par_ element.
- Line ends, which are intended to be ignored by RTF readers, are not returned by the RTF parser.
- All other data is returned as data content, including data encoded in the RTF document as binary (with the \bin command). Hexadecimal data that uses the \'xx RTF command
is also returned as data content. However, there are RTF commands whose content is implicitly
hexadecimal. For those commands, the hexadecimal data is made available as is: the RTF parser has no
special knowledge of them.
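The mapping above means that the generated elements can be handled by ordinary named element rules. The following is a minimal sketch, not part of the OMRTF documentation, that treats par_ like par, flattens anonymous groups, and writes each \u command as a numeric character reference. It reads the value attribute by iterating over the attributes, as the converter above does. It assumes that \u values are non-negative; RTF writers may emit negative values for characters above 32767, and this sketch does not handle those.

```omnimark
import "omrtf.xmd" unprefixed

process
   do markup-parse rtf scan file #args[1]
      output "%c"
   done

; "\" followed by a new line arrives as par_; handle it like \par.
element ("par" | "par_")
   output "%c%n"

; Anonymous groups have no command name: keep only their content.
element "group_"
   output "%c"

; \u: build a numeric character reference from the value attribute.
; Assumption: the value is non-negative (see the note above).
element "u"
   repeat over attributes as a
      output "&#%v(a);" when key of attribute a = "value"
   again
   output "%c"

; Everything else: pass the content through without tags.
element #implied
   output "%c"
```

Because the alternative text of \u is delivered in the alt attribute rather than as data content, the sketch does not need to skip it explicitly.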
RTF commands considered destinations
The OMRTF library is based on version 1.7 of the RTF specification. According to that version,
the following RTF command names are destinations:
aftncn
aftnsep
aftnsepc
annotation
atnauthor
atndate
atnicn
atnid
atnparent
atnref
atntime
atrfend
atrfstart
author
background
bkmkend
bkmkstart
buptim
category
colortbl
comment
company
creatim
datafield
do
doccomm
docvar
dptxbxtext
falt
fchars
ffdeftext
ffentrymcr
ffexitmcr
ffformat
ffhelptext
ffl
ffname
ffstattext
field
file
filetbl
fldinst
fldrslt
fldtype
fname
fontemb
fontfile
fonttbl
footer
footerf
footerl
footerr
footnote
formfield
ftncn
ftnsep
ftnsepc
g
generator
gridtbl
header
headerf
headerl
headerr
htmltag
info
keycode
keywords
lchars
levelnumbers
lfolevel
list
listlevel
listname
listoverride
listoverridetable
listpicture
listtable
listtext
manager
mhtmltag
nesttableprops
nextfile
nonesttables
objalias
objclass
objdata
object
objname
objsect
objtime
oldcprops
oldpprops
oldsprops
oldtprops
operator
panose
pgp
pgptbl
picprop
pict
pn
pnseclvl
pntext
pntxta
pntxtb
printim
private
pwd
pxe
result
revtbl
revtim
rsidtbl
rtf
rxe
shp
shpinst
shppict
stylesheet
subject
tc
template
title
txe
ud
upr
urtf
userprops
xe
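Because each destination in this list becomes an element that contains its whole group, a single rule can discard a destination together with everything inside it. The following is a minimal sketch of a text extractor built that way. The choice of destinations to discard (font table, color table, stylesheet, document information, and pictures) is an illustrative assumption, not a recommendation from the OMRTF documentation.

```omnimark
import "omrtf.xmd" unprefixed

process
   do markup-parse rtf scan file #args[1]
      output "%c"
   done

; Header tables, metadata, and picture data: drop the whole destination.
; The selection of names here is only an example.
element ("fonttbl" | "colortbl" | "stylesheet" | "info" | "pict")
   suppress

; Paragraph breaks become line ends in the output.
element ("par" | "par_")
   output "%c%n"

; All other elements contribute only their data content.
element #implied
   output "%c"
```

Discarding pict also keeps its implicitly hexadecimal content, which the parser returns as is, out of the extracted text.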