You can use the RTF parser library to parse an RTF document. Like an XML document, an RTF document consists of markup and data content. The OmniMark RTF parser library maps the markup structures of RTF to OmniMark's markup rules, as detailed below. You launch the RTF parser just as you would an external XML parser, using do markup-parse, and you access the structure and data content of the RTF file just as you would with an XML file -- by writing markup rules.
The following program illustrates the operation of the RTF parser by implementing a crude RTF to XML converter:
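   import "omrtf.xmd" unprefixed

   process
      ; Parse the RTF file named by the first command-line argument.
      do markup-parse rtf
            scan file #args[1]
         output "%c"
      done

   ; Every element reported by the RTF parser is written out as an XML element.
   element #implied
      output "<%q"
      repeat over attributes as a
         output " " || key of attribute a || '="%v(a)"'
      again
      output ">%c</%q>%n"

   ; Escape the characters that are special in XML data content.
   translate "<"
      output "&lt;"

   translate ">"
      output "&gt;"

   translate "&"
      output "&amp;"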
How RTF structures are mapped to markup rules
The only markup rules fired by the RTF parser are:
RTF commands are translated into XML-like elements, as follows:
- When an RTF command designated by the RTF 1.7 Specification as a
"destination" command (see list below) appears at the start of an RTF group (i.e.
immediately following "{"), it starts an element, with the RTF command
as its element name, and its content is the content of that RTF group.
If the RTF command has a numeric value specified, this value is provided
as the value of the element's "value" attribute.
If the RTF group in question is marked "ignorable" (i.e. using the "\*"
designator), an attribute is provided named "ignorable", with a value of
"yes".
Both the "value" and "ignorable" attributes are of type "CDATA". The
element type is "ANY".
- When an RTF command isn't specified as a destination, but does
appear at the start of a group with an "ignorable" designator, then it's
treated as if it were a destination, in the above manner.
- An RTF command not specified or treated as a destination is an "empty
tag" element (as if its tag were ended with "/>"), with its value, if
any, provided by the "value" attribute.
The attribute types are as above, and the element type is "EMPTY".
- The "\u" (Unicode character) command is treated specially. It's an
"empty tag" element, with the Unicode character number as its "value"
attribute. In addition, it has an "alt" attribute, which contains the
"alternative" text provided immediately following the "\u" command.
The "alt" attribute is of type "CDATA".
- An RTF group not recognized as grouping the content of an RTF
destination is provided as an element, with the name "group_", with the
content of the group as its content. The only possible attribute for
this element is the "ignorable" one described above, if the "{" is
followed by "\*".
The "_" in the name of this element, and of those that follow, ensures
that the element name cannot conflict with the name of any RTF
command.
- An RTF command whose name is a special character rather than a word,
such as "\_" or "\~", is provided as an element with the name
"special_", and with the special character as the value of the "value"
attribute.
- An RTF command that consists of "\" followed by a new line, which is
an alternative to the "\par" command, is provided as a "par_" element.
- Line ends, which are intended to be ignored by RTF readers, are not
returned by the RTF parser.
- All other data is returned as data content, including that encoded in
the RTF document as "binary" (with the "\bin" command).
Hexadecimal data that uses the "\'xx" RTF command is returned as data
content. However, there are RTF commands whose "content" is implicitly
hexadecimal. In the latter case, the hexadecimal data is made available
"as is" -- the RTF parser has no special knowledge of these commands.
RTF commands considered destinations
The OMRTF library is based on version 1.7 of the RTF specification, which designates the following RTF command names as "destinations":
aftncn
aftnsep
aftnsepc
annotation
atnauthor
atndate
atnicn
atnid
atnparent
atnref
atntime
atrfend
atrfstart
author
background
bkmkend
bkmkstart
buptim
category
colortbl
comment
company
creatim
datafield
do
doccomm
docvar
dptxbxtext
falt
fchars
ffdeftext
ffentrymcr
ffexitmcr
ffformat
ffhelptext
ffl
ffname
ffstattext
field
file
filetbl
fldinst
fldrslt
fldtype
fname
fontemb
fontfile
fonttbl
footer
footerf
footerl
footerr
footnote
formfield
ftncn
ftnsep
ftnsepc
g
generator
gridtbl
header
headerf
headerl
headerr
htmltag
info
keycode
keywords
lchars
levelnumbers
lfolevel
list
listlevel
listname
listoverride
listoverridetable
listpicture
listtable
listtext
manager
mhtmltag
nesttableprops
nextfile
nonesttables
objalias
objclass
objdata
object
objname
objsect
objtime
oldcprops
oldpprops
oldsprops
oldtprops
operator
panose
pgp
pgptbl
picprop
pict
pn
pnseclvl
pntext
pntxta
pntxtb
printim
private
pwd
pxe
result
revtbl
revtim
rsidtbl
rtf
rxe
shp
shpinst
shppict
stylesheet
subject
tc
template
title
txe
ud
upr
urtf
userprops
xe
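The role this list plays in naming a group's element can be illustrated with a short lookup. The Python sketch below is a model only, not part of OMRTF: element_name_for_group, GROUP_START, and DESTINATIONS are hypothetical names, and the set holds just a few of the destination names listed above.

```python
import re

# Assumption: a handful of the destination names from the list above.
DESTINATIONS = frozenset(
    {"colortbl", "fonttbl", "footnote", "info", "pict", "rtf", "stylesheet"}
)

# "{", an optional "\*" ignorable mark, then an optional command name.
GROUP_START = re.compile(r"\{(?P<star>\\\*)?(?:\\(?P<name>[a-zA-Z]+))?")


def element_name_for_group(group):
    """Return the element name a group would receive under the documented rules."""
    m = GROUP_START.match(group)
    if m is None:
        raise ValueError("not the start of an RTF group")
    name = m["name"]
    # A listed destination, or any command in an ignorable group, names the element.
    if name is not None and (name in DESTINATIONS or m["star"] is not None):
        return name
    return "group_"


print(element_name_for_group(r"{\fonttbl{\f0 Arial;}}"))  # fonttbl
print(element_name_for_group(r"{\b bold}"))               # group_
print(element_name_for_group(r"{\*\xyz data}"))           # xyz
```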
Functions
omrtf-version
rtf
Platforms
HP/UX
HP-UX Itanium 2
IBM AIX 5.3
IBM OS/2
Linux (Intel)
Linux Red Hat Enterprise 5
MS Windows 98/ME
MS Windows NT/2000/XP
MS Windows Vista
Sun Solaris 8
OmniMark 8.2.0 Documentation Generated: May 6, 2008 at 10:12:26 am
Copyright © Stilo International plc, 1988-2008.