[TransWarp] Template parsing and XML/HTML processing design
Phillip J. Eby
pje at telecommunity.com
Sat Jul 19 16:21:32 EDT 2003
* We will only support parsing well-formed XML, including XHTML. HTML Tidy
can be used to convert HTML documents, but we will not be including it in
PEAK at this time, as the available Python bindings for tidylib are awkward
to build or redistribute.
* Parser: for full document fidelity, we will need to use/require
Expat. SAX parsing interfaces do not supply DOCTYPE or comments. We
probably need to support pass-through of comments, and of the DOCTYPE so
that modern browsers know we're using XHTML.
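
To illustrate the Expat point, here is a minimal sketch using the standard library's xml.parsers.expat bindings, which report both the DOCTYPE declaration and comments. The `collect_events` helper, the event tuples, and the `SAMPLE` document are made up for this example; they are not part of PEAK.

```python
import xml.parsers.expat


def collect_events(xml_text):
    """Parse xml_text with Expat and return a list of (kind, data) events."""
    events = []
    parser = xml.parsers.expat.ParserCreate()

    def start_doctype(name, sysid, pubid, has_internal_subset):
        # Expat hands us the pieces needed to re-emit the DOCTYPE later.
        events.append(("doctype", (name, pubid, sysid)))

    def comment(data):
        events.append(("comment", data))

    def start_element(name, attrs):
        events.append(("start", name))

    def end_element(name):
        events.append(("end", name))

    parser.StartDoctypeDeclHandler = start_doctype
    parser.CommentHandler = comment
    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return events


# Hypothetical sample document, used only for this illustration.
SAMPLE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<!-- keep me --><body/></html>'
)

events = collect_events(SAMPLE)
for kind, data in events:
    print(kind, data)
```

A real parser would feed these events to the tag components rather than collecting them in a list.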
* Tag components: each document element will be mapped to a component,
beginning with a top-level component for the document as a whole. Tags
will support being used as a web behavior (i.e. render() method) and as a
sub-template/view component in another template.
* Sub-templates: when a template is used as a view in another template, it
should be possible to mark the portion(s) of the template to be used as a
subtemplate. In other words, if you have a template page that contains a
head and body, you could mark the body as being the subtemplate, so in the
case where you use that template as a view in another template, the head,
doctype, etc. will not be included in the containing template, which
presumably has its own copies of these things. (Note that for this to work
right, the subtemplate will need to either carry its own xmlns declarations
(if any) or they'll have to match those of the containing template.)
* Tag building interface: tag components will have a tag-building
interface, so that the parser can tell them about their contents. This
interface will probably be adaptable to I_SOXNode_NS or an extension
thereof, so we can reuse some of our existing XML parsing
infrastructure. The interface will also need to have an 'addPattern(name,
component)' method that accepts the contained tags with pattern="name"
attributes, and a way to determine whether the tag is entirely static (i.e.
contains no dynamic content), and if so, retrieve the static XML for that tag.
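
A rough sketch of what such a tag-building interface might look like follows. Only `addPattern(name, component)` is taken from the description above; `addChild` and `staticText` are borrowed from the to-do list later in this message, and the `TagBuilder`, `StaticNode`, and `DynamicNode` classes are hypothetical stand-ins. It does not attempt the I_SOXNode_NS adaptation.

```python
from xml.sax.saxutils import escape, quoteattr


class StaticNode:
    """Hypothetical child whose output never changes."""

    def __init__(self, text):
        self.staticText = escape(text)


class DynamicNode:
    """Hypothetical child that must be rendered at runtime."""

    staticText = None


class TagBuilder:
    """Sketch of a tag component that the parser fills in as it goes."""

    def __init__(self, name, attrs=()):
        self.name = name
        self.attrs = list(attrs)
        self.children = []
        self.patterns = {}

    def addChild(self, node):
        self.children.append(node)

    def addPattern(self, name, component):
        # Several contained tags may share the same pattern name.
        self.patterns.setdefault(name, []).append(component)

    @property
    def staticText(self):
        """Static XML for this tag, or None if any child is dynamic."""
        parts = []
        for child in self.children:
            text = child.staticText
            if text is None:
                return None
            parts.append(text)
        attrs = "".join(" %s=%s" % (k, quoteattr(v)) for k, v in self.attrs)
        return "<%s%s>%s</%s>" % (self.name, attrs, "".join(parts), self.name)


para = TagBuilder("p", [("class", "x")])
para.addChild(StaticNode("a < b"))
static_before = para.staticText

para.addChild(DynamicNode())
static_after = para.staticText
```

Here `static_before` holds the pre-rendered XML, while `static_after` is None because a dynamic child was added; a parent could use that to collapse static subtrees into a single string.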
* Supplying context/properties between tags: if we simply invoke a template
as a view in another template, it will inherit all its properties from the
originating template, rather than from where it is invoked. I'm not sure
if this is really an issue, or whether it will produce any unexpected
results. In effect, it means that a view can only "offer" components or
values to views that are physically in the same page, unless templates used
as views "copy" themselves into the page that uses them. In addition to
lexical context like this, tags may want to supply runtime context data to
their children. I'll have to consider both of these issues further to design the view
invocation interface. Probably, the solution for dynamic context is to
allow passing an "execution context" component from one tag to another at
runtime, using configuration properties to pass values. If a tag has
nothing to add to the property namespace, it just passes the current one on
to its children. I'll treat lexical context of embedded views as a
STASCTAP YAGNI for the framework, because you can always write a custom
view class, and passing lexical properties across templates means you can't
understand a template without understanding the place it's invoked from,
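
One way the "execution context" idea could be sketched is with the standard library's collections.ChainMap: a tag that has something to add layers a new mapping on top, and a tag with nothing to add passes the current context through untouched. The `Node` and `ShowValue` classes and the `provides` argument are assumptions made for this illustration, not a proposed PEAK API.

```python
from collections import ChainMap


class ShowValue:
    """Hypothetical leaf tag that writes one value from the context."""

    def __init__(self, name):
        self.name = name

    def render(self, ctx, write):
        write(str(ctx.get(self.name, "")))


class Node:
    """Hypothetical container tag that may add values for its children."""

    def __init__(self, children=(), provides=None):
        self.children = list(children)
        self.provides = provides or {}

    def render(self, ctx, write):
        # Nothing to add: hand the same context on to the children.
        # Something to add: layer it on top without touching the parent's.
        child_ctx = ctx.new_child(self.provides) if self.provides else ctx
        for child in self.children:
            child.render(child_ctx, write)


tree = Node([
    ShowValue("user"),
    Node([ShowValue("user")], provides={"user": "admin"}),
    ShowValue("user"),
])

output = []
tree.render(ChainMap({"user": "guest"}), output.append)
result = "".join(output)
print(result)
```

The override is visible only inside the nested node; the sibling rendered afterwards sees the original value again.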
* Parser options: the parser will need to know a number of things:
1. Are we parsing HTML? (if so, certain tags need to be told that they
should be rendered empty, e.g. HR, BR, etc.)
2. Will an XML namespace be required on view/model/pattern
attributes? (this will be optional, for convenience/brevity.)
3. Should the document be rendered with, or without, the added template
markup? (This changes how attribute data is supplied to the tag components.)
4. What component is the parent of the top-level document?
#1 can probably be guessed from the DOCTYPE and/or xmlns declarations. #2
can probably be guessed by the absence of an xmlns declaration for our
namespace URI. #4 is going to get passed in anyway. #3 is hard and maybe
a YAGNI anyway.
Probably what we should do is tell the tags about the added markup, because
it's really up to the tag object *how* to roundtrip the markup. At some
point, perhaps we'll add a property that tags can use to determine whether
they should include the roundtrip data in the output. Actually, this might
make an interesting test/demo for skins, since one could create a
'roundtrip' skin that sets the property to include the extra markup.
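
The guesses for options #1 and #2 could be sketched roughly as below. The `TEMPLATE_NS` URI is a placeholder (no namespace URI is given here), and `guess_options` with its argument shapes is hypothetical.

```python
XHTML_NS = "http://www.w3.org/1999/xhtml"
TEMPLATE_NS = "http://example.invalid/template"  # placeholder, not a real URI


def guess_options(doctype_name, root_attrs):
    """Guess parser options from the DOCTYPE name and root element attributes.

    doctype_name is the name from the DOCTYPE declaration (or None), and
    root_attrs maps raw attribute names, including xmlns declarations, to values.
    """
    ns_uris = {
        value
        for key, value in root_attrs.items()
        if key == "xmlns" or key.startswith("xmlns:")
    }
    # Option 1: treat the document as HTML if either hint says so.
    is_html = (doctype_name or "").lower() == "html" or XHTML_NS in ns_uris
    # Option 2: only require namespaced attributes if our namespace is declared.
    require_namespace = TEMPLATE_NS in ns_uris
    return {"is_html": is_html, "require_namespace": require_namespace}


plain = guess_options("html", {"xmlns": XHTML_NS})
namespaced = guess_options(None, {"xmlns:t": TEMPLATE_NS})
```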
Okay, I think that's enough for now. So it looks like the interface to-do list is:
* Tag as builder of a subdocument: addPattern(name,node), addChild(node)
* View (i.e., tag factory): __call__(parent, tagName=, attribItems=,
viewProperty=None, modelPath=None, patternName=None, nonEmpty = False) --
is there anything else that should be there? XML Namespace mapping
information, maybe? Source file/line/column? That'd help with debugging;
maybe tags should set __traceback_info__ to include that info during execution.
* Tag as document node: 'staticText' attribute (=None if dynamic)
* Tag as behavior: render(interaction), which calls...
* Tag as subtemplate: renderTo(interaction, writeFunc, currentModel,
And the concrete classes will need to include Tag, Text, and Literal, where
Text is plain text (and thus needs escaping/entity encoding), and Literals
are used to represent doctype, comments, processing instructions, DTD
definitions, etc. The parser will create Tag, Text, or Literal instances
for everything that isn't marked with a 'view' attribute; for those, the
view name will be looked up as a property on the enclosing tag to retrieve
a view (tag factory) object. It may be that we will want to support
adapting various constant types (e.g. strings, numbers, etc.) to tag
factories, so that skin-supplied property values can be embedded in a
template at compile time. But that's optional.
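
As a sketch of the node-creation rule just described: Text escapes its data, Literal passes it through verbatim, and an element with a 'view' attribute gets its factory by looking the view name up on the enclosing tag. The `make_node` function, the `lookup` method, the `properties` dictionary, and the `Banner` view are all assumptions made for this example; the real property lookup would go through PEAK's configuration system.

```python
from xml.sax.saxutils import escape


class Text:
    """Plain text: must be escaped on output."""

    def __init__(self, data):
        self.data = data

    @property
    def staticText(self):
        return escape(self.data)


class Literal:
    """Doctype, comment, PI, etc.: emitted exactly as parsed."""

    def __init__(self, data):
        self.data = data

    @property
    def staticText(self):
        return self.data


class Tag:
    def __init__(self, parent, tagName, attribItems, properties=None):
        self.parent = parent
        self.tagName = tagName
        self.attribItems = attribItems
        self.properties = properties or {}  # stand-in for config properties

    def lookup(self, name):
        """Find a property on this tag or the nearest enclosing tag."""
        tag = self
        while tag is not None:
            if name in tag.properties:
                return tag.properties[name]
            tag = tag.parent
        raise KeyError(name)


class Banner(Tag):
    """Hypothetical view class supplied as a property."""


def make_node(parent, tagName, attribItems):
    """Create a plain Tag unless a 'view' attribute names a tag factory."""
    view_name = dict(attribItems).get("view")
    factory = parent.lookup(view_name) if view_name else Tag
    return factory(parent, tagName, attribItems)


root = Tag(None, "html", [], properties={"banner": Banner})
plain_tag = make_node(root, "p", [])
view_tag = make_node(root, "div", [("view", "banner")])
```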
Rather than hardcoding even the Tag, Text, and Literal types, we should
probably look them up as properties too, although this would probably be
quite slow. What we could do instead is define required attributes
'tagFactory', 'textFactory', and 'literalFactory' on the "tag as
subdocument builder" interface. These could map to properties, but would
cache the lookup for subsequent and nested textual units, thus avoiding
huge numbers of property lookups going all the way up to the config
root. Anyway, this would let a view redefine how contained text and
literals were turned into objects.
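
The caching idea might be sketched like this: each builder resolves a factory once, asking its parent (which caches in turn) only on the first access. Only the three attribute names come from the text above; the `Builder` class, the `DEFAULT_FACTORIES` stand-ins, and the `lookup_count` counter are illustrative assumptions.

```python
# Stand-in defaults; real ones would be the Tag, Text, and Literal classes.
DEFAULT_FACTORIES = {
    "tagFactory": dict,
    "textFactory": str,
    "literalFactory": str,
}


class Builder:
    """Sketch of a subdocument builder with cached factory lookups."""

    def __init__(self, parent=None, **overrides):
        self.parent = parent
        self.overrides = overrides
        self._cache = {}
        self.lookup_count = 0  # counts uncached lookups, for demonstration

    def _factory(self, name):
        if name in self._cache:
            return self._cache[name]
        self.lookup_count += 1
        if name in self.overrides:
            value = self.overrides[name]
        elif self.parent is not None:
            value = self.parent._factory(name)  # parent caches its answer too
        else:
            value = DEFAULT_FACTORIES[name]
        self._cache[name] = value
        return value

    @property
    def tagFactory(self):
        return self._factory("tagFactory")

    @property
    def textFactory(self):
        return self._factory("textFactory")

    @property
    def literalFactory(self):
        return self._factory("literalFactory")


root = Builder()
view = Builder(root, textFactory=str.upper)  # a view redefining text handling
leaf = Builder(view)

texts = [leaf.textFactory("hi") for _ in range(3)]
root_text = root.textFactory("hi")
```

After the three calls, `leaf.lookup_count` is 1: only the first access walked up to `view`, and the later ones were served from the leaf's own cache.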