[PEAK] Organizing documentation of Python code

Phillip J. Eby pje at telecommunity.com
Wed Sep 22 14:13:01 EDT 2004


My personal inclination is to document Python code in-line, rather than 
writing separate documentation, wherever possible.  In-line docs are 
easier to update when you change something, and there's less chance of 
forgetting to update them.

But, Python has some limitations when it comes to producing this kind of 
documentation, compared to, say, Perl.  Perl's POD format makes it easy to 
use a more "literate programming" style, where you group elements together 
logically, interspersing code and documentation to tell the "story" of the 
module.

In principle, you can do this with Python, but in practice, it only works 
if you read the original source code.  This is because the two kinds of 
documentation generation tools for Python both suck at literate 
programming.  :)

Specifically, the two kinds are object-extracting tools and source-reading 
tools.  Object-extracting tools like 'pydoc' and 'epydoc' are the most 
common.  They basically import the module(s) to be documented, and then 
inspect the objects to pull out documentation.  Their typical flaw is that 
they understand only a few kinds of objects, and produce ludicrous 
results when trying to document PEAK's descriptors, metaclasses, 
interfaces, and so on.  So, they lack flexibility and extensibility.
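
To make that concrete, here's a crude sketch of the general approach -- 
not any real tool's code, just the shape of it:

    import inspect

    def extract_docs(module):
        # Walk an already-imported module's namespace and pull
        # docstrings from the kinds of objects we recognize.
        docs = {}
        for name, obj in vars(module).items():
            if inspect.isclass(obj) or inspect.isfunction(obj):
                docs[name] = inspect.getdoc(obj)
            # Anything else -- descriptors, metaclass-created objects,
            # interfaces -- falls through unrecognized, which is exactly
            # where such tools produce ludicrous results or nothing.
        return docs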

A second, perhaps even more fatal flaw of these tools is that they document 
things "as-is" from the point of view of Python.  By that, I mean that they 
are limited by the structure of the objects formed by the code, and do not 
see the underlying organization of a module or class.  I usually group 
related classes together in a module, and related methods or attributes in 
a class.  This information is lost when you extract data from a module or 
class dictionary, instead of reading the source code.
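
Here's a tiny demonstration of the loss; the module text is made up, but 
the effect is real:

    source = '''
    # --- Parsing ---
    class Parser: "Reads input."
    class Token:  "One lexical unit."

    # --- Output ---
    class Writer: "Emits results."
    '''

    namespace = {}
    exec(source, namespace)    # roughly what importing a module does
    # Only a flat name->object mapping survives; the "Parsing" vs.
    # "Output" grouping existed solely in the source text.
    print(sorted(k for k in namespace if not k.startswith('_')))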

But source-reading tools are few and far between, because you have to be 
able to write a parser or figure out how to use Python's internal parser in 
order to write one.  Really, HappyDoc is the only practical source-reading 
tool out there that I know of.  But even though it reads the source, it 
*also* discards the ordering information, and doesn't understand 
descriptors or interfaces or metaclasses.

Finally, neither kind of tool understands PEAK's API organization, where 
users are generally discouraged from importing directly from a package's 
contained modules.  In PEAK, we don't necessarily want to document where an 
API is coded, but rather where it should be used from.
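
For example (the module paths here are only meant to illustrate the 
convention):

    # Recommended: import from the package's public API module...
    from peak.api import binding

    # ...rather than from the module where the code actually lives:
    from peak.binding.components import Component

    # A documentation tool should file things under the first location,
    # while still recording that the second one defines them.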

For a long time, I've considered writing a documentation tool, but I've 
always been hung up on the issue of writing a suitable source parser, 
because I assumed that was the only way to get the sequence/grouping 
information needed.  But, a source-reading tool is also inherently limited 
in flexibility by what it can parse.  So, I've never really taken any steps 
to implement it.

What I realized from today's exchange with Duncan about a "quick reference" 
generator, is that there's a third alternative: create a documentation API 
that's invoked within a module, to bypass the limitations of today's 
object-extracting tools.

Specifically, Python module and class dictionaries don't preserve sequence 
or grouping, and some objects currently can't be documented.  (For example, 
you can't document integer or string constants in a module, except by 
putting text in the module docstring.)  Last, but not least, such tools 
localize documentation in the "defining" module, rather than the 
"exporting" module (e.g. PEAK's API mdoules), and can't handle esoteric 
relationships such as "X implements Y" or "A requires B".

So here's the idea: create a "documentation kernel" API that does the 
following things (a rough code sketch follows the list):

* Allows you to register docstrings for otherwise-undocumentable items in a 
namespace

* Allows you to create categorized indexes of a namespace's contents

* Allows you to nest indexes (e.g. class within a module)

* Allows you to record arbitrary metadata and relationships between items 
(using synthetic keys to represent the objects, to avoid object lifetime/GC 
issues)

* Allows you to direct the documentation for a given namespace to actually 
appear in another namespace, while retaining identity as to the defining 
namespace
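
Very roughly, and with every name here invented purely for illustration, 
such a kernel might look like this:

    class DocIndex:
        # Hypothetical documentation kernel: all it does is record facts.

        def __init__(self):
            self.docstrings = {}   # (namespace, name) -> text
            self.categories = {}   # (namespace, category) -> [names]
            self.relations  = []   # (kind, subject_key, object_key)
            self.exports    = {}   # defining namespace -> exporting one

        def document(self, namespace, name, text):
            # Register a docstring for an otherwise-undocumentable item
            self.docstrings[(namespace, name)] = text

        def categorize(self, namespace, category, *names):
            # Create/extend a categorized index of a namespace's contents;
            # nesting falls out of dotted namespaces like "module.Class"
            self.categories.setdefault((namespace, category), []).extend(names)

        def relate(self, kind, subject_key, object_key):
            # Record e.g. ("implements", key_of_X, key_of_Y); the keys
            # are synthetic, to avoid object lifetime/GC issues
            self.relations.append((kind, subject_key, object_key))

        def export_to(self, defining, exporting):
            # Documentation for `defining` should appear under
            # `exporting`, while retaining its defining identity
            self.exports[defining] = exporting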

Then, build a "convenience API" that lets you easily use short function 
calls in a module body to classify the module or class contents, attach 
docstrings to attributes and constants, indicate that certain items are 
part of a given interface's implementation, and so on.
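
Given a kernel like the DocIndex sketched above, the convenience layer 
could be quite thin; again, all names here are hypothetical:

    import sys

    _index = DocIndex()    # the kernel sketched above

    def doc(name, text):
        # Attach a docstring to a constant or attribute of the caller
        module = sys._getframe(1).f_globals['__name__']
        _index.document(module, name, text)

    def group(category, *names):
        # Declare that the named items form one logical group
        module = sys._getframe(1).f_globals['__name__']
        _index.categorize(module, category, *names)

    # Then, in some module's body:
    MAX_RETRIES = 3
    doc("MAX_RETRIES", "How many times to retry before giving up.")
    group("Networking knobs", "MAX_RETRIES")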

Then, expand that convenience API by *invoking it from PEAK's own APIs*, 
so that documentation for a PEAK component can include, for example, what 
configuration keys it offers or requires.  In other words, when a PEAK 
component is defined, it would invoke the documentation API to record this 
sort of metadata.
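
For instance (continuing the hypothetical sketches above -- none of these 
names are real PEAK APIs), a component-defining helper could record its 
documentation as a side effect:

    def config_key(name, doc_text):
        # Define a configuration key, and as a side effect tell the
        # documentation kernel that the defining module offers it
        module = sys._getframe(1).f_globals['__name__']
        _index.document(module, name, doc_text)
        _index.categorize(module, "Configuration keys offered", name)
        return name    # stand-in for whatever a real key object would be

    DB_URL = config_key("DB_URL", "Where to find the database.")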

Because this approach is based on a simple kernel API that just records 
information, it should be easy to add to any new APIs.  For example, 
peak.security APIs could record security documentation, and so on.

Finally, we could then write documentation formatters that simply take 
data from the documentation indexes, and output it in different 
formats.  Such formatters would need to be configurable as to what metadata 
they extract, of course, and what they do with it, especially for 
relationship data (e.g. inheritance trees).
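
A formatter would then be little more than a loop over the recorded 
indexes.  A minimal sketch, again using the hypothetical DocIndex from 
above:

    def format_plain(index, namespace):
        # Emit one namespace's categorized contents as plain text; a
        # real formatter would be configurable about which metadata
        # (relationships, inheritance trees, ...) it pulls and how.
        lines = []
        for (ns, category), names in sorted(index.categories.items()):
            if ns != namespace:
                continue
            lines.append(category + ":")
            for name in names:
                text = index.docstrings.get((ns, name), "(undocumented)")
                lines.append("  %s -- %s" % (name, text))
        return "\n".join(lines)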

For some objects, it's probably better not to insert documentation API 
calls into them directly, but instead to process them only when 
documentation is needed.  I think that to do this, you'd basically go 
through all the modules you were documenting, adapt each target object 
to a "documentation contributor" interface, and ask it to update the 
appropriate indexes.
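
In code, that pass might look something like this sketch, with 'adapt' 
standing in for whatever adaptation machinery actually gets used:

    class IDocContributor:
        # Hypothetical interface: something that knows how to
        # contribute its own documentation to the indexes
        def contribute(self, index):
            raise NotImplementedError

    def document_module(module, index, adapt):
        # `adapt` stands in for the adaptation machinery (e.g. a
        # PyProtocols-style adapt(obj, protocol, default)); objects
        # that can't be adapted are simply skipped
        for name, obj in vars(module).items():
            contributor = adapt(obj, IDocContributor, None)
            if contributor is not None:
                contributor.contribute(index)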

Indeed, that approach seems particularly useful for documenting object 
relationships, since the extent of a given set of relationships can't be 
known until all the relevant objects have been imported.  (For example, the 
fact that Adapter X adapts from type Y to interface Z can't be documented 
unless X, Y, and Z are all in memory.)  Also, using the actual objects to 
manage relationships would mean we wouldn't need synthetic keys, or 
otherwise have to deal with relationship data unless a documentation tool 
is actually being run.

Also, if we used generic functions instead of adapters to do this 
generation, it would be possible to define new, optional kinds of 
metadata.  For example, one could add an extra generation rule 
to scan a function's bytecode or source code for exceptions raised, and 
then create a link between the function object and the exception object in 
the relationship index.  However, the documentation formatter's 
configuration could determine whether this additional rule would ever be 
used, e.g.:

     [when("cfg.allows('function-raises') and item in FunctionType")]
     def generate_doc(cfg, item, index):
         # etc.

...thus avoiding doing complex generation of unneeded data.

Hm.  Really, I guess the pre-generation API only needs to deal with data 
that can't reasonably be extracted after-the-fact, like sequencing, 
grouping, and undocumentables.

Anyway, the basic idea here is that I think this is a way to get some of 
the features in a documentation system that I previously thought could 
only be obtained by parsing source code.  I'm going to have to give some 
more thought to the specifics of the indexing scheme, particularly with 
respect to how categories should be grouped at various levels, what are 
global (systemwide) vs. local (package-specific) categories, and so on.  To 
do this, I'll need to think about different kinds of output we'd want to 
have, and then determine what kind of indexes are needed to support them.

I think it's also possible we'll want a way to incorporate external text 
files into a larger documentation scheme, such that a combination 
developer's guide and API reference could be generated from the code plus 
the external files.

Such a tool would be useful not only for PEAK itself, but also for 
applications developed with PEAK.  After all, an "enterprise" toolkit 
ought to make it 
easy for teams within an organization to share and maintain code.

Of course, as I mentioned earlier, none of these ideas should stop us from 
getting a quick-and-dirty "quick reference" generator up and running, 
especially since I have so many other items before this one on my to-do 
list.  I just wanted to record this idea for future reference.



