[TransWarp] Towards a query theory, part 3: The Final Algorithm?

Tue Oct 14 20:02:13 EDT 2003

Okay, I'm back, now with a little bit of relational algebra and calculus 
under my belt, probably just enough to be dangerous.  I'll be changing some 
of the terminology I've been using to match the terminology of relational 
theory, where doing so will let me be more precise.  Notably, what I've 
been calling a "table alias" is now a "relation variable", and what I've 
been rather interchangeably calling contexts and variables are now "domain 
variables", which may be named or anonymous.

Relation variables (RV's)are derived from other relation variables by means 
of relational operators (joins, project, select, etc.).  In case you're not 
familiar with the terms:

select(rv,filter) -- filter rows by a logical expression over the

project(rv,attrs) -- keep only the columns named by 'attrs'

natural_join(rv,rv) -- new RV with all columns from the original RV's, 
joining on columns with matching names, if any.  (If none, this is a 
cartesian product).

rename(rv,oldName,newName) -- RV with renamed column

Note, by the way, that kjbuckets' remap() operation can be used to perform 
a rename and project in the same operation.  IOW, a more general operator 
for both functions is:

remap(rv,graph) -- RV with columns renamed, copied, or eliminated, 
according to the info in 'graph', which is a many-to-many mapping.

A "domain variable" (DV) represents a column: it has a name and a 
type.  For our purposes, every RV is a collection of DV's.  A 
manually-defined DV will be called a "parameter".  Parameters may be 
designated as input, output, or correlation.  Correlation parameters are 
used only to correlate different parts of a query, and may be automatically 
generated in the process of creating a query.  Output parameters designate 
the "result columns" of the query, and can also be used for correlation 
within the query.

In what is called the "domain relational calculus", queries are expressed 
by a collection of "free variables" (output parameters in our terminology), 
over a logical expression about relation variables that the DV's come 
from.  The logical expression can also be a "quantified expression": an 
'exists' or 'all' subquery, listing a set of "bound" or "quantified" 
variables that would correspond to automatically generated correlation 
parameters.

For our purposes, it suffices to define expression nodes that are 
relational operators such as select(), remap(), and natural_join(), in 
order to construct a query expression.

The algorithm I'll try to formulate now is the algorithm for transforming a 
filter expression (that contains traversals, logical expressions, 
comparisons, and parameters) into a query, where the query is a relational 
variable that expresses the semantics of the original filter, and has all 
of the filter's output parameters in its output columns.

The basic algorithm is that each filter node type will have a 
'buildQuery(contextRV,contextDV)' method that will return a query RV.  The 
'contextRV' is an RV representing the context in which the filter is being 
used.  'contextDV' is a column in the contextRV indicating the current DV 
to which any comparison operators should be applied.  Logical expression 
nodes will pass both parameters through to their children unchanged; 
traversal operators will pass modified values to their children.

For example, if we create an Employee.where() query, our starting context 
RV should represent the Employee relation (i.e. table/view), our context DV 
would be the primary key, and we would pass that into the root node of any 
filter being applied to Employee, to build a query from it.

The AND() operator builds a query RV by performing a natural join on the 
RV's returned by its children (after eliminating duplicate RV's, and 
AND()-ing the select() filters of RVs that differ only in their select() 
filter).

The NOT() operator negates the select() filter of the RV returned by its child.

The OR() operator constructs an outer join of the RV's returned by its 
children, after eliminating duplicate RV's and OR()-ing the select() 
filters of RVs that differ only in their select() filter.

Traversal operators (including IN, EXISTS, and aggregates) change what 
contextRV and contextDV is passed to child nodes.  This is done by creating 
a new RV bound to the appropriate table, and then annotating it to 
facilitate joins.  For example, a simple one-to-one join traversal operator 
would create an RV for the destination table, then annotate the join 
column(s) as anonymous DV's.  The traversal operator then does something 
like this:

def buildQuery(self, contextRV, contextDV):

     newRV, newDV = self.makeNewRVfor(contextRV,contextDV)
     newRV = self.subexpression.buildQuery(newRV, newDV)

     return natural_join(newRV, remap(contextRV,self.contextMap))

So, the result is the context RV joined to whatever construct came up from 
the subexpression node under the traversal operator.  If you think about 
this for a minute, you'll see that this is a sort of "depth associative" 
accumulation of joins, starting at the lowest subexpression level.  Of 
course, different traversal operators will do different join types; notably 
outer joins, exists, and aggregates have very different RV combinators.

(Note, by the way, that a traversal may simply change the contextDV to a 
non-primary-key field, and not do anything else.  Only when traversing from 
one table to another is the contextRV changed, and join operations performed.)

All in all, this algorithm appears to have very few special cases, if the 
input filter is reasonably "normalized" before conversion.  An input filter 
is normalized if it consists of FEATURE()'s that contain AND()'s that 
contain OR()'s, that contain normalized filters.  Such construction should 
minimize the number of duplicated intermediate RV's created by 
subexpression nodes, that then have to be identified and merged higher up 
in the tree.

The main special case that still remains is to handle correlations that 
span "unnatural" joins, such as EXISTS(), outer joins, inequalities, and 
aggregates.  This is handled by adding a "ghost column" (actually, an input 
parameter) to a returned RV whenever an inequality comparison is made to a 
parameter.  When higher-level equijoins combine various lower-level RVs 
composed of exotic join types, the correlation parameter will become 
available from another portion of the query, or else bubble up to the top 
of the query as a required input parameter.

Wow.  I think we've almost got it -- the main thing left is to identify all 
the traversal and join operators, and make sure there are no hidden 
gotchas.  Then, we need to look at how to "simplify" RV's to drop 
table-alias RV's that don't contribute anything to the output RV.  (This is 
so that we can do things like "virtual views", which are just RV's that 
join/remap relational tables to an OO view, but without having to always 
join all the tables involved!)  Next, we'll need to examine how to take a 
"simplified" RV and split it up along database lines, and confirm that we 
can execute the resulting query.

Finally, we'll need to establish an API to convert a query into a callable 
that's pre-computed and optimized for a specific database, so that it can 
be used as a binding on a Specialist to create methods for it.  Such 
methods are useful for common-case queries that are usually relatively 
simple, but are run over and over again.  One doesn't want such queries to 
have "compilation" overhead.

Oh, and I almost forgot...  I need to double-check that SQL can be cleanly 
generated from all possible structures created by this algorithm.  I 
suspect this will not necessarily be the case, since most SQL dialects are 
not capable of expressing arbitrary relational calculus operations.  But 
such queries shouldn't be common, and the same logic used to do 
cross-database joins will probably also be usable to combine multiple SQL 
queries into a final result.