[TransWarp] Towards a query theory, part 3: The Final Algorithm?
Phillip J. Eby
pje at telecommunity.com
Tue Oct 14 20:02:13 EDT 2003
Okay, I'm back, now with a little bit of relational algebra and calculus
under my belt, probably just enough to be dangerous. I'll be changing some
of the terminology I've been using to match the terminology of relational
theory, where doing so will let me be more precise. Notably, what I've
been calling a "table alias" is now a "relation variable", and what I've
been rather interchangeably calling contexts and variables are now "domain
variables", which may be named or anonymous.
Relation variables (RV's)are derived from other relation variables by means
of relational operators (joins, project, select, etc.). In case you're not
familiar with the terms:
select(rv,filter) -- filter rows by a logical expression over the
project(rv,attrs) -- keep only the columns named by 'attrs'
natural_join(rv,rv) -- new RV with all columns from the original RV's,
joining on columns with matching names, if any. (If none, this is a
cartesian product).
rename(rv,oldName,newName) -- RV with renamed column
Note, by the way, that kjbuckets' remap() operation can be used to perform
a rename and project in the same operation. IOW, a more general operator
for both functions is:
remap(rv,graph) -- RV with columns renamed, copied, or eliminated,
according to the info in 'graph', which is a many-to-many mapping.
A "domain variable" (DV) represents a column: it has a name and a
type. For our purposes, every RV is a collection of DV's. A
manually-defined DV will be called a "parameter". Parameters may be
designated as input, output, or correlation. Correlation parameters are
used only to correlate different parts of a query, and may be automatically
generated in the process of creating a query. Output parameters designate
the "result columns" of the query, and can also be used for correlation
within the query.
In what is called the "domain relational calculus", queries are expressed
by a collection of "free variables" (output parameters in our terminology),
over a logical expression about relation variables that the DV's come
from. The logical expression can also be a "quantified expression": an
'exists' or 'all' subquery, listing a set of "bound" or "quantified"
variables that would correspond to automatically generated correlation
parameters.
For our purposes, it suffices to define expression nodes that are
relational operators such as select(), remap(), and natural_join(), in
order to construct a query expression.
The algorithm I'll try to formulate now is the algorithm for transforming a
filter expression (that contains traversals, logical expressions,
comparisons, and parameters) into a query, where the query is a relational
variable that expresses the semantics of the original filter, and has all
of the filter's output parameters in its output columns.
The basic algorithm is that each filter node type will have a
'buildQuery(contextRV,contextDV)' method that will return a query RV. The
'contextRV' is an RV representing the context in which the filter is being
used. 'contextDV' is a column in the contextRV indicating the current DV
to which any comparison operators should be applied. Logical expression
nodes will pass both parameters through to their children unchanged;
traversal operators will pass modified values to their children.
For example, if we create an Employee.where() query, our starting context
RV should represent the Employee relation (i.e. table/view), our context DV
would be the primary key, and we would pass that into the root node of any
filter being applied to Employee, to build a query from it.
The AND() operator builds a query RV by performing a natural join on the
RV's returned by its children (after eliminating duplicate RV's, and
AND()-ing the select() filters of RVs that differ only in their select()
filter).
The NOT() operator negates the select() filter of the RV returned by its child.
The OR() operator constructs an outer join of the RV's returned by its
children, after eliminating duplicate RV's and OR()-ing the select()
filters of RVs that differ only in their select() filter.
Traversal operators (including IN, EXISTS, and aggregates) change what
contextRV and contextDV is passed to child nodes. This is done by creating
a new RV bound to the appropriate table, and then annotating it to
facilitate joins. For example, a simple one-to-one join traversal operator
would create an RV for the destination table, then annotate the join
column(s) as anonymous DV's. The traversal operator then does something
like this:
def buildQuery(self, contextRV, contextDV):
newRV, newDV = self.makeNewRVfor(contextRV,contextDV)
newRV = self.subexpression.buildQuery(newRV, newDV)
return natural_join(newRV, remap(contextRV,self.contextMap))
So, the result is the context RV joined to whatever construct came up from
the subexpression node under the traversal operator. If you think about
this for a minute, you'll see that this is a sort of "depth associative"
accumulation of joins, starting at the lowest subexpression level. Of
course, different traversal operators will do different join types; notably
outer joins, exists, and aggregates have very different RV combinators.
(Note, by the way, that a traversal may simply change the contextDV to a
non-primary-key field, and not do anything else. Only when traversing from
one table to another is the contextRV changed, and join operations performed.)
All in all, this algorithm appears to have very few special cases, if the
input filter is reasonably "normalized" before conversion. An input filter
is normalized if it consists of FEATURE()'s that contain AND()'s that
contain OR()'s, that contain normalized filters. Such construction should
minimize the number of duplicated intermediate RV's created by
subexpression nodes, that then have to be identified and merged higher up
in the tree.
The main special case that still remains is to handle correlations that
span "unnatural" joins, such as EXISTS(), outer joins, inequalities, and
aggregates. This is handled by adding a "ghost column" (actually, an input
parameter) to a returned RV whenever an inequality comparison is made to a
parameter. When higher-level equijoins combine various lower-level RVs
composed of exotic join types, the correlation parameter will become
available from another portion of the query, or else bubble up to the top
of the query as a required input parameter.
Wow. I think we've almost got it -- the main thing left is to identify all
the traversal and join operators, and make sure there are no hidden
gotchas. Then, we need to look at how to "simplify" RV's to drop
table-alias RV's that don't contribute anything to the output RV. (This is
so that we can do things like "virtual views", which are just RV's that
join/remap relational tables to an OO view, but without having to always
join all the tables involved!) Next, we'll need to examine how to take a
"simplified" RV and split it up along database lines, and confirm that we
can execute the resulting query.
Finally, we'll need to establish an API to convert a query into a callable
that's pre-computed and optimized for a specific database, so that it can
be used as a binding on a Specialist to create methods for it. Such
methods are useful for common-case queries that are usually relatively
simple, but are run over and over again. One doesn't want such queries to
have "compilation" overhead.
Oh, and I almost forgot... I need to double-check that SQL can be cleanly
generated from all possible structures created by this algorithm. I
suspect this will not necessarily be the case, since most SQL dialects are
not capable of expressing arbitrary relational calculus operations. But
such queries shouldn't be common, and the same logic used to do
cross-database joins will probably also be usable to combine multiple SQL
queries into a final result.
More information about the PEAK
mailing list