[TransWarp] Towards a query theory, part 5: Object-Relational Mapping

Phillip J. Eby pje at telecommunity.com
Wed Oct 15 21:25:13 EDT 2003


Let's take a moment and circle back to the place this all began: John 
Landahl's TableDM and discussion of queries.

We want to use the mechanisms described thus far to allow us to map an 
object model onto a relational database, and conversely, to apply 
relational queries to our object models.  Can we do this?

Let's start with filters.  Filters are applied to some starting context: an 
RV representing some data type, such as Employee or Customer.  We've been 
talking about them as though the data type was an actual table, but of 
course it can be any RV...  so the Employee "table" could actually involve 
various joins, column renames, and type conversions.  As relational 
operators are applied to this RV by a filter, the built-in optimization 
will push column removals and criteria downward, and type conversions 
upward.  The end result is a precisely-optimized query *based on the 
underlying relational schema*!
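
To make that concrete, here's a minimal sketch of the kind of pushdown I 
mean.  None of these names (BaseTable, Remap, Select) are real TransWarp 
code; they just show a criterion passing through a column-renaming RV so it 
ends up phrased against the physical column names:

class BaseTable:
    def __init__(self, name, columns):
        self.name, self.columns = name, columns

    def select(self, criteria):
        # criteria have reached the physical table; nothing left to push
        return Select(self, criteria)

class Select:
    def __init__(self, rv, criteria):
        self.rv, self.criteria = rv, criteria

class Remap:
    # A column-renaming RV; its select() pushes criteria below the rename.
    def __init__(self, rv, **newToOld):
        self.rv, self.newToOld = rv, newToOld

    def select(self, criteria):
        # restate each criterion in terms of the underlying column names,
        # then let the wrapped RV handle it
        pushed = dict(
            (self.newToOld.get(col, col), value)
            for col, value in criteria.items()
        )
        return Remap(self.rv.select(pushed), **self.newToOld)

# The logical "Employee" RV is really a renamed physical table...
emp_tbl = BaseTable('EMP_MASTER', ['EMPNO', 'SURNAME', 'CITY_CD'])
Employee = Remap(emp_tbl, employeeId='EMPNO', name='SURNAME', livesIn='CITY_CD')

# ...so selecting on the logical name ends up selecting on EMP_MASTER.CITY_CD:
query = Employee.select({'livesIn': 'NYC'})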

Now you can see why I've been pushing so hard for certain properties in 
this theory, such as a purely functional model in which all objects are 
essentially immutable.  This lets us create *one* RV expression for each 
class in our model, one that determines the mapping from the "ideal" 
representation to the actual physical representation.  Because it's 
immutable, we can "share" (or at least reuse) the same instance for every 
query we want to do over that object type.  In principle, we can also 
describe one physical schema in terms of another, which might be useful for 
dealing with name changes in a schema that are required when moving between 
databases with different reserved names.

So let's talk briefly about the "ideal" representation of an object in 
relational form.  It should be a representation that can automatically be 
determined from information in the peak.model metadata for the object 
type.  Presumably, this representation would keep the same column names as 
are used for the corresponding attribute names, except where the attribute 
is a collection, in which case the collection would be represented as 
another "table" with an appropriate join mechanism to the main 
table.  Bidirectional one-to-many relationships would use a standard 
foreign key mechanism, while many-to-many relationships would have a 
(logical) join table.
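
As a rough illustration of how mechanical this derivation could be, here's a 
sketch that uses a toy metadata format (a list of tuples) in place of 
whatever peak.model actually provides; the 'oid' surrogate key and the 
table-naming convention are assumptions, not decisions:

def idealTables(className, attributes):
    """Derive 'ideal' logical tables from simple attribute metadata.

    attributes: list of (name, isCollection, isManyToMany) tuples.
    Returns a dict mapping logical table name -> list of column names.
    """
    scalar = [name for (name, isColl, m2m) in attributes if not isColl]
    tables = {className: ['oid'] + scalar}   # assume a surrogate key column

    for name, isColl, m2m in attributes:
        if isColl and m2m:
            # many-to-many: a (logical) join table of two foreign keys
            tables['%s_%s' % (className, name)] = ['owner_oid', 'item_oid']
        elif isColl:
            # one-to-many: the collection as a logical table of FK pairs
            tables['%s_%s' % (className, name)] = ['owner_oid', 'item_oid']

    return tables

# e.g. an Employee with scalar attributes plus a 'supervises' collection:
employeeTables = idealTables('Employee', [
    ('name',       False, False),
    ('livesIn',    False, False),
    ('bornIn',     False, False),
    ('supervises', True,  False),
])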

Given this "ideal" form, it should then be possible to use the relational 
operators to transform the actual database schema into the "ideal" 
form.  We would then need a mechanism to allow us to find the correct RV 
expression for a given type (or perhaps collection) within our 
schema.  Perhaps we could do this via adaptation to a protocol representing 
membership in the target schema.
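
For the moment, a dumb registry shows the shape of the lookup, even though 
the real mechanism would presumably be protocol adaptation rather than a 
dict:

class Schema:
    # Sketch only: maps object types to their RV expressions within this
    # (logical or physical) schema.
    def __init__(self):
        self._rvForType = {}

    def register(self, objectType, rvExpression):
        self._rvForType[objectType] = rvExpression

    def rvFor(self, objectType):
        # "adapt" the type to its RV within this schema
        return self._rvForType[objectType]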

Thus, to retrieve all the "flat" data for a particular object type, a DM 
would only need to retrieve the appropriate RV, with a 'select()' added to 
pick the right object(s), and possibly a remap() to drop any attributes that 
should be lazily loaded.  (Hmmm...  lazy-loading is an implementation-side 
hint, so the DM shouldn't know this if it's independent of the physical 
schema.  There's 
probably another adaptation issue present here.)
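
The retrieval step itself might look something like this sketch, where every 
name (rvFor, eagerAttributes, executeQuery, and the select/remap signatures) 
is an assumption rather than an existing interface:

def loadState(dm, oid):
    # every name here is assumed for the sketch, not an existing DM interface
    rv = dm.schema.rvFor(dm.objectType)       # the type's mapped RV
    rv = rv.select({'oid': oid})              # pick the right object(s)
    rv = rv.remap(keep=dm.eagerAttributes)    # drop lazily-loaded attributes
    return dm.executeQuery(rv)                # back end turns the RV into SQL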

Anyway, if we also make it possible to apply INSERT, UPDATE, and DELETE 
operations to an RV, then mapping from objects back to the database is also 
doable.  And voila!  Physical storage independence at the logical 
abstraction layer for all possible backends, including ones that span 
different physical databases.

So what's missing?  Well, I just did some handwaving regarding type 
conversions, not to mention INSERT/UPDATE/DELETE.  Let's tackle type 
conversions (aka functions) first.

In my last segment, I mentioned the need to deal with functions.  But I was 
considering them a sidetrack, not something too important for the core 
model.  As it happens, I was wrong.  In order to wrap an O/R mapping into a 
collection of relational operators, it's a critical capability.  Heck, it's 
a critical capability just for writing backend-independent queries that do 
date/time manipulations!  (Most databases have a mutually incompatible 
mishmash of date/time types.)

Let's say I want to do something as simple as finding items billed between 
two dates.  I'd definitely prefer to express the query in terms of Python 
date/time types, specifically whatever ones my logical model uses.  Of 
course, that means I'll need to have conversion functions to map to/from 
the types used by my physical database.  And, if we use the "functions 
float as high as possible" idea from my last posting, then we will only 
convert retrieved data to Python types once it's retrieved from the 
database.  (That is, our top-level RV will convert the values in each row 
returned from the underlying SQL or other query.)

But, what about criteria?  If we only float conversions up in the operator 
tree, then the conversion will become "stuck" at the point where our date 
range conversion occurs, and our query will end up retrieving all data from 
the underlying DB, regardless of billing date, then use Python to convert 
the date field, and then check whether the date is in range.  This isn't 
acceptable for our intended purposes.

What can we do?  Most type conversions *must* take place at the Python 
level, because if the database could convert to the desired type, we could 
just use the right type in the first place!  Since the conversion must take 
place in Python, our only option in searching for a date range is to 
convert the date range to the type expected by the database.  In other 
words, we must perform *inverse type conversion* on all constants and input 
parameters.
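
For example, assuming (purely for illustration) a backend that stores dates 
as 'YYYYMMDD' strings while the logical model uses datetime.date:

from datetime import date, datetime

def db_to_date(s):
    # forward conversion: applied in Python to each retrieved row
    return datetime.strptime(s, '%Y%m%d').date()

def date_to_db(d):
    # the inverse: applied to constants and input parameters in criteria,
    # so the comparison itself can still happen inside the database
    return d.strftime('%Y%m%d')

# A date range phrased against the logical model...
lo, hi = date(2003, 1, 1), date(2003, 9, 30)

# ...gets its *constants* converted, rather than converting every row:
criterion = ('billedOn', 'BETWEEN', date_to_db(lo), date_to_db(hi))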

The algorithm would probably need to be part of the mechanism used to float 
functions upward.  Whenever we encounter a comparison that would otherwise 
keep the function from floating further upward, we can check if the 
function has an *inverse function*.  If so, we can apply the inverse 
function to the opposite side of the comparator, as long as it is a 
constant or input parameter.  If it's a constant, we can apply it 
immediately and compute the new constant.  If it's a parameter, we need to 
somehow annotate it with the function conversion.
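
A sketch of that rewrite step might look like the following; Apply, Const, 
Param, and EQ here are invented stand-ins for whatever the real operator 
nodes turn out to be:

class Apply:
    # f(column), where f is a conversion with a known inverse
    def __init__(self, func, inverse, operand):
        self.func, self.inverse, self.operand = func, inverse, operand

class Const:
    def __init__(self, value):
        self.value = value

class Param:
    def __init__(self, name, convert=None):
        self.name, self.convert = name, convert   # conversion annotation

class EQ:
    def __init__(self, lhs, rhs):
        self.lhs, self.rhs = lhs, rhs

def pushInverse(comparison):
    # If the LHS is f(column) and f is invertible, move the conversion to
    # the RHS so the comparison can sink below the Python/DB boundary.
    lhs, rhs = comparison.lhs, comparison.rhs
    if isinstance(lhs, Apply) and lhs.inverse is not None:
        if isinstance(rhs, Const):
            # constant: fold it immediately
            return EQ(lhs.operand, Const(lhs.inverse(rhs.value)))
        if isinstance(rhs, Param):
            # parameter: annotate it with the conversion to apply at bind time
            return EQ(lhs.operand, Param(rhs.name, convert=lhs.inverse))
    return comparison   # nothing to do; the function stays where it is

# e.g. EQ(Apply(db_to_date, date_to_db, 'billedOn'), Const(date(2003, 1, 1)))
# becomes EQ('billedOn', Const('20030101')) once the inverse is applied.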

The mechanism needed for this isn't clear to me yet, and neither are input 
parameters.  Most of yesterday and today's 
architecture is almost at "code-level clarity" (i.e., I could sit down and 
try to write it), but input parameters in general are still an odd 
beast.  I'll come back to them in a minute, after verifying that 
conversions are handled.

Okay, so we can float functions and conversions upward, flipping them to 
their inverse where they touch criteria, so that the maximum number of 
functions appear at the boundary between Python and an underlying DB 
schema.  (Note that even if a DB implements a function directly, we still 
want it to do the computation at the last possible moment, to minimize the 
number of invocations.)

As an alternate way of looking at functions and type conversion, we could 
perhaps view an invertible function as a kind of virtual table (the way 
Gadfly 2 does).  To implement a function, we could simply join from the 
applicable table to the function's table.  Since joins are flattened to 
whatever extent is possible, and everything else tries to go below joins, 
then the functions will naturally "float to the top".

If a function is represented by joining an RV operator, then our normal 
attempt to push criteria to their most-specific RV will find us comparing 
one of the function's "columns" to our date range.  So, when the function's 
.select() is applied with constant logical expressions on the converted 
date column, the function can optimize this by performing the inverse 
conversion on the constants, and applying them to the unconverted column.

This is a little bit messy, since we really want the criteria to move back 
to one of the columns joined with the conversion table.  There may be a 
clean way to do this in the logic that moves criteria down from a join, or 
more likely, in the logic that moves remaps (projections) downward.

When we move a remap down through a join, we could drop the conversion 
tables and copy their criteria back to the column(s) joined on.  However, 
there would probably need to be a specialized join type for this, since 
under normal circumstances you can't just toss out a joined table just 
because you're not retrieving any non-joined columns.  That *is* safe with 
outer joins, though, so we can model functions and type conversions as an 
outer join to a virtual table.
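
Very roughly, and with invented class names and an assumed remap(keep=...) 
signature, the representation might look like:

class FunctionTable:
    # a two-column virtual table (input, output) for an invertible function
    def __init__(self, name, func, inverse):
        self.name, self.func, self.inverse = name, func, inverse
        self.columns = ['input', 'output']

class FunctionJoin:
    # an RV outer-joined to a FunctionTable on one of its columns
    def __init__(self, rv, column, funcTable):
        self.rv, self.column, self.funcTable = rv, column, funcTable

    def remap(self, keep):
        # If the converted output column isn't among the columns being kept,
        # the outer join to the virtual table can simply be dropped; that's
        # why an *outer* join is used: discarding it can't lose any rows.
        if self.funcTable.name not in keep:
            return self.rv.remap(keep)
        return self   # real code would also keep pushing the remap downward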

So now we can have a standard RV type representing a function or type 
conversion.  If the function is supported by a database to which the query 
is applied, we can turn the outer join and virtual table back into a function 
application.  I'm not 100% clear on how that will work; we'll look at it 
again when we look at SQL generation algorithms.  Outer join syntax (and 
even the ability to do them in the first place) often varies quite a bit 
between databases, though, so it's quite possible that the table->function 
restoration operation will be *easier* than the code generation for *real* 
outer joins!

By the way, in case you're wondering why I opted to explore the virtual 
table and outer join route, it's for two reasons:

1. It eliminates the need to somehow "mark a parameter as needing type 
conversion", and

2. It allows us to express queries on *expressions* and *object methods* -- 
even ones with parameters -- as part of our fundamental query language!

You see, now that we know how to express a function as a table, and apply 
it to a column by way of an outer join, we are not limited to simple 
two-column conversion functions, oh no.  We can define any object method 
M(self,p1,p2,...pN) as a table with N+2 columns, e.g. 
(Result,self,p1,p2,...,pN).  If there is a native implementation of this 
method for a given back-end, we can provide it as a usable base RV.  If 
not, then a Python implementation of the function can be provided.
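
A sketch of that representation (the MethodTable shape and the per-backend 
registry are assumptions, and 'yearsOfService' is a made-up example method):

class MethodTable:
    # a method M(self, p1, ..., pN) viewed as a virtual table with columns
    # (Result, self, p1, ..., pN)
    def __init__(self, methodName, paramNames, pythonImpl=None, nativeRVs=None):
        self.methodName = methodName
        self.columns = ['Result', 'self'] + list(paramNames)
        self.pythonImpl = pythonImpl        # fallback computed in Python
        self.nativeRVs = nativeRVs or {}    # back end -> native base RV

    def implementationFor(self, backend):
        # prefer a native base RV when this back end provides one
        return self.nativeRVs.get(backend, self.pythonImpl)

# e.g. Employee.yearsOfService(asOf) becomes a (Result, self, asOf) table
# that queries can outer-join to:
yearsOfService = MethodTable('yearsOfService', ['asOf'])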

In practice, you will not often use this method aspect, and when you do, 
the RV method table will need to also include columns for any other input 
data used by the method.  (That is, the 'self' column has to be replaced 
with a collection of columns representing other attributes used to 
implement the method.)

But, you *can* use this for expressions.  Addition, subtraction, 
multiplication, division, etc.  Once again, outer joins to the virtual 
tables, that SQL generation can convert back to infix operators.  And get 
this...  if you search for 'Some.field+1 == 2', it'll get converted 
internally to 'Some.field == 1', because of the constant-folding mechanism 
described previously.  Pretty cool, eh?
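
Here's a toy version of that folding step for an "addition" virtual table, 
just to show the arithmetic; nothing here is real code:

class AddTable:
    # arithmetic "virtual table" for 'operand + addend == result'
    def __init__(self, addend):
        self.addend = addend                 # e.g. the literal 1

    def foldEquals(self, column, constant):
        # comparing the result column to a constant folds back onto the
        # operand column:  result == c  <=>  operand == c - addend
        if column == 'result':
            return ('operand', constant - self.addend)
        return (column, constant)

plus_one = AddTable(1)
plus_one.foldEquals('result', 2)    # -> ('operand', 1), i.e. Some.field == 1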

So let's see if input parameters are cleaner now.  Presumably, each RV will 
simply include its parameters as part of its column list.  The parameters 
would come from joined tables and from parameters used by selection 
criteria.  Since parameter-based criteria float down to the table they 
apply to, input parameters will first appear at their usage point, and be 
copied upwards to the top of the query.

At any point, a parameter may be "required" (i.e. the RV can't execute 
without a value for it), or "optional" (it can be specified, but the RV can 
run without it.)  When a join pairs an optional and a required parameter, 
the parameter is optional with respect to the joined RV, so parameters stop 
being required as soon as they can be joined with another column that's 
labelled the same.  When this happens, the criteria that were applied to 
the parameter in the table where it was "required" should be rephrased as a 
join condition at the point where the parameter became optional.

Suppose we treated parameters as virtual tables, too?  Then, when we 
compare to a parameter (as an inequality), this is the same as creating a 
join between the current RV and the parameter table, and the criterion 
stays at the join.  If we compare equal to a parameter, that's the same as 
a remap equating the column to the parameter, which is the "optional" form 
of parameter.  As joins are flattened into larger joins, the "required" 
parameter table will move up in the join, until/unless it is joined with 
its "optional" form, at which point the inequality can be changed to 
reflect this.

Let's see how this would work in our old "same city different country" 
query from part 1...  beginning with the conceptual form:

Employee1
     + - lives in City1
     + - was born in Country1
     + - supervises Employee2
                       + - lives in City1
                       + - was born in Country2
                                          + - is not Country1

Restated as a "friendly" Python form:

aCity = City.where()
aCountry = Country.where()

coResidentForeignSupervisors = Employee.where(
     livesIn = aCity,
     bornIn  = aCountry,
     supervises = Employee.where(
         livesIn = aCity,
         bornIn  = IS_NOT(aCountry)
     )
)

And restated as a primitive filter:

crfs = AND(
     FEATURE('livesIn', EQ(VAR('aCity'))),
     FEATURE('bornIn', EQ(VAR('aCountry'))),
     FEATURE('supervises',
         AND(
             FEATURE('livesIn', EQ(VAR('aCity'))),
             FEATURE('bornIn', IS_NOT(VAR('aCountry'))),
         )
     )
)

Let's convert that to relational operators now, by calling 
crfs.buildQuery() with an "Employee" RV...

First, the AND() passes the employee RV down to each feature's 
buildQuery().  livesIn and bornIn both map to the same RV, and just tag the 
appropriate columns as being the named parameters.  The 'supervises' 
feature takes another reference to the Employee RV (it's a self-join), and 
remaps it so that the 'supervisor' column is also labelled with an 
autogenerated correlation parameter name.  It then passes this relabelled 
Employee RV down to the next AND() object.

The inner AND() object passes the RV down to the two features, which of course 
apply to the same table.  The EQ() labels 'livesIn' with 'aCity', but the 
IS_NOT() can't do this for 'aCountry', so it joins to a parameter table for 
'aCountry', using an IS_NOT join criterion, and then returns the joined 
(and relabelled) table.

This table is returned by AND() to the FEATURE('supervises'), which now 
joins it with the original Employee RV that's been annotated with the 
'aCity'/'aCountry' parameter names.  So, our structure looks like:

supervisor = remap(EmployeeRV, aCity='livesIn', aCountry='bornIn', ...)

supervised = theta_join(
     IS_NOT('bornIn','aCountry'),
     remap(EmployeeRV, aCity='livesIn', privateCorrelation='supervisor', ...),
     parameterTable('aCountry')
)

the_query = natural_join(
     remap(supervisor, privateCorrelation='employeeId',...),
     supervised
)

The '...' in the remaps are to indicate that all other columns are kept as 
is, e.g. foo=foo, bar=bar, for all columns previously present.  The final 
step before execution is to apply a remap that deletes all non-output columns.

Okay, so when we join the supervisor and supervised intermediate RV's, we 
are going to combine the natural join and the theta join, ending up with a 
theta join that looks like:

theta_join(
     AND(
         IS_NOT('t2.bornIn','t3.aCountry'),
         EQ('t1.livesIn', 't2.livesIn'),
         EQ('t1.employeeId','t2.supervisor'),
         EQ('t1.bornIn', 't3.aCountry')
     ),
     remap(supervisor, privateCorrelation='employeeId',...),
     remap(EmployeeRV, aCity='livesIn', privateCorrelation='supervisor', ...),
     parameterTable('aCountry'),
)

At this point, we can drop the parameter table from the join, resulting in:

theta_join(
     AND(
         IS_NOT('t2.bornIn','t1.bornIn'),
         EQ('t1.livesIn', 't2.livesIn'),
         EQ('t1.employeeId','t2.supervisor')
     ),
     remap(supervisor, privateCorrelation='employeeId',...),
     remap(EmployeeRV, aCity='livesIn', privateCorrelation='supervisor', ...),
)

Which is just what we wanted.  So it's doable, although far from obvious 
how to fit this cleanly into the framework so it happens 
automatically.  Oddly, this still seems cleaner to me than special "input 
parameter" columns, because this approach naturally causes the parameter to 
float up with its criteria still attached, putting it in a good position to 
be combined with another instance of the parameter.  The "input parameter" 
approach seems like it forces you to reach down into an underlying RV and 
yank out its criteria.

One other result of going through this exercise, by the way, is the 
realization that our filter-to-relational conversion algorithm isn't quite 
finished.  The algorithm in part 3 had a slightly different notion of what 
constituted an RV, from the very precise model we have now.  Traversals 
need to operate a little differently, and there are some steps that were 
left out.  For example, each traversal that generates a new "table alias" 
has to put a remap on it so its columns are named with an 'alias.column' 
pattern.

The new twists mean that the so-called "Final Algorithm" needs to be 
revised.  Part of the problem is that our filters concept mixes relational 
operators in a somewhat unnatural fashion.  We need to separate the join 
structure from the criteria structure, so that buildQuery() returns three 
things:

1. an RV
2. a logical expression that should be applied to the RV() as a .select()
3. a list of output parameters, that should be used to .remap() the RV, 
leaving only the outputs in the result

Only the "original" caller of buildQuery() should actually apply the remap 
or select, because the criteria or parameter list may otherwise be incomplete.
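
The top-level call, then, would look roughly like this, with every name 
assumed for the sake of the sketch:

def runFilter(filterExpr, schemaRV, traverser, backend):
    # the "original" caller is the only place select() and remap() happen
    rv, logex, outparms = filterExpr.buildQuery(schemaRV, traverser)
    rv = rv.select(logex)                  # criteria applied once, at the top
    rv = rv.remap(keep=list(outparms))     # leave only the declared outputs
    return backend.execute(rv)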

So, now traversal operators can modify the RV as it goes down the tree, and 
should provide context to their children that tells them what alias name(s) 
to use for columns.  Logical operators combine their children's logical 
expressions according to the appropriate operator, and pass the other two 
items back up unchanged.  Actually, it's more like:

def buildQuery(self, contextRV, contextDV):
    # combine the children's logical expressions, threading the RV through
    # each child and accumulating all of their output parameters
    outparms = {}
    terms = []
    for child in self.children:
        contextRV, logex, parms = child.buildQuery(contextRV, contextDV)
        terms.append(logex)
        outparms.update(parms)

    return contextRV, self.__class__(*terms), outparms

That is, we keep each child's version of the RV as the version we pass to 
the next child, and we keep all the parameters from each child.  Notice 
that contextDV remains unchanged throughout, because a logical operator has 
no effect on the current traversal position.

Also, what I previously called the contextDV is probably actually going to 
be some sort of "schema traverser" object that understands how to find the 
right column for a feature, and/or create or join to a new table 
alias.  This traverser would know what its current table alias name is so 
it can generate column names.  Traversal operators would act on this 
traverser, to get a new traverser to pass to child expressions, and modify 
the RV they pass down accordingly.  Traversals only occur downward, so 
traversal ops throw away the new traverser when they return, and they 
return their child's return values unchanged.
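
A rough sketch of what such a traverser might look like (the schema helpers 
newAliasFor() and joinFeature() are assumed, not designed yet):

class SchemaTraverser:
    def __init__(self, schema, alias):
        self.schema = schema        # knows how features map to columns/RVs
        self.alias = alias          # current table alias, e.g. 't1'

    def columnFor(self, featureName):
        # alias-qualified column name for a feature of the current type
        return '%s.%s' % (self.alias, featureName)

    def traverse(self, featureName, contextRV):
        # a traversal creates (or joins to) a new table alias; the child
        # expressions get a new traverser, and the RV is modified to match
        newAlias = self.schema.newAliasFor(featureName)         # assumed helper
        newRV = self.schema.joinFeature(contextRV, self.alias,
                                        featureName, newAlias)  # assumed helper
        return SchemaTraverser(self.schema, newAlias), newRV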

Comparison operators create a version of themselves with an explicit 
operand, using the current traverser (contextDV).  They return this as 
their 'logex' return value, leaving the contextRV unchanged in the 
return.  (Unless they are an equality comparison and the RHS is a 
parameter, in which case they remap the contextRV to include the equality.)

VAR() operators remap their contextRV to indicate that the current column 
is equal to the variable, then pass it down to their child.  On the return, 
if the VAR() is flagged for 'show', then they also add the variable name to 
the output parameter list.
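
Sketching those two cases under the three-value contract (and assuming the 
traverser can report the column for the current position via a 
currentColumn() it doesn't really have yet, and ignoring VAR's child 
traversal for brevity):

TRUE = ('TRUE',)   # a logical expression that selects everything

class VAR:
    def __init__(self, name, show=False):
        self.name, self.show = name, show

    def buildQuery(self, contextRV, contextDV):
        column = contextDV.currentColumn()                    # e.g. 't1.livesIn'
        contextRV = contextRV.remap(**{self.name: column})    # column == var
        parms = {}
        if self.show:
            parms[self.name] = column      # flagged for output
        return contextRV, TRUE, parms

class IS_NOT:
    def __init__(self, operand):
        self.operand = operand

    def buildQuery(self, contextRV, contextDV):
        # return ourselves with an explicit operand; the RV is unchanged
        column = contextDV.currentColumn()
        return contextRV, ('IS_NOT', column, self.operand), {}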

And so it goes, until the entire logical structure is created.  So, our 
previous example should return:

rv = theta_join(
     AND(
         EQ('t1.employeeId','t2.supervisor'),
         EQ('t1.livesIn','t2.livesIn'), # courtesy of matching labels
     ),
     remap(EmployeeRV, aCity='livesIn', aCountry='bornIn', ...),
     remap(EmployeeRV, aCity='livesIn', ...)
)

logex = AND(
     EQ('t1.livesIn', VAR('aCity')),
     EQ('t1.bornIn', VAR('aCountry')),
     EQ('t2.livesIn', VAR('aCity')),
     IS_NOT('t2.bornIn',VAR('aCountry')),
)

outparms = {}

We now perform 'rv.select(logex)' which will apply the criteria as close to 
the underlying tables as possible.  Specifically, it will push the criteria 
into the theta join's AND(), and the theta_join will push down any criteria 
that point to a single table, after attempting to unify any parameters with 
the same table as the opposite side of the criterion, or failing that, any 
table that provides the parameter.  If the parameter is not present, then 
it's an error: the parameter was never defined in a way that was usable.

So, by waiting to apply selection criteria until the entire join structure 
is complete, we don't have to do anything funky to unify correlation or 
input parameters.  No crazy joins, no special handling to detect required 
vs. optional parameters, nothing.  Very cool.

We still need our traversal system to detect when it should autogenerate an 
EXISTS() join, but that's practically an implementation detail.  Functions 
and data type conversions are part of the join structure, and so are 
effectively part of traversal.  Any comparison applied to a function output 
gets the right explicit operand for the generated name of the output 
column.  Then, when the criteria are applied to the final join structure, 
the pushdown and possible constant-folding occurs.

So what have I left out?  INSERT/UPDATE/DELETE, of course, but also some 
details of the constant-folding mechanism for function 
tables.  Specifically, the algorithm I've given so far for doing it only 
works if the converted value isn't used in the output.  So really, there's 
going to need to be special handling so that when the value *is* used in 
the output (e.g. we request the billing date, *and* that it be between two 
dates), the constant-folding still takes place and the criteria move to the 
original table column.

But I think that's a subject to tackle in a future missive, where I'll deal 
with traversal operators, join construction, and join folding in more detail. 



