[TransWarp] Towards a query theory, part 1: filters and correlation

Tue Oct 14 15:26:58 EDT 2003

At 11:43 AM 10/14/03 -0500, Ian Bicking wrote:
>On Tuesday, October 14, 2003, at 11:12 AM, Phillip J. Eby wrote:
>>Generating SQL from "easy" queries involves a lot of details, but is 
>>straightforward, on the whole.  Mostly it's a process of walking the 
>>query to extract table aliases, joins, and criteria.
>>
>>Here are PEAK's "hard" query requirements:
>>
>>* Support correlations (e.g. the "find employees that lives in the same 
>>city as their supervisor, but was born in a different country" example)
>
>If SQLObject were extended to do this, it might look like:
>
>var = Employee.var
>query = AND(var.address.city == var.supervisor.address.city,
>             var.birthplace.city != var.supervisor.birthplace.city)

Here's an interesting question for you.  How would you know in this case 
that the two 'var.supervisor' references are to the same object?  Yes, I 
know it's a one-to-one relationship here, but if it were a one-to-many, the 
semantics get interesting.  :)

>SQLObject's basic metaphor is one of tightly encapsulated row/objects, so 
>we have to produce objects, not arbitrary tuples/relations.  I.e., we have 
>to find instances of Employee, not just cities, or a combination of 
>employees and their supervisors.

Understood.  For purposes of conceptual simplification, I'm considering 
selecting an object to mean selecting the *primary key* of the object, not 
any of its attributes.  This allows me to translate to "standard" 
relational theory.

>>* Support aggregation (count/min/max/avg) over groups (GROUP BY in SQL)
>
>I would not put this in the query.  It doesn't effect which objects we 
>find, but rather what results we return.  Maybe if we wanted the total 
>salary paid, listed according to supervisor:
>
>var = Employee.var
>select(var.supervisor, groupBy=var.supervisor, aggregate=SUM(var.salary))
>
>Implementing that, though, would be challenging.  Maybe easier if select() 
>was given more explicit information about what it was expected to do.

Indeed.  One of my goals here is to get a SQL-independent query language 
with *at least* the expressive power of SQL.  But, due to a limited time 
budget, I will either crack this problem this week, or else decide on a 
known-to-be-crackable subset of the problem.  Most likely, that would be 
the "easy" subset.

>>* Support cross-database joins, where an object's data resides in more 
>>than one system, or where some of an object's features are the result of 
>>computation in Python over data retrieved from an external source.
>
>Well, this is just a nice feature that's hard.  If you support Python 
>execution you can always do it brute-force.  If you want to do it right 
>you need a query optimizer.  Holding out for a single theory of queries 
>that supports this on top of everything else seems like a hurdle that is 
>too high to leap (at least in one bound).

Well, the published work on Gadfly 2 indicates that if you're willing to 
live with only equijoins and AND() criteria (only allowing OR(), NOT(), and 
inequalities to be applied to a single table at a time), it's actually 
rather simple to support.  The greedy-optimization "find next thing to 
join" algorithm in Aaron's IPC 8 paper looks like just the thing.  He just 
didn't explain how to do cross-table OR/NOT/inequalities, UNION, outer 
joins, aggregation, etc.

I see in principle how to extend this, using a concept of "input tables" 
and the "output table" for a query.  All the input tables from a single 
database can be grouped into a subquery whose output table is then used as 
an input table in the top-level query.  The subquery is then converted into 
the query language of the target database.

It seems to me that the other "missing features" of Gadfly 2 are similarly 
implementable by breaking a query into subqueries that implement the 
feature by combining input tables into a "virtual" input table for the 
containing query.  But, it may be that there are easier ways to do aspects 
of this.

For example, if an OR() contains criteria against more than one table, the 
OR() should be used as a filter against the query's output table.  If 
different branches of the OR() include joins of their own, they have to be 
marked as outer joins.  Single-table inequality comparisons that are not 
children of an OR() expression can be used as filters directly against the 
input table.  Similar restrictions apply to NOT() nodes - if they contain 
subexpressions dealing with more than one table, they have to be applied to 
the output rather than the input.

UNION, EXISTS, outer joins, aggregation, and the like seem to me to require 
"virtual" tables to supply alternate join semantics, however.  For example, 
a UNION table would "join" two input tables by merging them.  An 
aggregation would "flatten" an input table to a single row, or to rows per 
distinct combination of key columns.

For single-database SQL generation, it's sufficient to have data structures 
that encode these concepts, since they can be mapped back to SQL.  For 
multi-database query generation, I think each kind of virtual table would 
need some optimization rules for how it would join itself to other tables 
at the equijoin level.  And, virtual tables that encapsulate tables from 
more than one database are also likely to be "interesting".

Of course, we could consider our top-level output query to be just another 
"virtual table", one that happens to be based on equijoining its input 
tables.  So, its input tables must support whatever interface is needed to 
do an equijoin with that table, just as each other table type's input 
tables must support whatever interface is needed for that kind of 
join/aggregation/whatever.

Hm.  It seems to me that implementing this concept would require, at minimum:

1. An algorithm to map logical/traversal/etc. operators into "table 
operators" (equijoins, outer joins, EXISTS, aggregates, remap, etc.) and 
filters (logical expressions composed of comparisons).

2. An understanding of each table operator to sufficient to allow its 
implementation

3. Defining interfaces that table operators should provide for use by an 
enclosing table operator.

In a later e-mail, I'll begin an analysis of items 1 and 2.