Aggregates in QUEL

                         Document #: US-13262,EN
------------------------------------------------------------------------------

Major subject: quel    Minor subjects: quel_syntax

Keywords: aggregates, quel, tech_notes

Abstract:
Aggregates in QUEL: Equivalent to Release 6 Technical Notes 
#22 and #23, or Release 5 Technical Notes #29 and #30.


Expert note:
Aggregates in QUEL
==================

Overview
--------
Aggregates are a powerful construct in INGRES  to  summarize  data.  The
purpose  of  this document is  to explain how to use aggregates
effectively through the use of example queries. To restrict the size  of
this  document,  no data is presented. Readers are urged to set up their own
relations to try out the examples and check the results.

Aggregates are of two main  types,  simple  aggregates  and  partitioned
aggregates  (these  are  referred  to  as  'aggregate  functions' in the
manual).  Simple aggregates return a single value which can  be  listed,
referenced in a where clause or embedded in expressions and treated like
any other numeric value.  Partitioned  aggregates  (those  with  a  BY
clause)  return  a  set  of  values and most often this set of values is
'joined' back to a base relation by the  query  referencing  the  parti-
tioned aggregate.

o	Aggregate Syntax
o	Aggregates in the WHERE Clause
o	Aggregates With an Embedded WHERE Clause
o	Simple Aggregates with Joins
o	The  Aggregate 'Rules'
o	Aggregate Functions (or Partitioned Aggregates)
o	Partitioned Aggregates in the WHERE Clause
o	Partitioned Aggregates with Embedded WHERE Clauses
o	Multicolumn and Aggregate BY Clauses
o	The ANY Aggregate
o	Aggregates and Joins
o	Summary


Aggregate Syntax
----------------
An aggregate has the following syntax:

     AGG_OP (AGG_EXPR [by BY_EXPR] [where BOOL_EXPR] )

Simple aggregates are those that do not have a BY_EXPR.

e.g.    AVG(r.salary WHERE r.age>50)

here the AGG_OP  is  AVG  (the  averaging  aggregate)  the  AGG_EXPR  is
r.salary and the BOOL_EXPR is r.age>50.

Aggregates can  be  used  in  appends,  replaces,  deletes  as  well  as
retrievals.   Also  they can be used in view definitions to define 'vir-
tual' columns involving calculations that are done  dynamically  as  the
view  is  referenced.   This  note  will  concentrate  on examples using
retrieves.

The AGG_OPs available are

     COUNT MIN MAX AVG SUM ANY

Also there are 'prime' AGG_OPs  that only consider unique values of  the
argument(AGG_EXPR):

     COUNTU SUMU AVGU

Note:   "COUNT" counts the existence of rows. The column  'counted'  may
be of any datatype.

"ANY" is a special case of COUNT that returns  the  value  1  if  a  row
exists,  or zero if it does not exist. It is a more efficient version of
count since it stops counting as soon as the count becomes 1.

"MIN" and "MAX" work on  all  datatypes  including  date  and  character
strings but in most cases will require a scan of the relation (unless it
is a qualified aggregate).

"AVG" and "SUM" work only on numeric datatypes (integer,float,money)

Consider the following relation (range variable e):

     employees(name,dept,age,salary)

which lists employees names with the department they work in, their  age
and current salary.

Consider the following sequence of queries:

What is average age of all employees?

     RETRIEVE(a=AVG(e.age))

How many years difference between the oldest and youngest employee?

     RETRIEVE(diff=(MAX(e.age)-MIN(e.age)) )

The prime aggregates are used to count only unique values of  the  argu-
ment:

How many departments are there?

     RETRIEVE(c=COUNTU(e.dept))

Using count above would simply give the number of employees, not depart-
ments.

Aggregates can also be used in expressions:

What is the highest ratio of salary to age?

     RETRIEVE(m=MAX(e.salary/e.age))


Aggregates in the WHERE Clause
------------------------------
Moving onto more involved queries where aggregates are used to  restrict
which rows are returned by the query:

Who earns the maximum salary?

     RETRIEVE(e.name) WHERE e.salary=MAX(e.salary)

The principal rule here is that the aggregate gets evaluated once,  then
is  used  as  a constant in the where clause. Consequently, the relation
would be visited twice to execute this query.

By how much does each employees earn more than the average?

     RETRIEVE(e.name,diff=e.salary-AVG(e.salary))
             where e.salary>AVG(e.salary)

In the above query, simple aggregates can be used in  calculations  like
any  other expression. Note that since QUEL does not allow 'local' vari-
ables the reference to 'AVG(e.salary)' has to be made twice. However the
aggregate  optimizer  can  recognize  this and execute it once only (but
only if they are at the same nesting 'level').

Who earns a median salary (closest to average)?

     RETRIEVE(e.name) WHERE abs((e.salary-AVG(e.salary)))
                             =MIN(e.salary-AVG(e.salary))

The average would need to be evaluated twice in the above query  because
it  appears  at different levels of nesting in different expressions. It
would improve performance to compute it first then embed its value  into
the query.


Aggregates With an Embedded WHERE Clause
----------------------------------------
A WHERE clause embedded in an aggregate restricts the rows (hence  the
values) being considered by the aggregate.

List the average age of those employees earning more than $40000 :

     RETRIEVE(a=AVG(e.age WHERE e.salary>40000))

NOT : RETRIEVE (a=AVG(e.age)) WHERE e.salary>40000

It is important to understand here why  the  WHERE  clause  outside  the
aggregate  will  not  produce meaningful results. The aggregate  will be
calculated first independently of the WHERE clause and its value will be
listed  if there exists a person earning more than $40000. In the former
(correct) query, the WHERE clause is inside the aggregate so influencing
the aggregate calculation.

What is the average salary of those employees that are  aged  more  than
the average age of all employees?

      RETRIEVE(a=AVG(e.salary WHERE e.age>AVG(e.age)))

In the above case one aggregate is 'nested' within another. Nesting  can
occur  to  any  depth and the first rule of aggregates is that innermost
aggregates are evaluated first.

Introducing a  new column 'sex' into the employee  relation  with  value
'f' or 'm' :

How many male employees are there and how many female employees?

     RETRIEVE (num_male=COUNT(e.name WHERE e.sex='m'),
             num_female=COUNT(e.name WHERE e.sex='f'))

This query will involve a double scan of the  relation  and  is  usually
better posed using a BY clause:

     RETRIEVE (e.sex,c=COUNT(e.name BY e.sex))



Simple Aggregates with Joins
----------------------------
In addition to the employee table, consider the  table  (range  variable
c):

     children(pname,cname,age)

which contains 1 row/child for any employee with a family (assume it  is
not possible that both parents in a family are employees).  Pname refers
to the parent's name in the employee table and cname is the  first  name
of each child. The age of each child is also listed.

How many of the parents in the employee relation have children?

     RETRIEVE (cu=COUNTu(c.pname))

does the trick but doesn't check with the employee relation. To do  this
use:

     RETRIEVE (cu=COUNTu(e.name WHERE e.name=c.pname))

Here the aggregate should be viewed as a temporary retrieval first e.g.

     RETRIEVE into temp(e.name) WHERE e.name=c.pname

then the query becomes

     RETRIEVE (cu=COUNTu(temp.name))

Countu must be used because  the  temporary  relation  does  NOT  remove
duplicates  and so would reference a parent multiple times if the parent
had more than 1 child.

How many parents earning more than the average salary have children aged
under 3 ?

     RETRIEVE(cu=
             COUNTu(e.name WHERE e.name=c.pname AND	 
                                 c.age<3 AND
                                 e.salary>(AVG(e.salary)))

How many employees have no children ?

     RETRIEVE (n=(COUNT(e.name)-COUNTu(e.name WHERE e.name=c.pname)))

However,

Which employees have more than 2 children?

     RETRIEVE (e.name) WHERE COUNT(c.cname WHERE e.name=c.pname)>2

doesn't work because the COUNT will only be executed once and  if  there
is  at least 1 family with more than 2 children then the WHERE clause is
true and consequently all of the employee names will be listed !

To do this type of query one needs the  BY  clause  in  the  query  to
'link' the aggregate to the target list:

     RETRIEVE(e.name) WHERE COUNT(c.cname BY e.name WHERE e.name=c.pname)>2

Remember:
--------

1. In a nested aggregate, the aggregates are evaluated from  the  inner-
most on outwards.

2. Duplicates are not removed if a join is specified in the WHERE clause
within an aggregate.


PARTITIONED or AGGREGATE FUNCTIONS with BY CLAUSES
--------------------------------------------------

More powerful classes of query can be handled by making use of the  BY
clause in an aggregate.  These queries are known as 'partitioned' aggregates. 
These are also referred to in the documentation as 'aggregate functions'.

Partitioned aggregates give one the power to pose very  complex  queries
in a very succinct manner. Two main principles must be kept in mind when
using them: the concept of 'visibility' of the BY clause and the  fact
that  the WHERE clause within the aggregate affects the aggregate cal-
culation, not the partitions. These and other issues will be illustrated
through a series of examples.



The  Aggregate 'Rules'
----------------------
1. In a nested aggregate, the aggregates are evaluated from  the  inner-
most  outward. When aggregates exist at the same level of nesting, it is
the optimizer (JOINOP) and the aggregate optimizer that decide the order
of execution.

2. Duplicates are not removed if a join  is  specified  in  the  WHERE
clause of an aggregate.

3. The variable(s) in the BY clause are visible to the  outer  expres-
sion.   (this  visibility applies to BY clauses involving other parti-
tioned aggregates.)

4. If the BY clause variable(s) are visible to the  outer  expression,
it  is joined back to a 'duplicate-free projection' of the outer expres-
sion.

5. The WHERE clause in a partitioned aggregate restricts the aggregate
calculation,  NOT the partitions. (However 'set aggregate noproject' can
be used to override this)


Using BY clauses.
----------------

The use of BY clauses will be illustrated through  examples.  To  keep
the  size down no data is listed.  Readers are therefore urged to
set up their own relations to test the examples.

The examples continue by making use of the 2 relations explained earlier.

     employees(name,dept,age,salary)         (range variable 'e')
     children (pname,cname,age)              (range variable 'c')


Aggregate Functions (or Partitioned Aggregates)
-----------------------------------------------
These are aggregates that have a BY clause  in  them  and  so  do  not
return  a single value, but a set of values  (like a temporary relation).
Using these requires a little more care than using simple aggregates but
by following a few basic rules, can be quickly mastered.

List the average salary for each dept

     RETRIEVE(a=AVG(e.salary BY e.dept))

This query involves generating a temporary relation that has one row for
each  dept in the 'e' relation, then averages the salary for all rows in
'e' that have that dept, i.e. we are summarizing the data in  e  'parti-
tioned' by each dept. However, as the query stands above, only the aver-
ages will be listed.  Normally, to  obtain  more  meaningful  results, we
include  the  BY  argument  in the target list so as to indicate which
average refers to which dept partition.

     RETRIEVE (e.dept,a=AVG(e.salary BY e.dept))
                 |                       |
                 +------linkage----------+

Here we introduce the concept of  'visibility'  of  the  BY  argument,
visibility  here  meaning that its values can be 'linked' or 'joined' to
the outside of the aggregate.

The important rules to remember here are rules 3 and 4:

The variable(s) in the by clause are visible to the outer expression and
refer to a duplicate-free projection of the outer expression.

This query effectively lists the BY value with each  average.  However
if  more  information is requested in the target list, then the aggregate
temporary relation is effectively 'joined' to a  duplicate-free  projec-
tion of the other target list columns.

     RETRIEVE(e.age,e.dept,a=AVG(e.salary BY e.dept))

will be executed as follows: the temporary aggregate will be calculated.

Also a duplicate free list of (age,dept) pairs will be generated and

i.e. the query could be viewed as the following sequence of operations:

     RETRIEVE into t1 (e.dept,a=AVG(e.salary BY e.dept))
     RETRIEVE into t2 (e.age,e.dept)              /* the 'projection' */
     MODIFY t2 to heapsort                   /* to remove duplicate rows*/

     RETRIEVE (t2.all,t1.a) WHERE t1.dept=t2.dept

If e.dept is left out of the target list:

RETRIEVE(e.age,a=AVG(e.salary BY e.dept))

the result would have the same as above except that the column dept will
be  omitted,  as though the join on dept still took place, but no listing
of the dept appears in the result:

     RETRIEVE (t2.age,t1.a) WHERE t1.dept=t2.dept


Partitioned Aggregates in the WHERE Clause
------------------------------------------
In much the same way as BY clauses are linked within the target  list,
they  can  be  linked  between  the target list and a WHERE clause and
between aggregates within the WHERE clause.

Which depts have an average salary greater than $30000 ?

     RETRIEVE(e.dept) WHERE AVG(e.salary BY e.dept) > 30000
                |                            |
                +--------linkage-------------+

List the average salary for the above depts

     RETRIEVE(e.dept,a=AVG(e.salary BY e.dept))
                      WHERE AVG(e.salary BY e.dept) >30000

One would like to be able to use the WHERE clause '... WHERE a >  30000'
but  local  variables  cannot  be used in QUEL expressions, however, the
aggregate optimizer recognizes that this expression is repeated  at  the
same level of nesting and only evaluates it once.

The above query could be viewed as the following sequence of operations:

     RETRIEVE into t1(e.dept,a=AVG(e.salary BY e.dept))
     RETRIEVE(t1.dept,t1.a) WHERE t1.a > 30000

List the average salary for those people who  work  in  depts  where  at
least  2000  other  employees  work and where the maximum salary is less
than $40000

     RETRIEVE (a=AVG(p.salary BY p.dept)) WHERE COUNT(p.name BY p.dept) > 2000
                             AND     MAX(p.salary BY p.dept) < 40000

List those departments WHERE the salary is the same for all employees

     RETRIEVE(e.dept) WHERE MIN(e.salary BY e.dept)=MAX(e.salary BY e.dept)

This class of query is interesting because  two  partitioned  aggregates
are compared row by row for each dept.


Partitioned Aggregates with Embedded WHERE Clauses
--------------------------------------------------
Consider the following (assuming that dept refers to a numeric code):

List those departments whose code is greater than 50  with  the  average
salary for those depts

     RETRIEVE (e.dept,a=AVG(e.salary BY e.dept WHERE e.dept>50))

This class of query introduces rule 5:

The WHERE clause in the aggregate restricts the  aggregate  calculation,
not the BY clause partitions.

In the above query this rule means that the result will be  printed  out
for  ALL  depts in the 'e' relation, but that the average will be listed
as zero for all except for those depts  with a code greater than 50. l  aggre-
clause. To get around this problem one can use one of 2 approaches:

     a) add an outer WHERE clause:

          RETRIEVE (e.dept,a=AVG(e.salary BY e.dept WHERE e.dept>50))
                  WHERE e.dept>50

     or

     b) use a temporary relation:

          RETRIEVE into temp(e.dept,e.salary) WHERE e.salary>50
          RETRIEVE(temp.age,a=AVG(temp.salary BY temp.age))

The second case is more effective when there is a primary  or  secondary
index  on  columns in the WHERE clause and the WHERE clause is very res-
trictive.

The set command (or use ing_set):

     set aggregate noproject

means the WHERE clause will affect the partitions collected  before  the
aggregate calculate, canceling rule 3.

Consider another case:

List all departments and the average salaries of all  employees  with  a
salary greater than 40000 in those depts

     RETRIEVE (e.dept,a=AVG(e.salary BY e.dept WHERE e.salary>40000))

Here the WHERE clause obviously does not affect the partitions but  will
affect those rows used in the partition.


Multicolumn and Aggregate BY Clauses
------------------------------------
The by clause can contain  multiple  column  references,expressions  and
also aggregates. Consider the following:

List the average salary for each sex in each dept

     RETRIEVE (e.dept,e.sex,a=AVG(e.salary BY e.dept,e.sex))

List the average salary for each dept size

     RETRIEVE(c=count(e.name BY e.dept),
             a=AVG(e.salary BY count(e.name BY e.dept)))

List the average salary for each family size

     RETRIEVE(c=count(c.cname BY e.name WHERE e.name=c.pname),
                     a=AVG(e.salary BY count(c.cname BY e.name
                                             WHERE e.name=c.pname)))

List the average family size

     RETRIEVE(a=AVG(count(c.cname BY e.name WHERE e.name=c.pname))


The ANY Aggregate
-----------------
This is a much more efficient version of count for use in certain situa-
tions

Which employees have children?

     RETRIEVE(e.all) WHERE any(c.cname BY e.name WHERE e.name=c.pname)=1

Which employees do NOT have children

     RETRIEVE(e.all) WHERE any(c.cname BY e.name WHERE e.name=c.pname)=0

Counting stops as soon as the count becomes 1 in each case.


Aggregates and Joins
--------------------
Joins are required in the WHERE clause of aggregates when the  columns
referenced  in  the argument of the aggregate and the columns referenced
in the BY expression come from different relations.

List the average age of children of the employees of each department

     RETRIEVE( e.dept,a=AVG(c.age BY e.dept WHERE e.name=c.pname))

which can be viewed as the following sequence of operations:

     RETRIEVE into temp(c.age,e.dept) WHERE e.name=c.pname
     /* note NO duplicates rows are removed here (rule 2) */
     RETRIEVE (temp.dept,a=AVG(temp.age BY temp.dept))

Here are some other interesting queries involving joins  from  different
relations:

Example 1: given 2 relations, one containing keywords which index  books
(referenced  by  their  isbn#) WHERE any book may be indexed by multiple
keywords (one per row in bookkeys table). The other relation contains  a
list of wanted keys:

bookkeys(isbn,keyword)          /* 1 row/keyword for each book*/
wantedkeys(keyword)             /*a list of keys desired */

Which books are indexed by all the keywords listed in  the  'wantedkeys'
relation ?

     RETRIEVE (b.all)
     WHERE count(w.keyword)=
             count(b.isbn BY b.isbn WHERE b.keyword=w.keyword)

Example 2:  given 3 relations containing information on purchase orders,
the detail lines on the orders and a price catalog

header(po_no,date)                      /* a row for each purchase order */
details(po_no,line_no,item_no,qty)      /* a row for each line on each P.O. */
items(item_no, price,desc)              /* a row for each item in the catalog */

What is total cost of each order?

     RETRIEVE(h.po_no,s=SUM(d.qty*i.price  BY h.po_no
             WHERE  h.po_no=d.po_no AND d.item_no=i.item_no))

Here the aggregate argument 'd.qty*i.price' involves  an  expression  on
values from 2 different relations and the BY clause references a third
relation.

If one wished to list the fractional cost of each item on each  purchase
order  with respect to the total cost of the order, then the query would

need an inner WHERE clause for the aggregate and an outer  WHERE  clause
for the query:

     RETRIEVE (h.po_no,
               i.desc,
               f=(d.qty*i.price)/SUM(d.qty*i.price BY h.po_no
                     WHERE h.po_no=d.po_no AND d.item_no=i.item_no))
      WHERE h.po_no=d.po_no AND d.item_no=i.item_no

The outer WHERE clause would be required  because the inner WHERE clause
is  NOT visible to the outer query, only the by clause is visible.  Con-
sequently, the outer WHERE clause is used to link  the  relations  whose
columns  are  listed in the outer target list, independent of the aggre-
gate.

     'Decomposing' this query makes it easier to visualize:

     RETRIEVE into temp(h.po_no,s=SUM(d.qty*i.price BY h.po_no
                     WHERE h.po_no=d.po_no AND d.item_no=i.item_no))

     RETRIEVE (h.po_no,i.desc,f=(d.qty*i.price)/temp.s)
             WHERE   h.po_no=d.po_no AND 
                     d.item_no=i.item_no AND
                     temp.po_no=h.po_no

>From this example, one can see how powerful  INGRES  aggregates  are  at
handling complex problems in a succinct manner. 


Summary
-------
By following the five basic rules that govern the  execution  of  aggre-
gates,  it is possible to pose and debug quite complex queries. However,
it is always wise to test out the queries on a subset of the data  WHERE
the  results  can  be  manually calculated first in order to verify that
your QUEL query does in fact represent the query to be posed.



Releases affected:     all(all.all) -   Releases not affected: 
Errors:                                                             
Bugs/SIRS:
Ingres Database Reference
To William's Home Page
Email William