Document #: US-13262,EN
------------------------------------------------------------------------------
Major subject: quel Minor subjects: quel_syntax
Keywords: aggregates, quel, tech_notes
Abstract:
Aggregates in QUEL: Equivalent to Release 6 Technical Notes
#22 and #23, or Release 5 Technical Notes #29 and #30.
Expert note:
Aggregates in QUEL
==================
Overview
--------
Aggregates are a powerful construct in INGRES to summarize data. The
purpose of this document is to explain how to use aggregates
effectively through the use of example queries. To restrict the size of
this document, no data is presented. Readers are urged to set up their own
relations to try out the examples and check the results.
Aggregates are of two main types, simple aggregates and partitioned
aggregates (these are referred to as 'aggregate functions' in the
manual). Simple aggregates return a single value which can be listed,
referenced in a where clause or embedded in expressions and treated like
any other numeric value. Partitioned aggregates (those with a BY
clause) return a set of values and most often this set of values is
'joined' back to a base relation by the query referencing the parti-
tioned aggregate.
o Aggregate Syntax
o Aggregates in the WHERE Clause
o Aggregates With an Embedded WHERE Clause
o Simple Aggregates with Joins
o The Aggregate 'Rules'
o Aggregate Functions (or Partitioned Aggregates)
o Partitioned Aggregates in the WHERE Clause
o Partitioned Aggregates with Embedded WHERE Clauses
o Multicolumn and Aggregate BY Clauses
o The ANY Aggregate
o Aggregates and Joins
o Summary
Aggregate Syntax
----------------
An aggregate has the following syntax:
AGG_OP (AGG_EXPR [by BY_EXPR] [where BOOL_EXPR] )
Simple aggregates are those that do not have a BY_EXPR.
e.g. AVG(r.salary WHERE r.age>50)
here the AGG_OP is AVG (the averaging aggregate) the AGG_EXPR is
r.salary and the BOOL_EXPR is r.age>50.
Aggregates can be used in appends, replaces, deletes as well as
retrievals. Also they can be used in view definitions to define 'vir-
tual' columns involving calculations that are done dynamically as the
view is referenced. This note will concentrate on examples using
retrieves.
The AGG_OPs available are
COUNT MIN MAX AVG SUM ANY
Also there are 'prime' AGG_OPs that only consider unique values of the
argument(AGG_EXPR):
COUNTU SUMU AVGU
Note: "COUNT" counts the existence of rows. The column 'counted' may
be of any datatype.
"ANY" is a special case of COUNT that returns the value 1 if a row
exists, or zero if it does not exist. It is a more efficient version of
count since it stops counting as soon as the count becomes 1.
"MIN" and "MAX" work on all datatypes including date and character
strings but in most cases will require a scan of the relation (unless it
is a qualified aggregate).
"AVG" and "SUM" work only on numeric datatypes (integer,float,money)
Consider the following relation (range variable e):
employees(name,dept,age,salary)
which lists employees names with the department they work in, their age
and current salary.
Consider the following sequence of queries:
What is average age of all employees?
RETRIEVE(a=AVG(e.age))
How many years difference between the oldest and youngest employee?
RETRIEVE(diff=(MAX(e.age)-MIN(e.age)) )
The prime aggregates are used to count only unique values of the argu-
ment:
How many departments are there?
RETRIEVE(c=COUNTU(e.dept))
Using count above would simply give the number of employees, not depart-
ments.
Aggregates can also be used in expressions:
What is the highest ratio of salary to age?
RETRIEVE(m=MAX(e.salary/e.age))
Aggregates in the WHERE Clause
------------------------------
Moving onto more involved queries where aggregates are used to restrict
which rows are returned by the query:
Who earns the maximum salary?
RETRIEVE(e.name) WHERE e.salary=MAX(e.salary)
The principal rule here is that the aggregate gets evaluated once, then
is used as a constant in the where clause. Consequently, the relation
would be visited twice to execute this query.
By how much does each employees earn more than the average?
RETRIEVE(e.name,diff=e.salary-AVG(e.salary))
where e.salary>AVG(e.salary)
In the above query, simple aggregates can be used in calculations like
any other expression. Note that since QUEL does not allow 'local' vari-
ables the reference to 'AVG(e.salary)' has to be made twice. However the
aggregate optimizer can recognize this and execute it once only (but
only if they are at the same nesting 'level').
Who earns a median salary (closest to average)?
RETRIEVE(e.name) WHERE abs((e.salary-AVG(e.salary)))
=MIN(e.salary-AVG(e.salary))
The average would need to be evaluated twice in the above query because
it appears at different levels of nesting in different expressions. It
would improve performance to compute it first then embed its value into
the query.
Aggregates With an Embedded WHERE Clause
----------------------------------------
A WHERE clause embedded in an aggregate restricts the rows (hence the
values) being considered by the aggregate.
List the average age of those employees earning more than $40000 :
RETRIEVE(a=AVG(e.age WHERE e.salary>40000))
NOT : RETRIEVE (a=AVG(e.age)) WHERE e.salary>40000
It is important to understand here why the WHERE clause outside the
aggregate will not produce meaningful results. The aggregate will be
calculated first independently of the WHERE clause and its value will be
listed if there exists a person earning more than $40000. In the former
(correct) query, the WHERE clause is inside the aggregate so influencing
the aggregate calculation.
What is the average salary of those employees that are aged more than
the average age of all employees?
RETRIEVE(a=AVG(e.salary WHERE e.age>AVG(e.age)))
In the above case one aggregate is 'nested' within another. Nesting can
occur to any depth and the first rule of aggregates is that innermost
aggregates are evaluated first.
Introducing a new column 'sex' into the employee relation with value
'f' or 'm' :
How many male employees are there and how many female employees?
RETRIEVE (num_male=COUNT(e.name WHERE e.sex='m'),
num_female=COUNT(e.name WHERE e.sex='f'))
This query will involve a double scan of the relation and is usually
better posed using a BY clause:
RETRIEVE (e.sex,c=COUNT(e.name BY e.sex))
Simple Aggregates with Joins
----------------------------
In addition to the employee table, consider the table (range variable
c):
children(pname,cname,age)
which contains 1 row/child for any employee with a family (assume it is
not possible that both parents in a family are employees). Pname refers
to the parent's name in the employee table and cname is the first name
of each child. The age of each child is also listed.
How many of the parents in the employee relation have children?
RETRIEVE (cu=COUNTu(c.pname))
does the trick but doesn't check with the employee relation. To do this
use:
RETRIEVE (cu=COUNTu(e.name WHERE e.name=c.pname))
Here the aggregate should be viewed as a temporary retrieval first e.g.
RETRIEVE into temp(e.name) WHERE e.name=c.pname
then the query becomes
RETRIEVE (cu=COUNTu(temp.name))
Countu must be used because the temporary relation does NOT remove
duplicates and so would reference a parent multiple times if the parent
had more than 1 child.
How many parents earning more than the average salary have children aged
under 3 ?
RETRIEVE(cu=
COUNTu(e.name WHERE e.name=c.pname AND
c.age<3 AND
e.salary>(AVG(e.salary)))
How many employees have no children ?
RETRIEVE (n=(COUNT(e.name)-COUNTu(e.name WHERE e.name=c.pname)))
However,
Which employees have more than 2 children?
RETRIEVE (e.name) WHERE COUNT(c.cname WHERE e.name=c.pname)>2
doesn't work because the COUNT will only be executed once and if there
is at least 1 family with more than 2 children then the WHERE clause is
true and consequently all of the employee names will be listed !
To do this type of query one needs the BY clause in the query to
'link' the aggregate to the target list:
RETRIEVE(e.name) WHERE COUNT(c.cname BY e.name WHERE e.name=c.pname)>2
Remember:
--------
1. In a nested aggregate, the aggregates are evaluated from the inner-
most on outwards.
2. Duplicates are not removed if a join is specified in the WHERE clause
within an aggregate.
PARTITIONED or AGGREGATE FUNCTIONS with BY CLAUSES
--------------------------------------------------
More powerful classes of query can be handled by making use of the BY
clause in an aggregate. These queries are known as 'partitioned' aggregates.
These are also referred to in the documentation as 'aggregate functions'.
Partitioned aggregates give one the power to pose very complex queries
in a very succinct manner. Two main principles must be kept in mind when
using them: the concept of 'visibility' of the BY clause and the fact
that the WHERE clause within the aggregate affects the aggregate cal-
culation, not the partitions. These and other issues will be illustrated
through a series of examples.
The Aggregate 'Rules'
----------------------
1. In a nested aggregate, the aggregates are evaluated from the inner-
most outward. When aggregates exist at the same level of nesting, it is
the optimizer (JOINOP) and the aggregate optimizer that decide the order
of execution.
2. Duplicates are not removed if a join is specified in the WHERE
clause of an aggregate.
3. The variable(s) in the BY clause are visible to the outer expres-
sion. (this visibility applies to BY clauses involving other parti-
tioned aggregates.)
4. If the BY clause variable(s) are visible to the outer expression,
it is joined back to a 'duplicate-free projection' of the outer expres-
sion.
5. The WHERE clause in a partitioned aggregate restricts the aggregate
calculation, NOT the partitions. (However 'set aggregate noproject' can
be used to override this)
Using BY clauses.
----------------
The use of BY clauses will be illustrated through examples. To keep
the size down no data is listed. Readers are therefore urged to
set up their own relations to test the examples.
The examples continue by making use of the 2 relations explained earlier.
employees(name,dept,age,salary) (range variable 'e')
children (pname,cname,age) (range variable 'c')
Aggregate Functions (or Partitioned Aggregates)
-----------------------------------------------
These are aggregates that have a BY clause in them and so do not
return a single value, but a set of values (like a temporary relation).
Using these requires a little more care than using simple aggregates but
by following a few basic rules, can be quickly mastered.
List the average salary for each dept
RETRIEVE(a=AVG(e.salary BY e.dept))
This query involves generating a temporary relation that has one row for
each dept in the 'e' relation, then averages the salary for all rows in
'e' that have that dept, i.e. we are summarizing the data in e 'parti-
tioned' by each dept. However, as the query stands above, only the aver-
ages will be listed. Normally, to obtain more meaningful results, we
include the BY argument in the target list so as to indicate which
average refers to which dept partition.
RETRIEVE (e.dept,a=AVG(e.salary BY e.dept))
| |
+------linkage----------+
Here we introduce the concept of 'visibility' of the BY argument,
visibility here meaning that its values can be 'linked' or 'joined' to
the outside of the aggregate.
The important rules to remember here are rules 3 and 4:
The variable(s) in the by clause are visible to the outer expression and
refer to a duplicate-free projection of the outer expression.
This query effectively lists the BY value with each average. However
if more information is requested in the target list, then the aggregate
temporary relation is effectively 'joined' to a duplicate-free projec-
tion of the other target list columns.
RETRIEVE(e.age,e.dept,a=AVG(e.salary BY e.dept))
will be executed as follows: the temporary aggregate will be calculated.
Also a duplicate free list of (age,dept) pairs will be generated and
i.e. the query could be viewed as the following sequence of operations:
RETRIEVE into t1 (e.dept,a=AVG(e.salary BY e.dept))
RETRIEVE into t2 (e.age,e.dept) /* the 'projection' */
MODIFY t2 to heapsort /* to remove duplicate rows*/
RETRIEVE (t2.all,t1.a) WHERE t1.dept=t2.dept
If e.dept is left out of the target list:
RETRIEVE(e.age,a=AVG(e.salary BY e.dept))
the result would have the same as above except that the column dept will
be omitted, as though the join on dept still took place, but no listing
of the dept appears in the result:
RETRIEVE (t2.age,t1.a) WHERE t1.dept=t2.dept
Partitioned Aggregates in the WHERE Clause
------------------------------------------
In much the same way as BY clauses are linked within the target list,
they can be linked between the target list and a WHERE clause and
between aggregates within the WHERE clause.
Which depts have an average salary greater than $30000 ?
RETRIEVE(e.dept) WHERE AVG(e.salary BY e.dept) > 30000
| |
+--------linkage-------------+
List the average salary for the above depts
RETRIEVE(e.dept,a=AVG(e.salary BY e.dept))
WHERE AVG(e.salary BY e.dept) >30000
One would like to be able to use the WHERE clause '... WHERE a > 30000'
but local variables cannot be used in QUEL expressions, however, the
aggregate optimizer recognizes that this expression is repeated at the
same level of nesting and only evaluates it once.
The above query could be viewed as the following sequence of operations:
RETRIEVE into t1(e.dept,a=AVG(e.salary BY e.dept))
RETRIEVE(t1.dept,t1.a) WHERE t1.a > 30000
List the average salary for those people who work in depts where at
least 2000 other employees work and where the maximum salary is less
than $40000
RETRIEVE (a=AVG(p.salary BY p.dept)) WHERE COUNT(p.name BY p.dept) > 2000
AND MAX(p.salary BY p.dept) < 40000
List those departments WHERE the salary is the same for all employees
RETRIEVE(e.dept) WHERE MIN(e.salary BY e.dept)=MAX(e.salary BY e.dept)
This class of query is interesting because two partitioned aggregates
are compared row by row for each dept.
Partitioned Aggregates with Embedded WHERE Clauses
--------------------------------------------------
Consider the following (assuming that dept refers to a numeric code):
List those departments whose code is greater than 50 with the average
salary for those depts
RETRIEVE (e.dept,a=AVG(e.salary BY e.dept WHERE e.dept>50))
This class of query introduces rule 5:
The WHERE clause in the aggregate restricts the aggregate calculation,
not the BY clause partitions.
In the above query this rule means that the result will be printed out
for ALL depts in the 'e' relation, but that the average will be listed
as zero for all except for those depts with a code greater than 50. l aggre-
clause. To get around this problem one can use one of 2 approaches:
a) add an outer WHERE clause:
RETRIEVE (e.dept,a=AVG(e.salary BY e.dept WHERE e.dept>50))
WHERE e.dept>50
or
b) use a temporary relation:
RETRIEVE into temp(e.dept,e.salary) WHERE e.salary>50
RETRIEVE(temp.age,a=AVG(temp.salary BY temp.age))
The second case is more effective when there is a primary or secondary
index on columns in the WHERE clause and the WHERE clause is very res-
trictive.
The set command (or use ing_set):
set aggregate noproject
means the WHERE clause will affect the partitions collected before the
aggregate calculate, canceling rule 3.
Consider another case:
List all departments and the average salaries of all employees with a
salary greater than 40000 in those depts
RETRIEVE (e.dept,a=AVG(e.salary BY e.dept WHERE e.salary>40000))
Here the WHERE clause obviously does not affect the partitions but will
affect those rows used in the partition.
Multicolumn and Aggregate BY Clauses
------------------------------------
The by clause can contain multiple column references,expressions and
also aggregates. Consider the following:
List the average salary for each sex in each dept
RETRIEVE (e.dept,e.sex,a=AVG(e.salary BY e.dept,e.sex))
List the average salary for each dept size
RETRIEVE(c=count(e.name BY e.dept),
a=AVG(e.salary BY count(e.name BY e.dept)))
List the average salary for each family size
RETRIEVE(c=count(c.cname BY e.name WHERE e.name=c.pname),
a=AVG(e.salary BY count(c.cname BY e.name
WHERE e.name=c.pname)))
List the average family size
RETRIEVE(a=AVG(count(c.cname BY e.name WHERE e.name=c.pname))
The ANY Aggregate
-----------------
This is a much more efficient version of count for use in certain situa-
tions
Which employees have children?
RETRIEVE(e.all) WHERE any(c.cname BY e.name WHERE e.name=c.pname)=1
Which employees do NOT have children
RETRIEVE(e.all) WHERE any(c.cname BY e.name WHERE e.name=c.pname)=0
Counting stops as soon as the count becomes 1 in each case.
Aggregates and Joins
--------------------
Joins are required in the WHERE clause of aggregates when the columns
referenced in the argument of the aggregate and the columns referenced
in the BY expression come from different relations.
List the average age of children of the employees of each department
RETRIEVE( e.dept,a=AVG(c.age BY e.dept WHERE e.name=c.pname))
which can be viewed as the following sequence of operations:
RETRIEVE into temp(c.age,e.dept) WHERE e.name=c.pname
/* note NO duplicates rows are removed here (rule 2) */
RETRIEVE (temp.dept,a=AVG(temp.age BY temp.dept))
Here are some other interesting queries involving joins from different
relations:
Example 1: given 2 relations, one containing keywords which index books
(referenced by their isbn#) WHERE any book may be indexed by multiple
keywords (one per row in bookkeys table). The other relation contains a
list of wanted keys:
bookkeys(isbn,keyword) /* 1 row/keyword for each book*/
wantedkeys(keyword) /*a list of keys desired */
Which books are indexed by all the keywords listed in the 'wantedkeys'
relation ?
RETRIEVE (b.all)
WHERE count(w.keyword)=
count(b.isbn BY b.isbn WHERE b.keyword=w.keyword)
Example 2: given 3 relations containing information on purchase orders,
the detail lines on the orders and a price catalog
header(po_no,date) /* a row for each purchase order */
details(po_no,line_no,item_no,qty) /* a row for each line on each P.O. */
items(item_no, price,desc) /* a row for each item in the catalog */
What is total cost of each order?
RETRIEVE(h.po_no,s=SUM(d.qty*i.price BY h.po_no
WHERE h.po_no=d.po_no AND d.item_no=i.item_no))
Here the aggregate argument 'd.qty*i.price' involves an expression on
values from 2 different relations and the BY clause references a third
relation.
If one wished to list the fractional cost of each item on each purchase
order with respect to the total cost of the order, then the query would
need an inner WHERE clause for the aggregate and an outer WHERE clause
for the query:
RETRIEVE (h.po_no,
i.desc,
f=(d.qty*i.price)/SUM(d.qty*i.price BY h.po_no
WHERE h.po_no=d.po_no AND d.item_no=i.item_no))
WHERE h.po_no=d.po_no AND d.item_no=i.item_no
The outer WHERE clause would be required because the inner WHERE clause
is NOT visible to the outer query, only the by clause is visible. Con-
sequently, the outer WHERE clause is used to link the relations whose
columns are listed in the outer target list, independent of the aggre-
gate.
'Decomposing' this query makes it easier to visualize:
RETRIEVE into temp(h.po_no,s=SUM(d.qty*i.price BY h.po_no
WHERE h.po_no=d.po_no AND d.item_no=i.item_no))
RETRIEVE (h.po_no,i.desc,f=(d.qty*i.price)/temp.s)
WHERE h.po_no=d.po_no AND
d.item_no=i.item_no AND
temp.po_no=h.po_no
>From this example, one can see how powerful INGRES aggregates are at
handling complex problems in a succinct manner.
Summary
-------
By following the five basic rules that govern the execution of aggre-
gates, it is possible to pose and debug quite complex queries. However,
it is always wise to test out the queries on a subset of the data WHERE
the results can be manually calculated first in order to verify that
your QUEL query does in fact represent the query to be posed.
Releases affected: all(all.all) - Releases not affected:
Errors:
Bugs/SIRS:
Ingres Database Reference
To William's Home Page
© William Yuan 2000
Email William