Document #: US-13262,EN ------------------------------------------------------------------------------ Major subject: quel Minor subjects: quel_syntax Keywords: aggregates, quel, tech_notes Abstract: Aggregates in QUEL: Equivalent to Release 6 Technical Notes #22 and #23, or Release 5 Technical Notes #29 and #30. Expert note: Aggregates in QUEL ================== Overview -------- Aggregates are a powerful construct in INGRES to summarize data. The purpose of this document is to explain how to use aggregates effectively through the use of example queries. To restrict the size of this document, no data is presented. Readers are urged to set up their own relations to try out the examples and check the results. Aggregates are of two main types, simple aggregates and partitioned aggregates (these are referred to as 'aggregate functions' in the manual). Simple aggregates return a single value which can be listed, referenced in a where clause or embedded in expressions and treated like any other numeric value. Partitioned aggregates (those with a BY clause) return a set of values and most often this set of values is 'joined' back to a base relation by the query referencing the parti- tioned aggregate. o Aggregate Syntax o Aggregates in the WHERE Clause o Aggregates With an Embedded WHERE Clause o Simple Aggregates with Joins o The Aggregate 'Rules' o Aggregate Functions (or Partitioned Aggregates) o Partitioned Aggregates in the WHERE Clause o Partitioned Aggregates with Embedded WHERE Clauses o Multicolumn and Aggregate BY Clauses o The ANY Aggregate o Aggregates and Joins o Summary Aggregate Syntax ---------------- An aggregate has the following syntax: AGG_OP (AGG_EXPR [by BY_EXPR] [where BOOL_EXPR] ) Simple aggregates are those that do not have a BY_EXPR. e.g. AVG(r.salary WHERE r.age>50) here the AGG_OP is AVG (the averaging aggregate) the AGG_EXPR is r.salary and the BOOL_EXPR is r.age>50. Aggregates can be used in appends, replaces, deletes as well as retrievals. Also they can be used in view definitions to define 'vir- tual' columns involving calculations that are done dynamically as the view is referenced. This note will concentrate on examples using retrieves. The AGG_OPs available are COUNT MIN MAX AVG SUM ANY Also there are 'prime' AGG_OPs that only consider unique values of the argument(AGG_EXPR): COUNTU SUMU AVGU Note: "COUNT" counts the existence of rows. The column 'counted' may be of any datatype. "ANY" is a special case of COUNT that returns the value 1 if a row exists, or zero if it does not exist. It is a more efficient version of count since it stops counting as soon as the count becomes 1. "MIN" and "MAX" work on all datatypes including date and character strings but in most cases will require a scan of the relation (unless it is a qualified aggregate). "AVG" and "SUM" work only on numeric datatypes (integer,float,money) Consider the following relation (range variable e): employees(name,dept,age,salary) which lists employees names with the department they work in, their age and current salary. Consider the following sequence of queries: What is average age of all employees? RETRIEVE(a=AVG(e.age)) How many years difference between the oldest and youngest employee? RETRIEVE(diff=(MAX(e.age)-MIN(e.age)) ) The prime aggregates are used to count only unique values of the argu- ment: How many departments are there? RETRIEVE(c=COUNTU(e.dept)) Using count above would simply give the number of employees, not depart- ments. Aggregates can also be used in expressions: What is the highest ratio of salary to age? RETRIEVE(m=MAX(e.salary/e.age)) Aggregates in the WHERE Clause ------------------------------ Moving onto more involved queries where aggregates are used to restrict which rows are returned by the query: Who earns the maximum salary? RETRIEVE(e.name) WHERE e.salary=MAX(e.salary) The principal rule here is that the aggregate gets evaluated once, then is used as a constant in the where clause. Consequently, the relation would be visited twice to execute this query. By how much does each employees earn more than the average? RETRIEVE(e.name,diff=e.salary-AVG(e.salary)) where e.salary>AVG(e.salary) In the above query, simple aggregates can be used in calculations like any other expression. Note that since QUEL does not allow 'local' vari- ables the reference to 'AVG(e.salary)' has to be made twice. However the aggregate optimizer can recognize this and execute it once only (but only if they are at the same nesting 'level'). Who earns a median salary (closest to average)? RETRIEVE(e.name) WHERE abs((e.salary-AVG(e.salary))) =MIN(e.salary-AVG(e.salary)) The average would need to be evaluated twice in the above query because it appears at different levels of nesting in different expressions. It would improve performance to compute it first then embed its value into the query. Aggregates With an Embedded WHERE Clause ---------------------------------------- A WHERE clause embedded in an aggregate restricts the rows (hence the values) being considered by the aggregate. List the average age of those employees earning more than $40000 : RETRIEVE(a=AVG(e.age WHERE e.salary>40000)) NOT : RETRIEVE (a=AVG(e.age)) WHERE e.salary>40000 It is important to understand here why the WHERE clause outside the aggregate will not produce meaningful results. The aggregate will be calculated first independently of the WHERE clause and its value will be listed if there exists a person earning more than $40000. In the former (correct) query, the WHERE clause is inside the aggregate so influencing the aggregate calculation. What is the average salary of those employees that are aged more than the average age of all employees? RETRIEVE(a=AVG(e.salary WHERE e.age>AVG(e.age))) In the above case one aggregate is 'nested' within another. Nesting can occur to any depth and the first rule of aggregates is that innermost aggregates are evaluated first. Introducing a new column 'sex' into the employee relation with value 'f' or 'm' : How many male employees are there and how many female employees? RETRIEVE (num_male=COUNT(e.name WHERE e.sex='m'), num_female=COUNT(e.name WHERE e.sex='f')) This query will involve a double scan of the relation and is usually better posed using a BY clause: RETRIEVE (e.sex,c=COUNT(e.name BY e.sex)) Simple Aggregates with Joins ---------------------------- In addition to the employee table, consider the table (range variable c): children(pname,cname,age) which contains 1 row/child for any employee with a family (assume it is not possible that both parents in a family are employees). Pname refers to the parent's name in the employee table and cname is the first name of each child. The age of each child is also listed. How many of the parents in the employee relation have children? RETRIEVE (cu=COUNTu(c.pname)) does the trick but doesn't check with the employee relation. To do this use: RETRIEVE (cu=COUNTu(e.name WHERE e.name=c.pname)) Here the aggregate should be viewed as a temporary retrieval first e.g. RETRIEVE into temp(e.name) WHERE e.name=c.pname then the query becomes RETRIEVE (cu=COUNTu(temp.name)) Countu must be used because the temporary relation does NOT remove duplicates and so would reference a parent multiple times if the parent had more than 1 child. How many parents earning more than the average salary have children aged under 3 ? RETRIEVE(cu= COUNTu(e.name WHERE e.name=c.pname AND c.age<3 AND e.salary>(AVG(e.salary))) How many employees have no children ? RETRIEVE (n=(COUNT(e.name)-COUNTu(e.name WHERE e.name=c.pname))) However, Which employees have more than 2 children? RETRIEVE (e.name) WHERE COUNT(c.cname WHERE e.name=c.pname)>2 doesn't work because the COUNT will only be executed once and if there is at least 1 family with more than 2 children then the WHERE clause is true and consequently all of the employee names will be listed ! To do this type of query one needs the BY clause in the query to 'link' the aggregate to the target list: RETRIEVE(e.name) WHERE COUNT(c.cname BY e.name WHERE e.name=c.pname)>2 Remember: -------- 1. In a nested aggregate, the aggregates are evaluated from the inner- most on outwards. 2. Duplicates are not removed if a join is specified in the WHERE clause within an aggregate. PARTITIONED or AGGREGATE FUNCTIONS with BY CLAUSES -------------------------------------------------- More powerful classes of query can be handled by making use of the BY clause in an aggregate. These queries are known as 'partitioned' aggregates. These are also referred to in the documentation as 'aggregate functions'. Partitioned aggregates give one the power to pose very complex queries in a very succinct manner. Two main principles must be kept in mind when using them: the concept of 'visibility' of the BY clause and the fact that the WHERE clause within the aggregate affects the aggregate cal- culation, not the partitions. These and other issues will be illustrated through a series of examples. The Aggregate 'Rules' ---------------------- 1. In a nested aggregate, the aggregates are evaluated from the inner- most outward. When aggregates exist at the same level of nesting, it is the optimizer (JOINOP) and the aggregate optimizer that decide the order of execution. 2. Duplicates are not removed if a join is specified in the WHERE clause of an aggregate. 3. The variable(s) in the BY clause are visible to the outer expres- sion. (this visibility applies to BY clauses involving other parti- tioned aggregates.) 4. If the BY clause variable(s) are visible to the outer expression, it is joined back to a 'duplicate-free projection' of the outer expres- sion. 5. The WHERE clause in a partitioned aggregate restricts the aggregate calculation, NOT the partitions. (However 'set aggregate noproject' can be used to override this) Using BY clauses. ---------------- The use of BY clauses will be illustrated through examples. To keep the size down no data is listed. Readers are therefore urged to set up their own relations to test the examples. The examples continue by making use of the 2 relations explained earlier. employees(name,dept,age,salary) (range variable 'e') children (pname,cname,age) (range variable 'c') Aggregate Functions (or Partitioned Aggregates) ----------------------------------------------- These are aggregates that have a BY clause in them and so do not return a single value, but a set of values (like a temporary relation). Using these requires a little more care than using simple aggregates but by following a few basic rules, can be quickly mastered. List the average salary for each dept RETRIEVE(a=AVG(e.salary BY e.dept)) This query involves generating a temporary relation that has one row for each dept in the 'e' relation, then averages the salary for all rows in 'e' that have that dept, i.e. we are summarizing the data in e 'parti- tioned' by each dept. However, as the query stands above, only the aver- ages will be listed. Normally, to obtain more meaningful results, we include the BY argument in the target list so as to indicate which average refers to which dept partition. RETRIEVE (e.dept,a=AVG(e.salary BY e.dept)) | | +------linkage----------+ Here we introduce the concept of 'visibility' of the BY argument, visibility here meaning that its values can be 'linked' or 'joined' to the outside of the aggregate. The important rules to remember here are rules 3 and 4: The variable(s) in the by clause are visible to the outer expression and refer to a duplicate-free projection of the outer expression. This query effectively lists the BY value with each average. However if more information is requested in the target list, then the aggregate temporary relation is effectively 'joined' to a duplicate-free projec- tion of the other target list columns. RETRIEVE(e.age,e.dept,a=AVG(e.salary BY e.dept)) will be executed as follows: the temporary aggregate will be calculated. Also a duplicate free list of (age,dept) pairs will be generated and i.e. the query could be viewed as the following sequence of operations: RETRIEVE into t1 (e.dept,a=AVG(e.salary BY e.dept)) RETRIEVE into t2 (e.age,e.dept) /* the 'projection' */ MODIFY t2 to heapsort /* to remove duplicate rows*/ RETRIEVE (t2.all,t1.a) WHERE t1.dept=t2.dept If e.dept is left out of the target list: RETRIEVE(e.age,a=AVG(e.salary BY e.dept)) the result would have the same as above except that the column dept will be omitted, as though the join on dept still took place, but no listing of the dept appears in the result: RETRIEVE (t2.age,t1.a) WHERE t1.dept=t2.dept Partitioned Aggregates in the WHERE Clause ------------------------------------------ In much the same way as BY clauses are linked within the target list, they can be linked between the target list and a WHERE clause and between aggregates within the WHERE clause. Which depts have an average salary greater than $30000 ? RETRIEVE(e.dept) WHERE AVG(e.salary BY e.dept) > 30000 | | +--------linkage-------------+ List the average salary for the above depts RETRIEVE(e.dept,a=AVG(e.salary BY e.dept)) WHERE AVG(e.salary BY e.dept) >30000 One would like to be able to use the WHERE clause '... WHERE a > 30000' but local variables cannot be used in QUEL expressions, however, the aggregate optimizer recognizes that this expression is repeated at the same level of nesting and only evaluates it once. The above query could be viewed as the following sequence of operations: RETRIEVE into t1(e.dept,a=AVG(e.salary BY e.dept)) RETRIEVE(t1.dept,t1.a) WHERE t1.a > 30000 List the average salary for those people who work in depts where at least 2000 other employees work and where the maximum salary is less than $40000 RETRIEVE (a=AVG(p.salary BY p.dept)) WHERE COUNT(p.name BY p.dept) > 2000 AND MAX(p.salary BY p.dept) < 40000 List those departments WHERE the salary is the same for all employees RETRIEVE(e.dept) WHERE MIN(e.salary BY e.dept)=MAX(e.salary BY e.dept) This class of query is interesting because two partitioned aggregates are compared row by row for each dept. Partitioned Aggregates with Embedded WHERE Clauses -------------------------------------------------- Consider the following (assuming that dept refers to a numeric code): List those departments whose code is greater than 50 with the average salary for those depts RETRIEVE (e.dept,a=AVG(e.salary BY e.dept WHERE e.dept>50)) This class of query introduces rule 5: The WHERE clause in the aggregate restricts the aggregate calculation, not the BY clause partitions. In the above query this rule means that the result will be printed out for ALL depts in the 'e' relation, but that the average will be listed as zero for all except for those depts with a code greater than 50. l aggre- clause. To get around this problem one can use one of 2 approaches: a) add an outer WHERE clause: RETRIEVE (e.dept,a=AVG(e.salary BY e.dept WHERE e.dept>50)) WHERE e.dept>50 or b) use a temporary relation: RETRIEVE into temp(e.dept,e.salary) WHERE e.salary>50 RETRIEVE(temp.age,a=AVG(temp.salary BY temp.age)) The second case is more effective when there is a primary or secondary index on columns in the WHERE clause and the WHERE clause is very res- trictive. The set command (or use ing_set): set aggregate noproject means the WHERE clause will affect the partitions collected before the aggregate calculate, canceling rule 3. Consider another case: List all departments and the average salaries of all employees with a salary greater than 40000 in those depts RETRIEVE (e.dept,a=AVG(e.salary BY e.dept WHERE e.salary>40000)) Here the WHERE clause obviously does not affect the partitions but will affect those rows used in the partition. Multicolumn and Aggregate BY Clauses ------------------------------------ The by clause can contain multiple column references,expressions and also aggregates. Consider the following: List the average salary for each sex in each dept RETRIEVE (e.dept,e.sex,a=AVG(e.salary BY e.dept,e.sex)) List the average salary for each dept size RETRIEVE(c=count(e.name BY e.dept), a=AVG(e.salary BY count(e.name BY e.dept))) List the average salary for each family size RETRIEVE(c=count(c.cname BY e.name WHERE e.name=c.pname), a=AVG(e.salary BY count(c.cname BY e.name WHERE e.name=c.pname))) List the average family size RETRIEVE(a=AVG(count(c.cname BY e.name WHERE e.name=c.pname)) The ANY Aggregate ----------------- This is a much more efficient version of count for use in certain situa- tions Which employees have children? RETRIEVE(e.all) WHERE any(c.cname BY e.name WHERE e.name=c.pname)=1 Which employees do NOT have children RETRIEVE(e.all) WHERE any(c.cname BY e.name WHERE e.name=c.pname)=0 Counting stops as soon as the count becomes 1 in each case. Aggregates and Joins -------------------- Joins are required in the WHERE clause of aggregates when the columns referenced in the argument of the aggregate and the columns referenced in the BY expression come from different relations. List the average age of children of the employees of each department RETRIEVE( e.dept,a=AVG(c.age BY e.dept WHERE e.name=c.pname)) which can be viewed as the following sequence of operations: RETRIEVE into temp(c.age,e.dept) WHERE e.name=c.pname /* note NO duplicates rows are removed here (rule 2) */ RETRIEVE (temp.dept,a=AVG(temp.age BY temp.dept)) Here are some other interesting queries involving joins from different relations: Example 1: given 2 relations, one containing keywords which index books (referenced by their isbn#) WHERE any book may be indexed by multiple keywords (one per row in bookkeys table). The other relation contains a list of wanted keys: bookkeys(isbn,keyword) /* 1 row/keyword for each book*/ wantedkeys(keyword) /*a list of keys desired */ Which books are indexed by all the keywords listed in the 'wantedkeys' relation ? RETRIEVE (b.all) WHERE count(w.keyword)= count(b.isbn BY b.isbn WHERE b.keyword=w.keyword) Example 2: given 3 relations containing information on purchase orders, the detail lines on the orders and a price catalog header(po_no,date) /* a row for each purchase order */ details(po_no,line_no,item_no,qty) /* a row for each line on each P.O. */ items(item_no, price,desc) /* a row for each item in the catalog */ What is total cost of each order? RETRIEVE(h.po_no,s=SUM(d.qty*i.price BY h.po_no WHERE h.po_no=d.po_no AND d.item_no=i.item_no)) Here the aggregate argument 'd.qty*i.price' involves an expression on values from 2 different relations and the BY clause references a third relation. If one wished to list the fractional cost of each item on each purchase order with respect to the total cost of the order, then the query would need an inner WHERE clause for the aggregate and an outer WHERE clause for the query: RETRIEVE (h.po_no, i.desc, f=(d.qty*i.price)/SUM(d.qty*i.price BY h.po_no WHERE h.po_no=d.po_no AND d.item_no=i.item_no)) WHERE h.po_no=d.po_no AND d.item_no=i.item_no The outer WHERE clause would be required because the inner WHERE clause is NOT visible to the outer query, only the by clause is visible. Con- sequently, the outer WHERE clause is used to link the relations whose columns are listed in the outer target list, independent of the aggre- gate. 'Decomposing' this query makes it easier to visualize: RETRIEVE into temp(h.po_no,s=SUM(d.qty*i.price BY h.po_no WHERE h.po_no=d.po_no AND d.item_no=i.item_no)) RETRIEVE (h.po_no,i.desc,f=(d.qty*i.price)/temp.s) WHERE h.po_no=d.po_no AND d.item_no=i.item_no AND temp.po_no=h.po_no >From this example, one can see how powerful INGRES aggregates are at handling complex problems in a succinct manner. Summary ------- By following the five basic rules that govern the execution of aggre- gates, it is possible to pose and debug quite complex queries. However, it is always wise to test out the queries on a subset of the data WHERE the results can be manually calculated first in order to verify that your QUEL query does in fact represent the query to be posed. Releases affected: all(all.all) - Releases not affected: Errors: Bugs/SIRS:
© William Yuan 2000
Email William