Remember SELECT FROM WHERE GROUP BY HAVING
What is the total revenue of the offices that made at least 3 sales?
SELECT office, SUM(price*qty)
FROM salesperson, sale, product
WHERE salesperson.salpers_id = sale.salpers_id and sale.prod_id = product.prod_id
GROUP BY office
HAVING count(*) >= 3;
SELECT $S$
FROM $R_1,\ldots,R_n$
WHERE $C_1$
GROUP BY $a_1,\ldots,a_k$
HAVING $C_2$
- Compute From-WHERE part, obtain table with all attributes in $R_1,\ldots,R_n$
- Group by the attributes $a_1,\ldots,a_k$
- Compute the aggregates $C_2$ and keep only groups satisfying $C_2$.
- Compute aggregates in $S$ and return the result.
User Defined Functions (UDFs)
You can define your own UDF because some functions are not available.
Median, NTile (e.g., quartile, quintile, decile)
Why?
Because this takes too much memory.
Three categories of functions:
- Distributive (sum, count, min, max) <- these functions can be distributed easily/trivially,
- Algebratic (avg, stdev) <- these functions are simple arithmetic on distributive functions
- Holistic (median, mode, ntile) <- these functions require the viewing of the whole relation/set
Each database system has its own ways of dealing with this, one way for median is to order and select the middle number via LIMIT.
LIMIT/TOP
Say we have a million rows in a table, but we only want the top 5 results
LIMIT/TOP is the keyword to use:
Names of 5 salespersons please.
SELECT TOP 5 salpers_name
FROM Salesperson;
^ This is oracle, TOP doesn’t work in MySQL, instead we use LIMIT
SELECT salpers_name
FROM salesperson
LIMIT 5
But, which 5? Without any ordering tuples are returns how they’re stored.
ORDER BY
So we need to order the items based on some attribute
Top 2 salespersons who have made the most sales:
SELECT salpers_name, count(*) as cnt
FROM salesperson, sale
WHERE salesperson.salpers_id = sale.salpers_id
GROUP BY salpers_name
ORDER BY cnt DESC (ASC for ascending, which is also the default)
LIMIT 2; (LIMIT X,Y to skip X and print Y tuples)