So far we can:
- Design an ER Diagram
- Translate it to a relational DB $R$
- Implement $R$ in SQL
- Specify constraints over $R$
However $R$ might not be well designed. And this will cause problems. We call these anomalies.
Question? Is this a good design?
Address | Name | Children |
---|---|---|
10 Green | Bob | Caleb |
10 Green | Bob | Rose |
14 Purple | Alice | Gavin |
14 Purple | Alice | Sophia |
Why or why not?
Potential Problems:
- Redundancy
- Update Anomalies
- Deletion Anomalies
How do we recognize a good design? What does good mean?
Normal Forms
There are several important normal forms: 1NF, 2NF, 3NF, BCNF, 4NF etc.
If $R^*$ is in one of these forms, then $R^*$ is guaranteed to achieve certain good properties, e.g., if $R^*$ is in Boyce Codd NF then it is guaranteed to not have certain type of redundancy.
There exist algorithms to transform $R$ into $R^*$, where $R^*$ is in one of the normal forms.
Each type of normal form is different, there are trade offs. NFs are all based on the idea of functional dependencies.
Functional Dependencies and Keys
Let’s look at an example: Is this a good design?
Address | Name | Children |
---|---|---|
10 Green | Bob | Caleb |
10 Green | Bob | Rose |
14 Purple | Alice | Gavin |
14 Purple | Alice | Sophia |
How can we make it better?
Name | Children |
---|---|
Bob | Caleb |
Bob | Rose |
Alice | Gavin |
Alice | Sophia |
Name | Address |
---|---|
Bob | 10 Green |
Alice | 14 Purple |
Why is this better?
Functional dependencies
A form of constraint (part pf the schema), finding them is part of the database design.
Definition: If two tuples agree on the attributes $a_1,a_2,a_3,\ldots,a_n$, then they must also agree on the attributes $b_1,b_2,b_3,\ldots,b_m$.
Formally: $a_1,a_2,a_3,\ldots,a_n \rightarrow b_1,b_2,b_3,\ldots,b_m$ pronounced $X$ determines $Y$.
Let’s try an example:
SALPERS_ID | SALPERS_NAME | MANAGER_ID | OFFICE | COMM |
---|---|---|---|---|
10 | Rodney Jones | 27 | Chicago | 10 |
14 | Masaji Matsu | 44 | Tokyo | 11 |
23 | Francois Moire | 35 | Brussels | 9 |
37 | Elena Hermana | 12 | B.A. | 13 |
39 | Goro Azuma | 44 | Tokyo | 10 |
27 | Terry Cardon | Chicago | 15 | |
44 | Albert Ige | 27 | Tokyo | 12 |
35 | Brigit Bovary | 27 | Brussels | 11 |
12 | Buster Sanchez | 27 | B.A. | 10 |
What FDs can we find?
Salpers_ID $\rightarrow$ Name, Manager_ID, Office, Comm
Name $\rightarrow$ Salpers_ID
Office $\xrightarrow{?}$ Comm
In general, we check if $X\rightarrow Y$ is violated by erasing all other columns and checking many to one, called functional in mathematics.
Name $\rightarrow$ Salpers_ID. We say that name determines salpers_ID in this case. If we know the name, then we can known the ID.
Salpers_ID $\rightarrow$ Name is likewise true in this case.
Consider the following instance of $R(a,b,c,d,e)$. What FDs can you observe in $R$?
a | b | c | d | e |
---|---|---|---|---|
1 | 1 | 1 | 1 | 1 |
2 | 1 | 2 | 2 | 1 |
3 | 2 | 1 | 1 | 1 |
4 | 2 | 2 | 2 | 1 |
5 | 3 | 3 | 1 | 1 |
$a$ is unique: $a\rightarrow b, a\rightarrow c, a\rightarrow d, a\rightarrow e \equiv a\rightarrow a,b,c,d$
$e$ are all the same: $a\rightarrow e, b\rightarrow e, c\rightarrow e, d\rightarrow e$
Others:
Combinations of $b,c$ are unique: $b,c\rightarrow a,d,e$
Combinations of $b,d$ are unique: $b,d \rightarrow a,c,e$
if $c$ values match, so do $d$ values: $c\rightarrow d$
$d$ values do not determine $c$: $d\nrightarrow c$
lots of others… $a,e\rightarrow b,c$, etc.
Reasoning with FDs
Relation Keys
Now that we know FDs we can define keys mathematically.
Key of a relation $R$ is a set of attributes that:
- Functionally determines all attributes of $R$
- None of its subsets determines all attributes of $R$.
Superkey
- A set of attributes that contains a key.
A key is the minimal set.
We need to make sure our keys are well defined.
Closure of FD Sets
Given a relation schema $R$ and a set of FDs $S$, we ask:
Is FD $f$ logically implied by $S$?
Example:
$R = \{a,b,c,g,h,i\}$
$S=\{a\rightarrow b, a\rightarrow c, c,g\rightarrow h, c,g\rightarrow i, b\rightarrow h\}$
Does $a\rightarrow h$ follow? Yes, and we can prove this.
Closure of $S : S^+$ contains all of the FDs that follow. We use Armstrongs axioms to find these.
Armstrongs Axioms
wlog: $X=\{x_1, x_2,\ldots,x_n\}$
Reflexivity:
$X\rightarrow $ some subset of $X$
Augmentation
if $X\rightarrow Y$, then $X,Z\rightarrow Y,Z$
Transitivity
if $X\rightarrow Y\wedge Y\rightarrow Z$, then $X\rightarrow Z$
Union
if $X\rightarrow Y\wedge X\rightarrow Z$, then $X\rightarrow Y,Z$
Decomposition
if $X\rightarrow Y,Z$, then $X\rightarrow Y \wedge X\rightarrow Z$
Pseudo-Transitivity
if $X\rightarrow Y\wedge YZ\rightarrow U$, then $XZ\rightarrow U$
Example:
$R(a,b,c), S=\{a\rightarrow b, c\rightarrow a\}$
What is the FD closure $S^+$?
- $c\rightarrow b$ (transitivity)
- $c,a \rightarrow c,a,b$ (augmentation on 1)
- $b,c\rightarrow a,b,c$ (augmentation on $c\rightarrow a$)
- $a,b,c \rightarrow a,b,c$ (reflexivity)
Say $R(a,b,c,d), S=\{a,b\rightarrow c, b,c\rightarrow a,d\}$. Does $a,c\rightarrow d$ follow from $S$?
Solve by checking $d\in \{a,c\}^+$
In this case, $\{a,c\}^+ = \{a,c\}$, and since $d \notin \{a,c\}^+ $, then $a,c\rightarrow d$ does not follow from $S$
Closure of Attribute Sets
Given a set of attributes $A={a_1,a_2,\ldots,a_n}$ and a set of dependencies $S$. We want to find all the attributes $B$ such that any relation which satisfies $S$ also satisfies $\{a_1,\ldots,a_n\}\rightarrow B$.
This closure is the set of all such attributes $B$ and is denoted $\{a_1,a_2,\ldots,a_n\}^+$. We use this to test if $A$ is a superkey.
We first compute $X^+$ and check of $X^+$ contains all attributes of $R$. We can also check if $X\rightarrow Y$ holds by checking if $Y\in X^+$
Example: finding keys.
$R(abc), S=\{a\rightarrow b, c\rightarrow a\}$
Consider all attribute subsets $X$
Compute the closure of each subset $X^+$
Determine if the subset X is a superkey or key
$X$ | a | b | c | a,b | a,c | a,c | a,b,c |
---|---|---|---|---|---|---|---|
$X^+$ | a,b | b | a,b,c | a,b | a,b,c | a,b,c | a,b,c |
– | – | key | – | superkey | superkey | superkey |
Another Example:
$R(abcd),S=\{a \rightarrow b,d \rightarrow c,c\rightarrow a\}$
Find the keys and superkeys
$X$ | a | b | c | d | a,b | a,c | a,d | b,c | b,d | c,d | a,b,c | a,c,d | a,b,d | b,c,d | a,b,c,d |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
$X^+$ | a,b | b | a,b,c | abcd | ab | abc | abcd | abc | abcd | abcd | abc | abcd | abcd | abcd | abcd |
– | – | – | Key | – | – | SK | – | SK | SK | – | SK | SK | SK | SK |
Desirable properties of schema refinement:
- Eliminate anomalies (BCNF)
- Avoid information loss (BCNF)
- Preserve dependencies (3NF, not BCNF)
- Ensure good query performance (??)