So far we can:

- Design an ER Diagram
- Translate it to a relational DB $R$
- Implement $R$ in SQL
- Specify constraints over $R$

However $R$ might not be well designed. And this will cause problems. We call these *anomalies.*

Question? Is this a good design?

Address | SSN | Phone |
---|---|---|

10 Green | 123-45-6789 | 521-632-6859 |

10 Green | 123-45-6789 | 587-658-9362 |

14 Purple | 909-64-5874 | 857-965-6951 |

14 Purple | 909-64-5874 | 547-658-7854 |

Why or why not?

Potential Problems:

- Redundancy
- Update Anomalies
- Deletion Anomalies

How do we recognize a **good **design? What does **good** mean?

## Normal Forms

There are several important normal forms: 1NF, 2NF, 3NF, BCNF, 4NF etc.

If $R^*$ is in one of these forms, then $R^*$ is guaranteed to achieve certain good properties, *e.g.*, if $R^*$ is in Boyce Codd NF then it is guaranteed to not have certain type of redundancy.

There exist algorithms to transform $R$ into $R^*$, where $R^*$ is in one of the normal forms.

Each type of normal form is different, there are trade offs. NFs are all based on the idea of functional dependencies.

## Functional Dependencies and Keys

Let’s look at an example: Is this a good design?

Address | SSN | Phone |
---|---|---|

10 Green | 123-45-6789 | 521-632-6859 |

10 Green | 123-45-6789 | 587-658-9362 |

14 Purple | 909-64-5874 | 857-965-6951 |

14 Purple | 909-64-5874 | 547-658-7854 |

How can we make it better?

SSN | Phone |
---|---|

123-45-6789 | 521-632-6859 |

123-45-6789 | 587-658-9362 |

909-64-5874 | 857-965-6951 |

909-64-5874 | 547-658-7854 |

SSN | Address |
---|---|

123-45-6789 | 10 Green |

909-64-5874 | 14 Purple |

Why is this better?

## Functional dependencies

A form of constraint (part pf the schema), finding them is part of the database design.

Definition: If two tuples agree on the attributes $a_1,a_2,a_3,\ldots,a_n$, then they must also agree on the attributes $b_1,b_2,b_3,\ldots,b_m$.

Formally: $a_1,a_2,a_3,\ldots,a_n \rightarrow b_1,b_2,b_3,\ldots,b_m$ pronounced $X$ determines $Y$.

Let’s try an example:

SALPERS_ID | SALPERS_NAME | MANAGER_ID | OFFICE | COMM |
---|---|---|---|---|

10 | Rodney Jones | 27 | Chicago | 10 |

14 | Masaji Matsu | 44 | Tokyo | 11 |

23 | Francois Moire | 35 | Brussels | 9 |

37 | Elena Hermana | 12 | B.A. | 13 |

39 | Goro Azuma | 44 | Tokyo | 10 |

27 | Terry Cardon | Chicago | 15 | |

44 | Albert Ige | 27 | Tokyo | 12 |

35 | Brigit Bovary | 27 | Brussels | 11 |

12 | Buster Sanchez | 27 | B.A. | 10 |

What FDs can we find?

Salpers_ID $\rightarrow$ Name, Manager_ID, Office, Comm

Name $\rightarrow$ Salpers_ID

Office $\xrightarrow{?}$ Comm

In general, we check if $X\rightarrow Y$ is violated by erasing all other columns and checking many to one, called **functional **in mathematics.

Name $\rightarrow$ Salpers_ID. We say that name *determines *salpers_ID in this case. If we know the name, then we can known the ID.

Salpers_ID $\rightarrow$ Name is likewise true in this case.

Consider the following instance of $R(a,b,c,d,e)$. What FDs can you observe in $R$?

a | b | c | d | e |
---|---|---|---|---|

1 | 1 | 1 | 1 | 1 |

2 | 1 | 2 | 2 | 1 |

3 | 2 | 1 | 1 | 1 |

4 | 2 | 2 | 2 | 1 |

5 | 3 | 3 | 1 | 1 |

$a$ is unique: $a\rightarrow b, a\rightarrow c, a\rightarrow d, a\rightarrow e \equiv a\rightarrow a,b,c,d$

$e$ are all the same: $a\rightarrow e, b\rightarrow e, c\rightarrow e, d\rightarrow e$

Others:

Combinations of $b,c$ are unique: $b,c\rightarrow a,d,e$

Combinations of $b,d$ are unique: $b,d \rightarrow a,c,e$

if $c$ values match, so do $d$ values: $c\rightarrow d$

$d$ values do not determine $c$: $d\nrightarrow c$

lots of others… $a,e\rightarrow b,c$, etc.

## Reasoning with FDs

#### Relation Keys

Now that we know FDs we can define keys mathematically.

**Key** of a relation $R$ is a set of attributes that:

- Functionally determines all attributes of $R$
- None of its subsets determines all attributes of $R$.

**Superkey**

- A set of attributes that contains a key.

*A key is the minimal set.*

We need to make sure our keys are well defined.

#### Closure of FD Sets

Given a relation schema $R$ and a set of FDs $S$, we ask:

Is FD $f$ logically implied by $S$?

Example:

$R = \{a,b,c,g,h,i\}$

$S=\{a\rightarrow b, a\rightarrow c, c,g\rightarrow h, c,g\rightarrow i, b\rightarrow h\}$

Does $a\rightarrow h$ follow? *Yes, *and we can prove this.

Closure of $S : S^+$ contains all of the FDs that follow. We use **Armstrongs axioms** to find these.

#### Armstrongs Axioms

wlog: $X=\{x_1, x_2,\ldots,x_n\}$

Reflexivity:

$X\rightarrow $ some subset of $X$

Augmentation

if $X\rightarrow Y$, then $X,Z\rightarrow Y,Z$

Transitivity

if $X\rightarrow Y\wedge Y\rightarrow Z$, then $X\rightarrow Z$

Union

if $X\rightarrow Y\wedge X\rightarrow Z$, then $X\rightarrow Y,Z$

Decomposition

if $X\rightarrow Y,Z$, then $X\rightarrow Y \wedge X\rightarrow Z$

Pseudo-Transitivity

if $X\rightarrow Y\wedge YZ\rightarrow U$, then $XZ\rightarrow U$

Example:

$R(a,b,c), S=\{a\rightarrow b, c\rightarrow a\}$

What is the FD closure $S^+$?

- $c\rightarrow b$ (transitivity)
- $c,a \rightarrow c,a,b$ (augmentation on 1)
- $b,c\rightarrow a,b,c$ (augmentation on $c\rightarrow a$)
- $a,b,c \rightarrow a,b,c$ (reflexivity)

Say $R(a,b,c,d), S=\{a,b\rightarrow c, b,c\rightarrow a,d\}$. Does $a,c\rightarrow d$ follow from $S$?

Solve by checking $d\in \{a,c\}^+$

In this case, $\{a,c\}^+ = \{a,c\}$, and since $d \notin \{a,c\}^+ $, then $a,c\rightarrow d$ does not follow from $S$

#### Closure of Attribute Sets

Given a set of attributes $A={a_1,a_2,\ldots,a_n}$ and a set of dependencies $S$. We want to find all the attributes $B$ such that any relation which satisfies $S$ also satisfies $\{a_1,\ldots,a_n\}\rightarrow B$.

This closure is the set of all such attributes $B$ and is denoted $\{a_1,a_2,\ldots,a_n\}^+$. We use this to test if $A$ is a superkey.

We first compute $X^+$ and check of $X^+$ contains all attributes of $R$. We can also check if $X\rightarrow Y$ holds by checking if $Y\in X^+$

Example: finding keys.

$R(abc), S=\{a\rightarrow b, c\rightarrow a\}$

Consider all attribute subsets $X$

Compute the closure of each subset $X^+$

Determine if the subset X is a superkey or key

$X$ | a | b | c | a,b | a,c | a,c | a,b,c |
---|---|---|---|---|---|---|---|

$X^+$ | a,b | b | a,b,c | a,b | a,b,c | a,b,c | a,b,c |

– | – | key | – | superkey | superkey | superkey |

Another Example:

$R(abcd),S=\{a \rightarrow b,d \rightarrow c,c\rightarrow a\}$

Find the keys and superkeys

$X$ | a | b | c | d | a,b | a,c | a,d | b,c | b,d | c,d | a,b,c | a,c,d | a,b,d | b,c,d | a,b,c,d |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

$X^+$ | a,b | b | a,b,c | abcd | ab | abc | abcd | abc | abcd | abcd | abc | abcd | abcd | abcd | abcd |

– | – | – | Key | – | – | SK | – | SK | SK | – | SK | SK | SK | SK |

**Desirable properties of schema refinement:**

- Eliminate anomalies (BCNF)
- Avoid information loss (BCNF)
- Preserve dependencies (3NF, not BCNF)
- Ensure good query performance (??)