Note of Drexel INFO605.

Why do we need Database Normalization?

注意:Normalization这个词只限于本篇,现实生活中人们不会这样叫的

一般的DB设计是先根据事实,主观地理解,然后画出ERD,然后再将ERD转化成DB schema.

This design approach thus far has been largely intuitive, reliant upon the intuition and common sense of the database designer. So we need a formal criteria that can unambiguously (or at least less ambiguously) determine why some schemas are better than others.

也就是,噔噔噔噔~~范式(Normal Form)~~

ER modeling and resulting translations to relational schemas are both top-down intuitive database design approaches.

范式就是相反的,自下而上的设计DB的方法啦。通常现实中,程序员先用intuitive approaches设计DB,然后检查设计是否符合范式。

What is the goal of Normalization?

  1. 消除冗余啊!(最最最重要的,占99%)
  2. 让表和属性的含义更清晰
  3. 减少空值

Any given fact recorded in our database should be stored in only one authoritative location. Maintenance of such facts is thereby eased, as we create / modify / delete them one time and in one place, with none of the problems that can result from keeping multiple copies of the same data.

当然消除冗余出了能避免各种增删改查的异常外,还能减少硬盘用量,还有数据库也能跑得快一点。

突然想起以前写课程设计,有的人的数据库,插入一个数据需要跑数据库几百遍。不过老师也不管的啦能动就行啦,弄一个Fancy的界面什么的就能高分 (摊手

How to Normalize database?

范式有第12345范式,不过一般符合到第三范式(3nd Normal Form)就行了。是向上兼容的,也就是符合第二范式的肯定符合第一范式,符合第三范式的也符合第二第一范式,如此类推。通常只要符合第三范式就够了。不过,有时因为性能问题,你也可以违反第三范式的啦。

Steve说如果遇到第四第五范式的请务必打电话给他,太稀有了他要写进案例。

Functional Dependency

The majority of normalization is based on the concept of functional dependency.

我们用X ➔ Y 来表示Y依赖于X,或者说,X决定Y。也就是,如果确定了X值,就能推出唯一的Y值。

举例,每一个身份证号的值对应一个人名的姓。例如确定身份证号 “1122” 所对应的人的姓是 “陈” 。相反,姓 “陈” 的并不能对应唯一的身份证号。身份证号就是X,姓就是Y。

Functional dependencies can involve multiple attributes on the left hand (determining) side:

Building, Room ➔ SeatingCapacity

左端,要保持最简化。不要增加非必要的属性。

右端,只写一种属性。例如:

Drexel_Id ➔ LastName

Drexel_Id ➔ FirstName

不要合并写成: Drexel_Id ➔ LastName, FirstName

Remember that FDs are related to schemas (meanings of attributes) and not the extensions (data instances) available at any given time.

也就是就算你不知道年收入 (例如Salary为Null),不代表 SSN ➔ Salary 这个依赖关系不存在。

Keys

Some further consideration of keys is necessary to fully explain the 1st, 2nd and 3rd normal forms

下面这段不想写的,可是我怕自己回头忘了留个底吧。可以不看,直接跳到举例说明

  • Superkey: a set of attributes in a relation that will never have the same value acrosstuples (i.e., that can be used to uniquely identify a tuple)
  • Key: aminimal superkey – if it lost any more attributes, it would no longer be a superkey (would no longer be a unique identifier)
  • Any key that could be used as the primary key of a relation schema is called a candidate key
  • The designer chooses one key to be the primary key, and all candidate keys not chosen to be the primary key become known as alternate keys.
  • Prime attribute: member of some candidate key in a relation
  • Non-prime attribute: not a member of some candidate key

Example relation (assume that all students have SSNs):

Student (DrexelNumber, SSN, FirstName, LastName)

Superkeys : {DrexelNumber, SSN, FirstName, LastName}, {SSN, LastName}, {DrexelNumber}, {SSN},. . .

Candidate Keys : {DrexelNumber}, {SSN}

Primary Key : {DrexelNumber}

Alternate Key :{SSN}

Prime attribute : DrexelNumber, SSN

Non-prime attribute : FirstName, LastName

Normal Form

We say that a relation is normalized to the form of the highest normal form that is not violated by the structure of the relation. To remove a violation, we decompose the relation (split it vertically, breaking some columns into a separate table)

1st Normal Form

Relations in 1NF do not have repeating groups!

For example: Athlete (Bib#, FirstName, LastName, Event1, Event2, Event3, Event4)

‘Event’ attributes is a repeating group.

We need to move the repeating group to a separate table (one for each repeating group) along with a copy of the Pirmary Key from the original relation.

Athlete (Bib#, FirstName, LastName)

AthleteEvents (Bib#, Event) (note composite PK)

If a relation does not have a repeating group to start with, then it is already in 1NF

2nd Normal Form

First criterion of 2NF is that the relation meets 1NF.

2NFdepends upon the concept of full functional dependency which means that we can’t remove any attribute from the lefthand side of the dependency and still have the dependency hold for theremaining attributes.

前面说过candidate key就是那些可以选做PK的属性,都是唯一标志属性。

举例,有这样的关系 Room (Building, RoomNum, SeatingCapacity, ZipCode)

这样写的话就代表有以下Functional Dependency:

Building,RoomNum ➔ SeatingCapacity (这是full functional dependency)

Building,RoomNum ➔ ZipCode (这不是)

因为事实上,只要有Building就能确定ZipCode,不需要RoomNum,所以不是full functional dependency

改成符合2NF的话要拆分成两个表,将ZipCode和Building拆出来成另一个表:

Building (Building, ZipCode)

Room (Building, RoomNum, SeatingCapacity)

Building 是 Room表的外键

Rule: Normalize a relation to 2NF by moving attributes that are not fully functionally dependent on all candidate keys out to a separate relation, along with copies of the prime attributes on which they are dependent (now the PK of the new relation).

The PK of the new relation leaves a corresponding foreign key in the original relation (Building, in example above)

3nd Normal Form

First criterion of 3NF is that the relation is in 2NF

3NF relies upon the concept of transitive dependency

也就是存在关系X ➔ Y,Y ➔ Z,而且Y是non-prime attribute,就存在X ➔ Z(很好理解嘛)

例如 Employee(SSN, LastName, DateofBirth, DeptNumber, DeptMgr)

假设存在 DeptNumber ➔ DeptMgr

那丛SSN就能得到DeptNumber,就能的得到DeptMgr。

所以改成符合3NF的关系的话:

将DeptManager (这是non-prime attributes) 分到一个独立的表,这个表的PK为DeptNumber (中间值)

Employee (SSN, LastName, DateofBirth, DeptNumber)

Department (DeptNumber, DeptManager)

DeptNumber 是 Employee表的外键

Rule: Normalize a relation to 3NF by moving non-prime attributes that are transitively dependent on any candidate key out to a separate relation, along with copies of the attributes upon which they are non-transitively dependent (these become the PK of the new relation).

The“Y” (hop) becomes a foreign key in the original relation and points at the PKof the new relation


Here is a mnemonic: “The key, the whole key, and nothing but the key.”

当然有时候我们是需要Denormalization,可以提高读数据库的性能。一般建议先将DB设计成符合3NF的,再按需Denormalization.

Exercise

尝试一步步将下面关系写到符合1NF,2NF,3NF。

Flight (Route#, Flight#, DestinationAirport, DepartureDate, DepartureTime, Pilot1, Pilot2, Pilot3, PlaneSerial#, Capacity, LastServicedDate)

Route#, Flight# ➔ (any non-prime attribute in this relation, but maybe transitively)

Route# ➔ DestinationAirport

PlaneSerial# ➔ Capacity

PlaneSerial# ➔ LastServicedDate

Answer

我就不将外键表示出来了。在Markdown很难画我又懒得插图。

1NF:

Flight (Route#, Flight#, DestinationAirport, DepartureDate, DepartureTime, PlaneSerial#, Capacity, LastServicedDate)

FlightPilot (Route#, Flight#,Pilot)

2NF:

Flight (Route#, Flight#, DepartureDate, DepartureTime, PlaneSerial#, Capacity, LastServicedDate)

FlightPilot (Route#, Flight#, Pilot)

Route (Route#, DestinationAirport)

3NF:

Flight (Route#, Flight#, DepartureDate, DepartureTime, PlaneSerial#)

FlightPilot (Route#, Flight#, Pilot)

Route (Route#, DestinationAirport)

Plane (PlaneSerial#, Capacity, LastServicedDate)