Spinning Disks

The Memory Hierarchy

Cache – on the processor, 2-10Mb depending on the chip

Main memory – DDR3 4-16 GB in laptops, 200-1Tb on servers

Secondary Storage – 200GB-10TB of disk. Spinning disk is still widely used on servers

Tertiary Storage – CDs, Tape backup – very very slow, but cheap and large capacity

Disks

Disk platters are usually two sided. A HDD has many disk platters, and a disk head that moves together, never touching the disk. A Cylinder is the alignment of data vertically.

Disk platters are separated by gaps so the magnetic material doesn’t get corrupted. Gaps are about 10% of the disk space.

There are usually tens of thousands of tracks, 256 sectors per track, 4096 bytes per sector.

Files

Files are spread across sectors. The OS knows where these sectors are located.

Arranging data on a disk

A data element is represented as a record, which consists of consecutive bytes on a disk. Usually a file claims at least a whole block, and there is one table per file (usually).

Fixed length fields:

Our data schema contains column names. Each field has a data type and a length consistent with the data type. INTs are 4 Bytes by default. Char is 1-4 Bytes (depending on encoding), Float is 4 Bytes, Double is 8 bytes. Let’s assume Latin encoding for now: 1 character = 1 byte

create table customer (
 	cust_id INTEGER PRIMARY KEY, 
 	cust_name CHAR(25), 
 	city CHAR(25), 
 	country CHAR(25), 
 	beg_bal INTEGER, 
 	cur_bal INTEGER
);

Finish the data arrangement for customer:

Each record contains:

Pointer to the record schema: 8 bytes
length of record: 8 bytes
timestamp of record creation: 8 bytes 
cust_id requires 4 bytes
cust_name requires 25 bytes
city requires 25 bytes
country requires 25 bytes
beg_bal requires 4 bytes
cur_bal requires 4 bytes

Layout of a record holder

Records of the same relation are put together in a block:

The header of a block holds other information:

  • Links to other blocks
  • What relation the block belongs to
  • A table of contents for the various records in the block

Variable Length data records:

  • Data items whole size varies, non uniform length strings
  • Enormous fields
  • Unknown data type fields (XML, and NOSQL)

Earlier CHARs used one byte per character. That is because we assumed Latin character encoding.

UTF-8 uses up to 3 or 4 bytes per character (lets assume 3 bytes – MySQL does it weirdly).

In this case the header of a record tells the length of the records:

create table customer (
      cust_id INTEGER PRIMARY KEY,
      cust_name VARCHAR(25),
      city VARCHAR(25),
      country VARCHAR(25),
      beg_bal INTEGER,
      cur_bal INTEGER
);

Finish the data arrangement for customer:

Each record contains:

Pointer to the record schema: 8 bytes
length of record: 8 bytes
timestamp of record creation: 8 bytes
cust_id requires 4 bytes
cust_name requires ? bytes
city requires ? bytes
country requires ? bytes
beg_bal requires 4 bytes
cur_bal requires 4 bytes

Let’s say that cust_name and city and country are only 8 characters on average = 24 bytes max each.

cust_name requires 24 bytes + 1 for the size int
city requires 24 bytes + 1 for the size int
country requires 24 bytes + 1 for the size int

Storing HUGE items

CLOB and BLOB

Needs to store files in consecutive blocks in the database. Much like how files are stored on disk.

Usually best to store in a filesystem with pointers in DB.

Inserting Records

When we insert a record we can just append the item to the end of set of blocks. But if the record has a primary key (as it usually should), then we need to slide records around and sometimes claim nearby blocks.

Some databases have Recompaction functions that place the current data nicely together on disk by claiming entire swaths of new disk blocks.

Deleting Records

When we delete we can often reclaim lost space. Sometimes we can maintain an “available space” listing in the block header.

Updating Records

When we update a record on a fixed system then there is no problem. No changes needs to be made. If we update a variable size record then we have all the same problems as inserts and/or deletes.