Homework 4

  1. [15 points] Recall that 1 block is 1 disk sector; and 1 disk sector is 4KB (4096 bytes). Let the schema of a table be:
create table midterm_grades (
	netid CHAR(8) character set latin1,
	q1_score TINYINT,
	q1_comments varchar(100) character set utf8,
	q2_score TINYINT,
	q2_comments varchar(100) character set latin1,
	q3_score TINYINT,
	q3_comments varchar(100) character set latin1,
	q4_score TINYINT,
	q4_comments varchar(100) character set utf8,
	extra_credit TINYINT
)

Let’s say that the average comment length is 20 characters long, and that each record header contains 12 bytes of important record information.

Answer the following questions:

a. How many bytes is does the average record consume? (show your work)

b. About 960 students have taken my databases midterm at Notre Dame (total). Given that records may not span more than one disk sector. How many sectors are needed to store all the midterm records?


2. [15 points] Notice that Q1 did not contain any primary key. If we add a primary key to the netid column, then we will also need to store the index on disk too. Recalculate your answer to Q1-b to take into account a primary key using a B+ Tree on the netid column.

You can use $d$ of either 2 or 170 (or really any number in between so long as you state it clearly and show your work).


3. [30 points] UTF8 encoding is a variable length encoding. This means that sometimes it uses 1 byte, sometimes it uses 2, 3, or 4 bytes to represent a character. How does it know that a character is 1 or 2 or 3 or 4 bytes?

UTF-8 distinguishes this using a unary encoding. That is, if the bits of a character start with 0, then there is only 1 byte being read for this character. 0xxxxxxx represents a 1-byte character (basically the first 127 ascii characters). If the first bits of the character start with 110xxxxx, then there will be two bytes, 1110xxxx means 3 bytes, and 11110xxx means 4 bytes. See this table for example.

Bytes1st byte2nd byte3rd byte4th byte
10xxxxxxx   
2110xxxxx10xxxxxx
31110xxxx10xxxxxx10xxxxxx
411110xxx10xxxxxx10xxxxxx10xxxxxx

Hence, the capital ‘A’, which is 65 in decimal/ASCII is represented as 01000001 encoded as ‘A’ in UTF-8.

The winky emoji: 😉 is 128,521 in decimal which is 00000001 11110110 00001001 in binary, which is encoded into UTF8 as 11110000 10011111 10011000 10001001 (with only 21 encoded bits)

“Encoding errors” happen when a file encoded in one format is decoded in another format:

Let’s work to understand what’s happening here.

a. Let’s say that the string ‘Hi’ was encoded with ASCII.
where H is represented in decimal as 72 and
where i is represented in decimal as 105

How would it appear if decoded in UTF8?

b. Let’s say that the string ‘数据库’ (which is Chinese Shùjùkù which translates to “Database” in English is encoded with UTF-8
where 数 is represented in decimal as 25968 and
where 据 is represented in decimal as 25454 and
where 库 is represented in decimal as 24211
(according to https://www.compart.com/)


How would it appear if decoded with Latin-encoding (i.e., ASCII)? (show your work)


4. [20 points] Let $d$=2. Place the data from the price of the table below into a B+ tree indexing on the following table:

model speed ram hd price
10012.6640962502114
10022.102048250995
10031.42204880478
10042.804096250649
10053.202048250630
10063.2040963201049
10072.204096200510
10082.208192250770
10092.004096250650
10102.808192300770
10111.868192160959
10122.804096160649
10133.06204880529

5. [0 points] Get to work on Stage 3 Demo!