HW5 – Tim Weninger, PhD

Q1. [20 pts]

We have created a zips table under the cse30246 database; You should be able to access it via select * from zips limit 10. Use this table to answer the following questions. Make sure that you report the answer the first time you run the query, multiple uses of the same query get treated differently than the first run.

Note: There is no primary key (for educational purposes only – you should always have a primary key in any table), the city and state columns are not indexed and the city_idx column is indexed.

Tip: use the count(*) function, e.g., select count(*) from zips… to keep from returning millions of results for some queries.

Tip: Remember that you can use desc zips; in order to look at the schema of the table.

a. I expanded the number of zip codes for this assignment. I basically added zeros to the zipcodes to create lots of data. How many rows are in this table?

b. How long does it take to find the city where the zip code is 46556? Why is it so fast/slow?

Issue this query to get a random city and state:

SELECT city, state_id FROM zips ORDER BY RAND() LIMIT 1;

Write down the result – It is important that you use the city from this query in the remainder of problem

c. How long does it take to find the zipcode records (from non-index city-column) with the same name as your randomly chosen city?

Consider b and c – Explain the runtime difference, if any.

d. Now try to find your zip code record using the indexed city_idx column. How long does it take?

What is the difference between c and d? Explain the runtime difference, if any.

e. For this query, remove the last three letters of your city’s name, for example, if you have ‘South Bend’, query for all records starting with ‘South B’. How long does it take to find all records starting with the first few letters of your city using the non-index city-column? Remember to use the LIKE keyword.

Consider e and c. Why does one take longer than the other?

f. Now try the same few letters to find the zip code records of your city using the indexed city column. How long does it take?

Consider f and e. Why does one take longer than the other?

g. We could also try to look up cities ending with the last two letters of your city. For example, if your city is ‘South Bend’, you should query for all cities ending with ‘nd’. Using the non-indexed city-column, how long does this take?

h. Now try using the indexed column city_idx to find cities ending with the same two letters as your city. How long does this take?

Compare and explain the runtimes of g and h.

i. Interestingly, zip codes generally decrease from west to east (California has 90000, and New York has 10000). How long does it take to count the zip codes greater than 20000 and less than 80000?

j. Depending on the table-collation, MySql performs lexigraphical ordering of strings (A<B<a<b<c). How long does it take to discover how many cities are “greater” than the full name of your city using the non-index city column?

k. How long does it take to discover how many cities are greater than the full name of your city using the index city column?

Consider k and j, why is one faster than the other?

Q2. [40 points]

How does Google use an Inverted Index? I used to teach a course on this, and in that course we would make a Google, but we called it the Simple Information Retrieval System (SIRS).

The goal of Q2 is to download, read, and understand how SIRS uses an inverted index to answer queries.

To start log into db.cse.nd.edu. In your home directory. Clone SIRS from GitHub, install a virtual environment, and activate it

Prof Bui may have corrupted your python-path. Check that >which python3 does not point to pbui’s path. If it does, please edit your bashrc to remove pbui’s python from your path or call /usr/bin/python3 instead of python3 in the instructions here.

git clone https://github.com/tweninger/SIRS.git
cd SIRS
/usr/bin/python3 -m venv venv
source ./venv/bin/activate

Ensure that:

~/SIRS$ python -V
Python 3.6.8

Let’s start by getting a few libraries that are needed by SIRS:

pip install --upgrade pip
pip install scrapy bs4 numpy bitarray flask

The first goal is to crawl the Web for data. I have a script that does that in ./index/crawler.py Feel free to explore it. Before we begin, we need a place to put the data. Let’s make a data directory:

mkdir data

Then we can start a crawl. By default this crawler starts at http://cse.nd.edu and goes from there. You are welcome to change or add to that list.

python -m index.crawler

This will run and run and run basically forever. Give it a minute or two to get some data and then use Ctrl+C to kill it.

You’ll notice that the data directory now has a tar file in it with a bunch of Web pages:

ls -al data

You can explore that tar file with:

tar -tf data/NDCrawler_result_large.tar

To show the Web pages you just crawled.

The next thing to do is create an Inverted Index that maps keywords to the documents:

python -m index.indexer

Once this runs you can look back into the data directory to see the index files.

ls data 

anc_idx.txt         doc_idx.txt          idx.txt  NDCrawler_result_large.tar  webgraph.txt
doc_idx_offset.txt  idx_term_offset.txt  lex.txt  runs

Each of these files describes a different part of the index. Take a look through each, print the first line (or two or three depending on how bit it is) and describe what each file contains. For example:

(venv) tweninge@db:~/SIRS$ head data/doc_idx_offset.txt
0
127
256
408
544
675

The doc_idx_offset file describes the offset for each document in the doc_idx.txt file.

Finally, we can use the inverted index to create a Google-type search engine. To do this you’ll execute:

python -m webapp.SIRS

You’ll notice that this probably fails because it, by default, tries to use port 80. So you’ll need to edit the SIRS.py file so that it opens another port. I have opened ports 8080-8089 on db.cse.nd.edu for your use. A port can only be used by one service, so if one fails it may mean that another student is using it. Keep trying different ports until you find one that is open.

Once you find a port that works you will be able to browse to db.cse.nd.edu:8080 (or whatever port). And you’ll see the SIRS search engine.

Issue a query and describe your first few results:

**Make sure to Ctrl+c to close the server when you’ve finished so that others can use your port.

Q3 [20 points]

Suppose that three consistency constraints on a database are:

0≤A≤B, 0≤C≤D, and 0≤X≤Y.

Tell whether each of the following independent transactions traces preserves consistency if they were to be executed completely.

A = A+B; B = A+B;
D = C+D; C = C+D;
X = Y+1; Y = X+1;

Q4 [20 points]

For each of the transactions in Question 3, add the read and write actions to the computation and show the effect of the steps on the main memory, disk, and in an UNDO log.

.Assume initially that {A,C,X}=3 and {B,D,Y}=12 on disk.