Using an Index

Indexing for Joins

IN CLASS – Why would an index make joins faster?

Would an index improve a join if the relation is sorted on the joinable columns?

A lot of times an index is only used to test the existence of data … Not actually return anything

IDName
1Allison
2Ted
3Phil

To find the name for an ID an index in ID would be nice.

What if an index is created on (ID, Name) together?

                Would a lookup by name be faster?

                                No… because the first element in the index is the id.

Comparison

To compare joins between indexed and non-indexed columns

>mysql -p  ##gets you in to mysql

once in mysql:

>use cse30246;
>show tables;

We’ll use cfb_team table

>desc cfb_team;

Let’s try to join where there is no index.

but first turn on profiling:

> set profiling = 1;
> select count(*) from cfb_game A, cfb_game B where A.stadium_id = B.stadium_id;
+----------+
| count(*) |
+----------+
| 392414   |
+----------+
1 row in set (2.55 sec)
> show profile;

sending data takes the most time – this is because of a table scan!

(don’t run same query twice, results are cached – if instantaneous result happens then make some small syntax change)

> select count(*) from cfb_game A, cfb_game C where A.game_id = C.game_id;
+----------+
| count(*) |
+----------+
|     7220 |
+----------+
1 row in set (0.02 sec)
> show profile;

This is much faster because it is indexed.

We can even do this on a much bigger table:

> select count(*) from cfb_play A, cfb_play B where A.clock = B.clock;

And it’s going to take a long long while.

> select count(*) from cfb_play A, cfb_play B where A.game_id = B.game_id and A.play_number = B.play_number;
+----------+
| count(*) |
+----------+
|  1327076 |
+----------+
1 row in set (3.67 sec)

The play_number column is also part of the primary key index. But its not the first item.

How long should this query take?

> select count(*) from cfb_play A, cfb_play B where A.play_number = B.play_number;

(Forever)

It’s still slow because the play_number column is part of the index in the second column position. You can’t quickly search/join like this. 

If primary key was ordered with play_number first, then you could join and search on play_number (but then not on game_id)

I created a new table called cfb_game_noidx that is a shuffled version of cfb_game and also does not have any indexes.

> desc cfb_game_noidx;
> select * from cfb_game_noidx limit 10;

We can join the indexed game with the non-indexed game with the following query. How long will it take?

> select count(A.game_id) from cfb_game A, cfb_game_noidx B where A.game_id = B.game_id;
+------------------+
| count(A.game_id) |
+------------------+
|             7220 |
+------------------+
1 row in set (0.01 sec)

Almost as fast as the indexed self-join we did before!

Could be slightly slower on big tables because of different types of joins.

I’ve created yet another table called cfb_game_noidx_sorted that is sorted by game_id  (just like cfb_game)

> desc cfb_game_noidx_sorted;
> select * from cfb_game_noidx_sorted limit 10;

So lets check it out…

> select count(A.game_id) from cfb_game A, cfb_game_noidx_sorted B where A.game_id = B.game_id;
+------------------+
| count(A.game_id) |
+------------------+
|             7220 |
+------------------+
1 row in set (0.03 sec)

non-index sorted joined to index is almost as fast as index join index.

Unfortunately, 

> select count(A.game_id) from cfb_game_noidx_sorted A, cfb_game_noidx_sorted B where A.game_id = B.game_id;
+------------------+
| count(A.game_id) |
+------------------+
|             7220 |
+------------------+
1 row in set (2.10 sec)

sorted join sorted is not fast…. because mysql doesn’t know that its sorted 🙁

> select count(*) from cfb_game A, cfb_game C where A.stadium_id = C.stadium_id; 

We can always create an index –

> create index game_idx on cfb_game (stadium_id) using btree;
Query OK, 0 rows affected (0.05 sec)
Records: 0  Duplicates: 0  Warnings: 0
> show indexes from cfb_game;

This can take varying amounts of time depending on the size and cardinality of the column(s) being indexed.

> select count(*) from cfb_game A, cfb_game C where A.stadium_id = C.stadium_id;
+----------+
| count(*) |
+----------+
|   392414 |
+----------+
1 row in set (0.09 sec)

We should probably drop it.

> drop index game_idx on cfb_game;
> create index game_idx on cfb_game (stadium_id) using hash;

49 is Notre Dame Stadium;

> select count(*) from cfb_game where stadium_id > 49;

How fast will this be?

> drop index game_idx on cfb_game;

Pros of creating an index:

                Fast lookups and joins
                Certain indexes store things in sorted order therefore order by is quick.

Cons

                Take a lot of extra disk space.

> create table cfb_test like cfb_game;
> insert into cfb_test from select * from cfb_game;

Size on disk?

Some things can’t be looked up effectively in an index.

General rule of thumb: create an index when you know that you will need quick results or perform joins on those columns many many times.

Do not create indexes on columns that you never query or join – this just wastes space.