CAP and HBase

NoSQL – Overview

Not only SQL.
CAP Theorem – Brewer’s Theorem

  • Consistency (all nodes see the same data at the same time)
  • Availability (a guarantee that every request receives a response about whether it succeeded or failed)
  • Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

Types of databases

  • Relational
  • Key-Value
  • Column-Oriented
  • Document-Oriented

Row-Oriented Databases

+ Easy to add rows and data
– Might read unnecessary data

Column-Oriented Databases

+ only read relevant data
– tuple writes require multiple accesses

Column-Oriented DBs: suitable for read-mostly, read-intensive, large data repositories

Fundamental difference is only in the storage layout – but this has huge implications.

Example – read efficiency

select team from cfb_game;  -- real-world tables are much, much more complex

A row-oriented DB reads the entire table into memory and then projects on team.

Columnar DB reads only the column we need.
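
The difference can be sketched in a few lines of Python. This is a toy model rather than how any real engine works: the rows, and the game_date and points columns, are hypothetical; only the team column and the cfb_game table come from the query above. It counts how many stored values each layout has to touch to answer the query.

```python
# Hypothetical schema and rows; only "team" appears in the notes.
COLUMNS = ("team", "game_date", "points")
ROWS = [
    ("Notre Dame", "2019-09-02", 35),
    ("Michigan", "2019-09-07", 24),
    ("Navy", "2019-09-14", 17),
]

# Column layout: one list per column, built from the same rows.
COLUMN_STORE = {name: [row[i] for row in ROWS] for i, name in enumerate(COLUMNS)}


def row_store_project(rows, column):
    """Row layout: every full tuple is read, then one field is kept."""
    idx = COLUMNS.index(column)
    values_touched = 0
    result = []
    for row in rows:
        values_touched += len(row)  # the whole tuple comes off disk
        result.append(row[idx])
    return result, values_touched


def column_store_project(columns, column):
    """Column layout: only the requested column is read."""
    values = columns[column]
    return list(values), len(values)


row_result, row_touched = row_store_project(ROWS, "team")
col_result, col_touched = column_store_project(COLUMN_STORE, "team")
print(row_result == col_result)  # same answer from both layouts
print(row_touched, col_touched)  # values touched by each layout
```

With three columns the row layout touches three times as many values; the gap grows with the width of the table.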

Example – compression efficiency

Columns compress better than rows

  • Typical row-store compression ratio 1 : 3
  • Column-store 1 : 10

Why? Take 2 minutes to guess the reasons.

Rows contain values from different domains => more entropy, difficult to dense-pack

Columns exhibit significantly less entropy than rows
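
A rough way to see the entropy argument is to compress the same bytes in both layouts. The sketch below uses Python's zlib on a synthetic, hypothetical table (sorted team and year columns plus a noisy score column). The sizes it prints apply to this made-up data only and are not meant to reproduce the 1 : 3 and 1 : 10 figures above.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
TEAMS = ["Notre Dame", "Michigan", "Navy", "Stanford"]  # hypothetical values

# Synthetic table sorted by year and team, with a noisy score column.
# All fields are fixed width: 12 + 4 + 2 = 18 bytes per row.
records = []
for i in range(4000):
    team = TEAMS[(i // 50) % len(TEAMS)].ljust(12)
    year = str(2010 + i // 500)
    score = f"{rng.randrange(100):02d}"
    records.append((team, year, score))

# Row layout: team, year, score of row 0, then of row 1, and so on.
row_major = "".join(team + year + score for team, year, score in records).encode()

# Column layout: all teams, then all years, then all scores.
column_major = "".join(
    "".join(record[c] for record in records) for c in range(3)
).encode()

row_size = len(zlib.compress(row_major))
col_size = len(zlib.compress(column_major))
print(len(row_major), row_size, col_size)
```

Both inputs contain exactly the same bytes in a different order. In the column layout the long runs of identical team and year values sit next to each other, so the compressor is not interrupted by a score after every row.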

Example – difference in scanning

Columnar Optimizer can retrieve data from two different disks at the same time and combine them in the project operator

Tradeoffs – as the number of columns increases, so do the benefits of columnar DBs.
Note: tuple width is measured in bytes, not in number of columns.

“Cycles per disk byte” accounts for many disks, many CPUs, and competing traffic.

  • Many CPUs and many disks favor columnar.
  • A single CPU and a single disk favor row store.

Row Key

Each HBase row must have a row key, which ties everything together. The key can be any type of data, even a complex object; HBase stores it as a byte array.

Column Families

Columns in HBase are grouped into column families

Each column family can have multiple columns, which are stored together on disk – optimized for compression and storage.

Let’s say you are a Bank and want to make a Person Table

name, address, birthdate, gender, ssn, ethnicity, fica_score, acct balance, height

What column families would you make?

A column family is defined with the table; a column is not. Columns are defined during insertion.

Each cell is stored as a triplet <column_name, value, timestamp>. The column name is written 'cf:col', where cf is the column family and col is the column qualifier. The column family name must consist of printable characters; the qualifier can be arbitrary bytes.
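
One way to picture this layout is as a nested map: row key, then column name, then a list of timestamped versions. The class below is a toy in-memory sketch, not the HBase client API. ToyTable and its methods are hypothetical names, and a simple counter stands in for HBase's timestamps.

```python
from collections import defaultdict


class ToyTable:
    """Toy model of HBase's logical layout (not the real client API)."""

    def __init__(self, *families):
        # Column families are fixed when the table is created.
        self.families = set(families)
        # row key -> "family:qualifier" -> [(timestamp, value), ...], oldest first
        self.rows = defaultdict(lambda: defaultdict(list))
        self.clock = 0  # stand-in for a real timestamp

    def put(self, row_key, column, value):
        family, _, _qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Columns need no declaration: the qualifier is created on insertion.
        self.clock += 1
        self.rows[row_key][column].append((self.clock, value))

    def get(self, row_key, column):
        """Return the newest (timestamp, value) for a cell, or None."""
        versions = self.rows.get(row_key, {}).get(column, [])
        return versions[-1] if versions else None


person = ToyTable("pd", "d", "b")
person.put("row1", "pd:last_name", "Tompson")
person.put("row1", "pd:first_name", "Bob")
person.put("row1", "pd:last_name", "Thompson")
print(person.get("row1", "pd:last_name"))  # newest version of the cell
print(person.rows["row1"]["pd:last_name"])  # the older version is still there
```

A put to an undeclared family fails, while a put to a new qualifier inside a declared family simply creates the column.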

HBase on dsg1

$hdfs namenode
$hdfs datanode
$hbase zookeeper
$hbase regionserver start
$hbase master start
$hbase
$hbase shell
>create 'person', 'pd', 'd', 'b'    # for personal data, demographics and bank
>scan 'person'
>put 'person', 'row1', 'pd:last_name', 'Tompson'
>scan 'person'

$ hadoop fs -ls /hbase      # to show how the data is stored

db.cse.nd.edu:9870/hbase/data/default // stores columns as files.  Nothing there?!?!

The data is still in the memstore and has to be flushed to disk. This happens automatically when the memstore fills up, or manually:

>flush 'person'

db.cse.nd.edu:9870/hbase/data/default
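
The write path just shown (a put lands in memory first and reaches disk only on a flush) can be modeled with a small toy class. This is a sketch for intuition, not HBase internals: ToyRegion is a hypothetical name, and the threshold here counts cells, whereas a real memstore threshold is based on size in bytes.

```python
class ToyRegion:
    """Toy write path: buffer puts in memory, flush them to immutable 'files'."""

    def __init__(self, flush_threshold=3):  # hypothetical threshold, in cells
        self.flush_threshold = flush_threshold
        self.memstore = {}     # recent writes, held in memory only
        self.store_files = []  # flushed snapshots, oldest first

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()  # automatic flush once the buffer is full

    def flush(self):
        """Write the buffered cells out as one sorted, immutable snapshot."""
        if self.memstore:
            self.store_files.append(dict(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, key):
        """Check memory first, then the flushed files from newest to oldest."""
        if key in self.memstore:
            return self.memstore[key]
        for store_file in reversed(self.store_files):
            if key in store_file:
                return store_file[key]
        return None


region = ToyRegion()
region.put(("row1", "pd:last_name"), "Tompson")
print(len(region.store_files))  # nothing on "disk" yet
region.flush()
print(len(region.store_files))  # one file after the manual flush
print(region.get(("row1", "pd:last_name")))
```

Reads see the value both before and after the flush, which is why the scan works even when the data directory looks empty.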

>put 'person', 'row2', 'pd:last_name', 'Tompson'
>put 'person', 'row2', 'pd:first_name', 'Mary'
>scan 'person' # now a birthday
>put 'person', 'row1', 'pd:first_name', 'Bob'
RowKey  timestamp  First_Name  Last_name
row1    1                      Tompson
        4          Bob         Tompson
        5          Bob         Thompson
row2    2                      Tompson
        3          Mary        Tompson

What does the current database look like in Tuple-Storage format?

>scan 'person' # updated with a new timestamp; the old value is kept (depending on configuration)
>get 'person', 'row1'
>get 'person', 'row1', 'pd'
>get 'person', 'row1', 'pd:first_name'
>put 'person', 'row1', 'pd:last_name', 'Thompson' # let's change Bob's last name
>scan 'person'
>scan 'person', {COLUMNS => ['pd'], FILTER => "(SingleColumnValueFilter('pd','first_name',=,'regexstring:M.*ry',true,true))"}

>show_filters
>drop 'person'    # fails: the table must be disabled first
>disable 'person' # do this first
>drop 'person'
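
A note on the regexstring comparator used in the filter scan above: it takes a regular expression, not a shell-style glob. The Python sketch below uses the re module to show how a glob-style pattern such as M*ry differs from M.*ry. It assumes the comparator searches for the pattern anywhere in the value, and the list of names is hypothetical.

```python
import re

names = ["Mary", "Bob", "Mry", "Henry"]  # hypothetical first names

glob_style = re.compile(r"M*ry")    # zero or more "M", then "ry"
regex_style = re.compile(r"M.*ry")  # "M", then anything, then "ry"

print([n for n in names if glob_style.search(n)])
print([n for n in names if regex_style.search(n)])
```

The first pattern also accepts Henry, because zero occurrences of M followed by ry is enough to match; the second requires an M before the ry.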