Distributed SQL

Limitations of MapReduce:

Everything must be expressed in the map/reduce model
Jobs are not reusable
Hand-writing jobs is error prone

For complex jobs:

Multiple stages of map/reduce functions are needed

Everything gets written to disk between stages and at the end.

Loops are effectively impossible (impossibly slow)
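To see why the disk writes hurt, here is a toy sketch (plain Python, not Hadoop's actual API) of a two-stage pipeline where each stage must serialize its entire output to disk and the next stage must read it all back before doing any work:

```python
# Toy illustration of chained "jobs": every stage writes its full output
# to disk, and the next stage reads it all back in. Multiply this by
# terabytes of data and many stages, and the I/O cost dominates.
import json, os, tempfile

def run_stage(input_path, output_path, fn):
    # Each "job" reads everything from disk, transforms, writes everything back.
    with open(input_path) as f:
        records = [json.loads(line) for line in f]
    results = [fn(r) for r in records]
    with open(output_path, "w") as f:
        for r in results:
            f.write(json.dumps(r) + "\n")

tmp = tempfile.mkdtemp()
paths = [os.path.join(tmp, f"stage{i}.jsonl") for i in range(3)]

# Seed input: the numbers 0..4, one JSON record per line.
with open(paths[0], "w") as f:
    for i in range(5):
        f.write(json.dumps(i) + "\n")

run_stage(paths[0], paths[1], lambda x: x * 2)   # stage 1: double
run_stage(paths[1], paths[2], lambda x: x + 1)   # stage 2: increment

with open(paths[2]) as f:
    print([json.loads(line) for line in f])       # [1, 3, 5, 7, 9]
```

A loop in this model means re-running the whole read-transform-write cycle every iteration, which is why iterative algorithms were so slow on MapReduce.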


Apache Spark was invented to make this better. It is said to run up to 100x faster than MapReduce because it keeps intermediate results in memory instead of writing them to disk between stages.

Here’s how it works: RDD (Resilient Distributed Dataset)

Everything works with RDDs.
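The core idea behind an RDD is an immutable dataset split into partitions across machines, where each RDD also remembers the chain of transformations that produced it (its lineage), so a lost partition can be recomputed instead of restored from disk. A toy sketch of that idea in plain Python (this is an illustration only, not Spark's actual API or implementation):

```python
# Toy model of an RDD: partitioned data plus a record of how it was built.
class ToyRDD:
    def __init__(self, partitions, lineage="source"):
        self._partitions = [list(p) for p in partitions]  # data split across "workers"
        self.lineage = lineage                            # recipe to rebuild on failure

    def map(self, fn):
        # Transformations produce a NEW RDD and extend the lineage.
        new_parts = [[fn(x) for x in part] for part in self._partitions]
        return ToyRDD(new_parts, lineage=f"map({self.lineage})")

    def filter(self, pred):
        new_parts = [[x for x in part if pred(x)] for part in self._partitions]
        return ToyRDD(new_parts, lineage=f"filter({self.lineage})")

    def collect(self):
        # Actions gather results from all partitions back to the driver.
        return [x for part in self._partitions for x in part]

nums = ToyRDD([[1, 2, 3], [4, 5, 6]])     # dataset in 2 partitions
result = nums.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(result.collect())    # [4, 16, 36]
print(result.lineage)      # filter(map(source))
```

In real Spark the transformations are also lazy (nothing runs until an action like `collect()`), which lets the engine plan and pipeline the whole chain at once.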

$> hdfs namenode
$> source ./venv/bin/activate

(venv) $> python3

from pyspark.sql import SparkSession, SQLContext
from pyspark import SparkContext, SparkConf
spark = SparkSession.builder.getOrCreate()

Open htop

Open http://db.cse.nd.edu:4040/ (the Spark web UI for the running application)

df = spark.read.json("hdfs://db.cse.nd.edu:9000/reddit/RC_2011-08.bz2")
df.createOrReplaceTempView('comments')   # register the DataFrame so SQL can see it as "comments"
spark.sql('select count(*) from comments').collect()
spark.sql('select * from comments order by score desc limit 2').head()
spark.sql('select * from comments order by score desc').head(5)

Does Spark replace Hadoop?

No. Spark replaces the MapReduce execution engine, but it typically still uses HDFS (Hadoop's distributed file system) as the storage layer, as in the examples above.