Limitations of MapReduce:
Have to express everything in the map/reduce model (see the sketch after this list)
Not Reusable
Error prone
For complex jobs:
Multiple stages of map/reduce functions must be chained together
Everything gets written to disk between stages and at the end.
Loops are impossible in the model (simulating them by chaining jobs is impossibly slow, since every iteration pays the full disk round-trip)
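To make the model concrete, here is a minimal plain-Python sketch of the map/reduce pattern (an illustration, not Hadoop code): the only hooks you get are a map function that emits key/value pairs and a reduce function that folds the values collected for each key.

from collections import defaultdict

# map phase: emit (word, 1) for every word in a line
def mapper(line):
    for word in line.split():
        yield (word, 1)

# reduce phase: fold all values emitted for one key into a total
def reducer(word, counts):
    return (word, sum(counts))

def map_reduce(lines):
    groups = defaultdict(list)          # "shuffle": group mapper output by key
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return [reducer(k, v) for k, v in groups.items()]

print(map_reduce(["to be or not to be"]))
# [('to', 2), ('be', 2), ('or', 1), ('not', 1)]

In a real Hadoop job, the shuffle step (and the disk writes around it) is what the framework does between every map stage and reduce stage.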
Spark
Apache Spark was created to address these limitations; it is claimed to run up to 100x faster than MapReduce because it keeps intermediate data in memory instead of writing it to disk.
Here’s how it works: the RDD (Resilient Distributed Dataset)
Everything works with RDDs.
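As a small illustrative sketch (assuming a SparkSession named spark, as created in the demo below; the numbers are made up): an RDD is built from a source, transformed lazily, and can be cached in memory so that repeated passes, the loops MapReduce handles so badly, never touch disk.

rdd = spark.sparkContext.parallelize(range(1, 1001))            # RDD from a local collection
evens = rdd.map(lambda x: x * 2).filter(lambda x: x % 3 == 0)   # lazy transformations
evens.cache()                                # keep the computed partitions in memory
total = evens.reduce(lambda a, b: a + b)     # first action materializes and caches the RDD
n = evens.count()                            # second action reads from memory, not disk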
$> hdfs namenode                      # start the HDFS namenode
$> source ./venv/bin/activate         # enter the Python virtualenv
(venv) $> python3

from pyspark.sql import SparkSession, SQLContext
from pyspark import SparkConf, SparkContext

spark = SparkSession.builder.getOrCreate()   # create (or reuse) a SparkSession
Open htop to watch CPU and memory usage.
Open http://db.cse.nd.edu:4040/ (the Spark web UI) to watch the jobs run.
# read one month of Reddit comments (bz2-compressed JSON) from HDFS
df = spark.read.json("hdfs://db8.cse.nd.edu:9000/reddit/RC_2010-11.bz2")
df.createTempView("comments")   # expose the DataFrame to SQL as "comments"
df.printSchema()                # inspect the inferred schema
df.show()                       # preview the first rows
spark.sql('select count(*) from comments').collect()                     # total comment count
spark.sql('select * from comments order by score desc limit 2').head()   # highest-scoring comment
spark.sql('select * from comments order by score desc').head(5)          # top five by score
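The same queries can also be written against the DataFrame API directly instead of SQL; an equivalent sketch:

from pyspark.sql import functions as F

df.count()                                       # select count(*)
df.orderBy(F.desc('score')).limit(2).collect()   # two highest-scoring comments
df.orderBy(F.desc('score')).head(5)              # top five by score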
Does Spark replace Hadoop?
No. Hadoop (HDFS) remains the storage engine; Spark replaces the MapReduce compute layer on top of it.
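So a typical pipeline reads from HDFS, computes with Spark, and writes results back into HDFS. A sketch (assumes the comments schema includes a subreddit column, as the Reddit dumps do; the output path is hypothetical):

# aggregate with Spark...
counts = spark.sql('select subreddit, count(*) as n from comments group by subreddit')
# ...and persist the result back into HDFS (hypothetical path)
counts.write.parquet('hdfs://db8.cse.nd.edu:9000/reddit/subreddit_counts')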