What have I learnt from this awesome course on Spark?

From Memory

Here is my attempt to write 50 takeaways from the course:

  1. Emacs configuration : running from Emacs was the first thing I wanted to set up so as to quickly work through the various commands (setq python-shell-interpreter "C:/ProgramData/Anaconda2/envs/sparkl/Scripts/ipython")
  2. Spark makes it easy to write map-reduce jobs. The Spark engine takes care of generating the map-reduce stages and submits them to the storage manager and resource scheduler
  3. Spark has APIs in Java, Scala, Python and R and thus makes it easy for a programmer to get up and running
  4. RDD - Resilient Distributed Dataset - Main data structure of Spark Core
  5. Basic architecture (top to bottom)
    • Spark SQL + MLlib + GraphX + Spark Streaming
    • Spark Core
    • Distributed Storage + Resource Scheduler
  6. RDD traits : In-memory, distributed, fault-tolerant
  7. Pair RDDs hold data in the form of key-value pairs
  8. Transformations and Actions can be applied on RDDs
  9. Actions will realize the RDD whereas Transformations will not. This lazy evaluation makes Spark very powerful (see the word-count sketch after this list)
  10. Common transformations are
    • map
    • flatMap
    • filter
    • textFile (strictly, this creates an RDD from a file rather than transforming an existing one)
    • join
    • groupBy
  11. Common Actions are
    • collect
    • take
    • first
  12. One can use persist to cache the RDD on nodes
  13. One can use cache so that an RDD reused after an expensive shuffle is not recomputed, keeping the shuffle from becoming a bottleneck
  14. Install Spark using VirtualBox
  15. SparkDriver + DAG Scheduler + TaskSetManager are the three components that work together after a Spark job is submitted
  16. The Spark driver provides the SparkContext, through which the client communicates with the Spark cluster
  17. One can use any distributed storage and resource scheduler. If you use Hadoop, then the distributed storage is HDFS and resource scheduler is YARN
  18. If you write any reduce function, the definition should take two arguments, e.g. lambda x, y: x + y
  19. In order to work with Spark Streaming, you need to work with DStream objects
  20. Windowed DStream operations take in 4 arguments - two functions that define how the stats are updated for elements entering and leaving the window, and two other arguments that specify the window length and sliding interval (see the streaming sketch after this list)
  21. Spark Streaming is not truly event streaming - it is microbatching
  22. One can feed input from Kafka into Spark Streaming
  23. For true event streaming, one might have to use Storm or Flink to get events and then Spark can further process them
  24. There are more methods for DataFrames than RDDs as mentioned in the implementation of PR
  25. Spark has MLlib, which implements collaborative filtering, regression, clustering and other common algorithms
  26. Collaborative filtering via Alternating Least Squares is implemented in Spark (see the ALS sketch after this list)
  27. Spark came out of UC Berkeley - Databricks is the company that sells a commercial version of Spark
  28. Spark can be run in shell mode, local mode or cluster mode
  29. Most new developments appear first in the Java and Scala APIs; the Python API usually lags behind the most recent Spark features
  30. sort operations on RDDs (sortBy, sortByKey)
  31. Accumulate operation in the context of Spark
  32. You can specify functions at the node level and cluster level using the accumulate operation (see the aggregate sketch after this list)
  33. PageRank - a 10,000 ft overview
  34. Spark bundles the relevant libraries and sends them to each of the nodes that do data processing
  35. Spark can be run on a single machine too - with Hadoop inside a VirtualBox (Ubuntu) VM. Of course, it is not meant to be run on a single machine
  36. join operations create a massive shuffle of data across the cluster
  37. Spark SQL is a fantastic way to get the power of map-reduce and SQL all in one place.
  38. Spark SQL can be run on Spark DataFrames (sketched after this list)
  39. namedtuple can be used to create classes on the fly
  40. One can easily ingest massive JSON array data into HDFS with one line of code. This means that the entire MRN data can be put into HDFS
  41. RDDs are immutable
  42. One can create an RDD from a text file, a streaming system, in-memory data, or another RDD
  43. Lazy evaluation makes Spark very powerful
  44. MLlib has specific classes for specific algorithms, e.g. the Rating class to capture user ratings
  45. Spark gives the power of combining the transformation + action paradigm, the SQL paradigm, the DataFrame paradigm and distributed storage
  46. LDA can be done using MLlib
  47. Java API looks very painful
  48. Spark can be used to write data to HDFS/Cassandra/any other place such as local disk
  49. Spark is written in Scala
  50. One can use spark-submit commands to submit a Spark job
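
To make items 9-13 and 18 concrete, here is a minimal word-count sketch in PySpark. It assumes a pyspark shell where sc is already defined; the file name words.txt is a made-up stand-in.

    lines = sc.textFile("words.txt")                  # creates an RDD; nothing is read yet
    words = lines.flatMap(lambda line: line.split())  # transformation - still lazy
    pairs = words.map(lambda w: (w, 1))               # transformation - still lazy
    counts = pairs.reduceByKey(lambda x, y: x + y)    # the reduce function takes two arguments
    counts.persist()                                  # keep the computed RDD cached on the nodes
    print(counts.take(5))                             # action - only now does Spark run the job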
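
A minimal sketch of the windowed streaming idea from items 19-20, assuming an existing SparkContext sc and a socket text source on localhost:9999 (a hypothetical stand-in for whatever source the course used):

    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, 1)                     # micro-batches every 1 second (item 21)
    ssc.checkpoint("checkpoint")                      # required for windowed, stateful operations
    words = ssc.socketTextStream("localhost", 9999).flatMap(lambda l: l.split())
    pairs = words.map(lambda w: (w, 1))
    # Four arguments: add counts entering the window, subtract counts leaving it,
    # over a 30-second window that slides every 10 seconds.
    windowed = pairs.reduceByKeyAndWindow(lambda x, y: x + y,
                                          lambda x, y: x - y,
                                          30, 10)
    windowed.pprint()
    ssc.start()
    ssc.awaitTermination()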
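
Items 37-39 together, as a small sketch. It assumes a pyspark shell that exposes sc and a SQLContext named sqlContext (Spark 1.x style); the Person data is made up. On newer versions, spark.createDataFrame and createOrReplaceTempView would be the equivalents.

    from collections import namedtuple

    Person = namedtuple("Person", ["name", "age"])    # a class created on the fly
    people = sc.parallelize([Person("alice", 34), Person("bob", 29)])

    df = sqlContext.createDataFrame(people)           # DataFrame: an in-memory table with named columns
    df.registerTempTable("people")                    # make it visible to Spark SQL
    sqlContext.sql("SELECT name FROM people WHERE age > 30").show()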
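
A sketch of items 26 and 44: collaborative filtering with alternating least squares, using MLlib's Rating class. The (user, product, rating) triples are made up and the rank/iterations values are arbitrary.

    from pyspark.mllib.recommendation import ALS, Rating

    ratings = sc.parallelize([Rating(1, 10, 4.0), Rating(1, 20, 1.0),
                              Rating(2, 10, 5.0), Rating(2, 30, 3.0)])
    model = ALS.train(ratings, rank=10, iterations=5)  # factorize the user x product matrix
    print(model.predict(1, 30))                        # predicted rating for user 1, product 30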
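
For items 31-32, I am reading "accumulate" as the RDD aggregate operation, which takes one function applied within each partition (node level) and one that merges partition results (cluster level); that reading is my assumption. A small mean-without-collect sketch:

    nums = sc.parallelize([1, 2, 3, 4, 5], 2)          # two partitions
    total, count = nums.aggregate((0, 0),
                                  lambda acc, x: (acc[0] + x, acc[1] + 1),   # within a partition
                                  lambda a, b: (a[0] + b[0], a[1] + b[1]))   # across partitions
    print(total / float(count))                        # 3.0, computed without collect()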

From my notes

  1. For iterative computations, MLlib is super powerful since the data stays in memory between iterations
  2. DataFrame - in memory data table
  3. Row object in DataFrame
  4. The problem with Hadoop MapReduce is that all intermediate steps write their data to HDFS
  5. Broadcast variables - used to ship read-only data to each of the nodes (see the sketch after this list)
  6. Functions available with pair RDDs (illustrated after this list) are
    • keys
    • values
    • groupByKey
    • reduceByKey
    • combineByKey
  7. closures - Spark pulls all the relevant objects and classes needed to work on a job and distributes them to each node
  8. RDDs have lineage
  9. RDDs are fault tolerant
  10. No REPL in Hadoop MapReduce
  11. Hadoop MapReduce can only do batch processing
  12. Spark advantages
    • computations can be expressed in an intuitive way
    • interactive shell for Python
    • data kept in memory
    • can do stream processing
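
A sketch of note 5, assuming sc is available; the lookup table is made up. The broadcast variable is shipped to each node once rather than with every task.

    country_names = sc.broadcast({"IN": "India", "US": "United States"})
    codes = sc.parallelize(["IN", "US", "IN"])
    print(codes.map(lambda c: country_names.value[c]).collect())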
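
And a sketch of note 6, showing some of the extra functions available on pair RDDs (again assuming sc is available and made-up data):

    sales = sc.parallelize([("apples", 3), ("oranges", 2), ("apples", 4)])
    print(sales.keys().collect())                           # ['apples', 'oranges', 'apples']
    print(sales.values().collect())                         # [3, 2, 4]
    print(sales.reduceByKey(lambda x, y: x + y).collect())  # [('apples', 7), ('oranges', 2)] - order may vary
    print(sales.groupByKey().mapValues(list).collect())     # [('apples', [3, 4]), ('oranges', [2])] - order may vary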

Reflections

I think reflecting on the course learnings has helped me understand the content better, and I am now better equipped to go over Jose's course.