The following are the takeaways from the course:

  • What is the project all about: storing data into Hadoop (HDFS) and then using Spark for data cleansing
  • 5V’s
    • Volume
    • Variety
    • Velocity
    • Veracity
    • Value
  • 1024 TB = 1 Petabyte
  • 1024 PB = 1 Exabyte
  • Types of Data : Structured, Unstructured and Semi-Structured
  • Hadoop 1.0 - Job Tracker and Task Tracker; resource management was part of Map Reduce (the Job Tracker handled both resource management and job scheduling)
  • Hadoop 2.0 - YARN takes care of resource management and job scheduling
  • YARN sits on top of HDFS and Map Reduce runs on YARN
  • Spark can also run on YARN over HDFS (see the PySpark-on-YARN sketch at the end of these notes)
  • YARN runs Resource Manager and Node Manager processes
  • Pig and Hive run on Map Reduce, which in turn sits on YARN, which in turn sits on HDFS
  • Cost of a 3-node cluster: about $3 per day, roughly $100 per month
  • Hive is not a database - it stores metadata and points to data stored in HDFS
  • Storing data in HDFS
  • Spark - a unified analytics engine for large-scale data processing
  • Installing Spark on Colab
    • You need to install Java and Spark yourself every time you start a new Colab session (see the setup sketch at the end of these notes)
  • Learnt about the AWS Glue components - Crawlers, Jobs and Triggers (see the boto3 sketch at the end of these notes)
  • Learnt about AWS Athena (see the query sketch at the end of these notes)
  • Spent about 3 hrs replicating the tasks shown in the lecture
  • Learnt that there is an R and Python interface for Athena
  • Managed to get the data-munging job done via AWS Glue (see the job script sketch at the end of these notes)
  • User-defined functions (UDFs)
  • Joins (see the UDF and join sketch at the end of these notes)
  • Using AWS Lambda triggers - how to launch an AWS Glue job from a Lambda function (see the handler sketch at the end of these notes)
  • Simple use case, but I have learnt the entire workflow
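
PySpark-on-YARN sketch for the Hadoop/YARN takeaways above. It assumes a cluster where HADOOP_CONF_DIR/YARN_CONF_DIR already point at the cluster configuration; the app name, executor count and HDFS path are placeholders, not values from the course.

```python
from pyspark.sql import SparkSession

# Minimal sketch: ask YARN (instead of local mode) to manage the executors.
# Assumes HADOOP_CONF_DIR / YARN_CONF_DIR point at the cluster's config files.
spark = (
    SparkSession.builder
    .appName("takeaways-yarn-demo")            # illustrative app name
    .master("yarn")                            # YARN handles resource management
    .config("spark.executor.instances", "2")   # placeholder executor count
    .getOrCreate()
)

# Read a file already stored in HDFS (path is a placeholder).
df = spark.read.csv("hdfs:///data/sample.csv", header=True, inferSchema=True)
df.show(5)
```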
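
Colab setup sketch: the VM is reset between sessions, so Java and Spark have to be installed each time. The JDK package, JAVA_HOME path and version below are assumptions about a typical Colab image, not the exact steps from the lecture.

```python
# Run in a Colab cell: the VM is wiped between sessions, so this setup
# has to be repeated every time you start a new Spark environment.
!apt-get install -y -qq openjdk-8-jdk-headless   # JDK (skip if Java is already present)
!pip install -q pyspark                          # Spark plus the Python bindings

import os
# Path is an assumption for a Debian/Ubuntu-based Colab image.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("colab-demo").master("local[*]").getOrCreate()
spark.range(5).show()   # quick sanity check that the session works
```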
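
A boto3 sketch of the Glue components mentioned above: a crawler populates the Data Catalog, a job runs the ETL script, and triggers (or a direct start_job_run call, as here) kick jobs off. The crawler/job names and region are placeholders.

```python
import boto3

# Sketch only: crawler and job names are placeholders for resources
# created beforehand in the AWS Glue console.
glue = boto3.client("glue", region_name="us-east-1")

# Crawler: scans the data in S3 and writes table definitions to the Data Catalog.
glue.start_crawler(Name="my-raw-data-crawler")

# Job: runs the ETL/data-munging script; returns a run id we can poll.
run = glue.start_job_run(JobName="my-cleansing-job")
status = glue.get_job_run(JobName="my-cleansing-job", RunId=run["JobRunId"])
print(status["JobRun"]["JobRunState"])
```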
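
Athena query sketch with boto3: Athena runs SQL directly over files in S3, using the table definitions the Glue crawler created. The database, table and S3 result location are placeholders.

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Placeholders: the database/table come from the Glue crawler, and Athena
# needs an S3 location to write query results to.
response = athena.start_query_execution(
    QueryString="SELECT * FROM my_table LIMIT 10",
    QueryExecutionContext={"Database": "my_glue_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes, then fetch the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows[:3])
```

The R and Python interfaces mentioned above wrap this same API; on the Python side, for instance, pyathena exposes it through a regular DB-API connection and cursor.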
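
A trimmed sketch of what the Glue data-munging job script looks like: read the crawled table into a DynamicFrame, apply a mapping to rename/retype columns, and write the cleansed data back to S3 as Parquet. The database, table, column and bucket names are placeholders, not the ones used in the course.

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Boilerplate that AWS Glue generates for every job script.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the table the crawler created (database/table names are placeholders).
raw = glueContext.create_dynamic_frame.from_catalog(
    database="my_glue_database", table_name="raw_events"
)

# Simple munging: keep and rename a few columns, fixing their types.
cleaned = ApplyMapping.apply(
    frame=raw,
    mappings=[
        ("id", "string", "event_id", "string"),
        ("ts", "string", "event_time", "timestamp"),
        ("amount", "string", "amount", "double"),
    ],
)

# Write the cleansed data back to S3 as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://my-clean-bucket/events/"},
    format="parquet",
)
job.commit()
```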
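
UDF and join sketch in PySpark, using made-up toy data rather than the course dataset.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-join-demo").getOrCreate()

orders = spark.createDataFrame(
    [(1, "c1", 120.0), (2, "c2", 35.5), (3, "c1", 80.0)],
    ["order_id", "customer_id", "amount"],
)
customers = spark.createDataFrame(
    [("c1", "alice"), ("c2", "bob")], ["customer_id", "name"]
)

# A user-defined function: built-in functions are faster, but a UDF covers custom logic.
@udf(returnType=StringType())
def amount_band(amount):
    return "high" if amount >= 100 else "low"

enriched = (
    orders
    .withColumn("band", amount_band(col("amount")))
    .join(customers, on="customer_id", how="left")   # join on the shared key
)
enriched.show()
```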
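
Lambda handler sketch for launching a Glue job: when an event fires (e.g. a file landing in S3), the handler calls start_job_run. The job name, the argument name and the assumption of an S3 event payload are illustrative.

```python
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    """Start a Glue job when the Lambda is invoked (e.g. by an S3 PUT event)."""
    # Pull the uploaded object's key out of the S3 event, if present.
    records = event.get("Records", [])
    key = records[0]["s3"]["object"]["key"] if records else None

    run = glue.start_job_run(
        JobName="my-cleansing-job",                   # placeholder job name
        Arguments={"--input_key": key or "unknown"},  # passed through to the job script
    )
    return {"jobRunId": run["JobRunId"], "triggered_by": key}
```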