The following are the takeaways from the course:

  • What is the project all about: storing data into Hadoop (HDFS) and then using Spark for data cleansing
  • 5V’s
    • Volume
    • Variety
    • Velocity
    • Veracity
    • Value
  • 1024 TB = 1 Petabyte
  • 1024 PB = 1 Exabyte
  • Types of Data : Structured, Unstructured and Semi-Structured
  • Hadoop 1.0 - Job Tracker and Task Tracker; resource management was part of Map Reduce (the Job Tracker handled both resource management and job scheduling)
  • Hadoop 2.0 - YARN takes care of resource management and job scheduling
  • YARN sits on top of HDFS and Map Reduce runs on YARN
  • Spark can also run on YARN over HDFS (see the PySpark-on-YARN sketch at the end of these notes)
  • YARN runs Resource Manager and Node Manager processes
  • Pig and Hive run on Map Reduce, which in turn sits on YARN, which in turn sits on HDFS
  • Cost of a 3-node cluster: about $3 per day, roughly $100 per month
  • Hive is not a database - it stores metadata and points to data stored in HDFS
  • Storing data in HDFS
  • Spark - a unified analytics engine for large-scale data processing
  • Installing Spark on Colab
    • You need to install Java and Spark yourself every time you start a new Colab session (see the setup sketch at the end of these notes)
  • Learnt about the AWS Glue components - Crawlers, Jobs and Triggers (see the boto3 sketch at the end of these notes)
  • Learnt about AWS Athena (see the query sketch at the end of these notes)
  • Spent about 3 hrs replicating the tasks shown in the lecture
  • Learnt that there is an R and Python interface for Athena
  • Managed to get the data-munging job done via AWS Glue (see the job script sketch at the end of these notes)
  • User-defined functions (UDFs)
  • Joins (see the UDF and join sketch at the end of these notes)
  • Using AWS Lambda triggers - how to launch an AWS Glue job from a Lambda function (see the handler sketch at the end of these notes)
  • Simple use case, but I have learnt the entire workflow
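
PySpark-on-YARN sketch for the Hadoop/YARN takeaways above. It assumes a cluster where HADOOP_CONF_DIR/YARN_CONF_DIR already point at the cluster configuration; the app name, executor count and HDFS path are placeholders, not values from the course.

```python
from pyspark.sql import SparkSession

# Minimal sketch: ask YARN (instead of local mode) to manage the executors.
# Assumes HADOOP_CONF_DIR / YARN_CONF_DIR point at the cluster's config files.
spark = (
    SparkSession.builder
    .appName("takeaways-yarn-demo")            # illustrative app name
    .master("yarn")                            # YARN handles resource management
    .config("spark.executor.instances", "2")   # placeholder executor count
    .getOrCreate()
)

# Read a file already stored in HDFS (path is a placeholder).
df = spark.read.csv("hdfs:///data/sample.csv", header=True, inferSchema=True)
df.show(5)
```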
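
Colab setup sketch: the VM is reset between sessions, so Java and Spark have to be installed each time. The JDK package, JAVA_HOME path and version below are assumptions about a typical Colab image, not the exact steps from the lecture.

```python
# Run in a Colab cell: the VM is wiped between sessions, so this setup
# has to be repeated every time you start a new Spark environment.
!apt-get install -y -qq openjdk-8-jdk-headless   # JDK (skip if Java is already present)
!pip install -q pyspark                          # Spark plus the Python bindings

import os
# Path is an assumption for a Debian/Ubuntu-based Colab image.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("colab-demo").master("local[*]").getOrCreate()
spark.range(5).show()   # quick sanity check that the session works
```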
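
A boto3 sketch of the Glue components mentioned above: a crawler populates the Data Catalog, a job runs the ETL script, and triggers (or a direct start_job_run call, as here) kick jobs off. The crawler/job names and region are placeholders.

```python
import boto3

# Sketch only: crawler and job names are placeholders for resources
# created beforehand in the AWS Glue console.
glue = boto3.client("glue", region_name="us-east-1")

# Crawler: scans the data in S3 and writes table definitions to the Data Catalog.
glue.start_crawler(Name="my-raw-data-crawler")

# Job: runs the ETL/data-munging script; returns a run id we can poll.
run = glue.start_job_run(JobName="my-cleansing-job")
status = glue.get_job_run(JobName="my-cleansing-job", RunId=run["JobRunId"])
print(status["JobRun"]["JobRunState"])
```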
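
Athena query sketch with boto3: Athena runs SQL directly over files in S3, using the table definitions the Glue crawler created. The database, table and S3 result location are placeholders.

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Placeholders: the database/table come from the Glue crawler, and Athena
# needs an S3 location to write query results to.
response = athena.start_query_execution(
    QueryString="SELECT * FROM my_table LIMIT 10",
    QueryExecutionContext={"Database": "my_glue_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes, then fetch the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows[:3])
```

The R and Python interfaces mentioned above wrap this same API; on the Python side, for instance, pyathena exposes it through a regular DB-API connection and cursor.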
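
A trimmed sketch of what the Glue data-munging job script looks like: read the crawled table into a DynamicFrame, apply a mapping to rename/retype columns, and write the cleansed data back to S3 as Parquet. The database, table, column and bucket names are placeholders, not the ones used in the course.

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Boilerplate that AWS Glue generates for every job script.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the table the crawler created (database/table names are placeholders).
raw = glueContext.create_dynamic_frame.from_catalog(
    database="my_glue_database", table_name="raw_events"
)

# Simple munging: keep and rename a few columns, fixing their types.
cleaned = ApplyMapping.apply(
    frame=raw,
    mappings=[
        ("id", "string", "event_id", "string"),
        ("ts", "string", "event_time", "timestamp"),
        ("amount", "string", "amount", "double"),
    ],
)

# Write the cleansed data back to S3 as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://my-clean-bucket/events/"},
    format="parquet",
)
job.commit()
```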
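
UDF and join sketch in PySpark, using made-up toy data rather than the course dataset.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-join-demo").getOrCreate()

orders = spark.createDataFrame(
    [(1, "c1", 120.0), (2, "c2", 35.5), (3, "c1", 80.0)],
    ["order_id", "customer_id", "amount"],
)
customers = spark.createDataFrame(
    [("c1", "alice"), ("c2", "bob")], ["customer_id", "name"]
)

# A user-defined function: built-in functions are faster, but a UDF covers custom logic.
@udf(returnType=StringType())
def amount_band(amount):
    return "high" if amount >= 100 else "low"

enriched = (
    orders
    .withColumn("band", amount_band(col("amount")))
    .join(customers, on="customer_id", how="left")   # join on the shared key
)
enriched.show()
```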
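
Lambda handler sketch for launching a Glue job: when an event fires (e.g. a file landing in S3), the handler calls start_job_run. The job name, the argument name and the assumption of an S3 event payload are illustrative.

```python
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    """Start a Glue job when the Lambda is invoked (e.g. by an S3 PUT event)."""
    # Pull the uploaded object's key out of the S3 event, if present.
    records = event.get("Records", [])
    key = records[0]["s3"]["object"]["key"] if records else None

    run = glue.start_job_run(
        JobName="my-cleansing-job",                   # placeholder job name
        Arguments={"--input_key": key or "unknown"},  # passed through to the job script
    )
    return {"jobRunId": run["JobRunId"], "triggered_by": key}
```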