I have spent the last 2 days working on Apache Beam and here are my learnings:

Learnings from Dataflow for Dummies

  • If one were to build a hashtag auto-complete feature, one might have to create a few MapReduce jobs spread across a big cluster. This would involve creating the cluster, running the MapReduce jobs on it and then collecting the outputs
  • Given a problem to solve, it is somewhat tedious to convert it into MapReduce tasks
  • MapReduce analogy
    • Ingredients - Processed ingredients (Map) - Shuffle - Sandwich (Reduce)
  • If the reduce step is associative, it makes things so much simpler: the map plus a partial reduce (a combiner) can be done on each machine before the shuffle phase, saving the massive operational cost of shuffling a much larger set of key-value pairs (see the CombinePerKey sketch after this list)
  • If all you want is to run SQL on the data, there is no need to look beyond BigQuery, as it does everything
  • Evolution of technologies
    • 2003 - GFS
    • 2004 - MapReduce
    • 2006 - Bigtable
    • 2007 - Paxos
    • 2010 - Flume
    • 2011 - Dremel
    • 2012 - Spanner
    • 2013 - MillWheel
  • FlumeJava (2010) is the basis for Dataflow
  • MillWheel handles the low-level streaming and networking nitty-gritty
  • PCollection - Parallel Collection
  • The Dataflow runner optimizes the DAG
  • Dataflow sets up a cluster, does the work and then tears the cluster down
  • Flatten - synonymous with dumping the elements of several PCollections into one bag
  • Apache Beam has the potential to do for the open-source world what MapReduce did inside Google
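To make the combiner idea above concrete, here is a minimal sketch using the Beam Python SDK: because `sum` is associative, `CombinePerKey` lets the runner pre-combine values on each worker before the shuffle. The hashtag data is made up for illustration.

```python
import apache_beam as beam

# Count occurrences per hashtag. Since sum is associative, the runner can
# pre-combine the 1s on each worker before shuffling, which is exactly the
# "combine before the shuffle" optimisation described above.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create([("#beam", 1), ("#flink", 1), ("#beam", 1)])
        | "CountPerTag" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```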

Active Recall from JGarg course

  1. Beam is a way to create framework-agnostic data pipelines that can be written in the language of your choice, i.e. Java, Python or Go
  2. This Beam code can then be run on any runner, such as Dataflow, Flink, Spark, etc.
  3. The main data abstractions are PCollections and PTransforms.
  4. PTransforms form the nodes
  5. PCollections form the edges
  6. Beam has a set of I/O connectors. The Python SDK can read from
    • Text
    • Avro
    • MongoDB
    • Pub/Sub
    • Any file in a Google Cloud Storage bucket
  7. Python SDK can write the output to
    • File
    • BigQuery
    • Avro
    • Parquet
  8. ParDo is akin to the map step
  9. CombinePerKey is akin to the reduce step
  10. There are derivative transforms such as Filter, Map and FlatMap that can in turn be used to write the code more efficiently
  11. You can put custom data types in a PCollection
  12. Comments (transform labels) can be put in place so that they appear as named steps in the DAG
  13. These labels should be unique - they cannot be duplicated within a pipeline
  14. A `with` block can be used to build the pipeline so that it runs automatically when the block exits
  15. All the runner settings can be customized via the PipelineOptions class (see the PipelineOptions sketch after this list)
  16. Using Beam SDK workers, the respective runners build a DAG and execute it
  17. Beam can be used to specify streaming data pipelines
  18. The concept of watermarking lets the runner know when it can close a window and kick off its computation
  19. Windowing - Fixed, Sliding, Session, Global
  20. Triggers - Early, On-time and Late triggers (see the streaming sketch after this list)
  21. Accumulation mode - Discarding or Accumulating
  22. The basic idea of the interplay between processing time and event time
  23. Apache Beam is a portable and unified framework for writing data pipelines
  24. Apache Beam is based on FlumeJava and MillWheel
  25. Google has taken something very proprietary and worked with the Apache Software Foundation to make Beam an open-source framework
  26. Beam + Flink will be increasingly adopted across enterprises
  27. The Beam capability matrix addresses four questions
    1. What results are being computed?
    2. Where in event time are they being computed?
    3. When in processing time are they being materialized?
    4. How do refinements of results relate?
  28. Dataflow/Flink are DAG optimizers
  29. There are cases when one can give side inputs to PTransforms
  30. ParDo takes a class that inherits from the beam.DoFn class
  31. A custom PTransform inherits from beam.PTransform - the part I could not recollect
  32. In a DoFn, one has to override the process function
  33. ParDo can be used to Filter, Map and many other things - it is a general-purpose parallel operation (see the ParDo sketch after this list)
  34. One transform can give rise to multiple PCollections using the tagged outputs feature
  35. PCollections can be joined via the join feature (CoGroupByKey in the Python SDK; see the join sketch after this list)
  36. Beam Code can contain branching
  37. For debugging purposes, one can run Beam in local mode (the DirectRunner)
  38. Session window - the window stays open until a gap of inactivity
  39. A sliding window is specified by a window size and a sliding period
  40. Specifying the Global window essentially means batch processing
  41. The default trigger is the on-time trigger, which fires when the watermark passes the end of the window
  42. Bounded and Unbounded datasets is a better way to think than batch and stream
  43. Apache Flink - Stream is at the heart of framework
  44. Java SDK has all the joins implemented - Python has only the basic join
  45. One can send messages to a topic, let a Beam program read the messages from a subscription, and publish messages to another topic. This whole pipeline is an interesting way to learn Pub/Sub and Beam
  46. Python 3 support for Beam is still a work in progress
  47. One needs to provide custom coder classes for serializing and deserializing custom data types in a PCollection
  48. PCollection is immutable
  49. A PCollection can be spread across multiple machines
  50. PTransform functions provide an abstraction over MapReduce tasks
  51. It is far easier to write a program in Beam than to actually think through the MapReduce tasks
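A minimal sketch of items 14 and 15: building the pipeline inside a `with` block so it runs automatically on exit, and customising the runner through `PipelineOptions`. The runner value here is just the local DirectRunner; for Dataflow one would also pass project, region and temp_location.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# All runner settings go through PipelineOptions; DirectRunner keeps this local.
options = PipelineOptions(runner="DirectRunner")

# Exiting the `with` block calls run() on the pipeline and waits for it to finish.
with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | beam.Create(["hello", "beam"])
        | beam.Map(print)
    )
```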
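A ParDo sketch covering items 30-34: a DoFn that overrides `process` and routes elements to a main output plus a tagged output. The tag names and the length-based split are my own illustration, not from the course.

```python
import apache_beam as beam
from apache_beam import pvalue

class SplitByLength(beam.DoFn):
    """Send short words to a tagged output, everything else to the main output."""
    def process(self, word):
        if len(word) < 4:
            yield pvalue.TaggedOutput("short", word)   # tagged (additional) output
        else:
            yield word                                 # main output

with beam.Pipeline() as pipeline:
    results = (
        pipeline
        | beam.Create(["go", "java", "py", "python"])
        | beam.ParDo(SplitByLength()).with_outputs("short", main="long")
    )
    results.long | "PrintLong" >> beam.Map(print)
    results.short | "PrintShort" >> beam.Map(lambda w: print("short:", w))
```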
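A join sketch for item 35: in the Python SDK the basic join is `CoGroupByKey`, which groups two (or more) keyed PCollections by key. The names and data are placeholders.

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    emails = pipeline | "Emails" >> beam.Create([("anna", "anna@x.com"), ("bob", "bob@y.com")])
    phones = pipeline | "Phones" >> beam.Create([("anna", "555-1234")])

    (
        {"email": emails, "phone": phones}
        | beam.CoGroupByKey()   # -> ("anna", {"email": [...], "phone": [...]})
        | beam.Map(print)
    )
```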
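A streaming sketch tying together items 17-21 and 45: read from a Pub/Sub subscription, apply fixed one-minute windows with early (speculative) firings and discarding accumulation, count per key, and publish to another topic. The project, subscription and topic names are placeholders, and this needs a real GCP setup and a streaming-capable runner to actually run.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms import trigger, window

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True   # unbounded / streaming mode

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | beam.io.ReadFromPubSub(subscription="projects/my-project/subscriptions/my-sub")
        | beam.Map(lambda msg: (msg.decode("utf-8"), 1))
        | beam.WindowInto(
            window.FixedWindows(60),                       # 1-minute fixed windows
            trigger=trigger.AfterWatermark(                # on-time firing at the watermark
                early=trigger.AfterProcessingTime(10)),    # plus early, speculative firings
            accumulation_mode=trigger.AccumulationMode.DISCARDING)
        | beam.CombinePerKey(sum)
        | beam.Map(lambda kv: f"{kv[0]}: {kv[1]}".encode("utf-8"))
        | beam.io.WriteToPubSub(topic="projects/my-project/topics/my-output")
    )
```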

What have I scribbled in the notes?

  1. All elements of PCollection should have the same type
  2. Beam has functions to assign a timestamp to each element
  3. Pub/Sub by default assigns a timestamp to every message
  4. Combine takes a class (a beam.CombineFn) that overrides four functions - create_accumulator, add_input, merge_accumulators and extract_output (see the CombineFn sketch after this list)
  5. Four questions
    • What results are calculated?
    • Where in event time are the results calculated?
    • When in processing time are the results materialized?
    • How do refinements of results relate?
  6. Early (speculative) firings
  7. Triggers prompt Beam to emit data
  8. One can create composite triggers also
  9. Flink has a stream-first architecture
  10. Flink is much better than Storm and Spark
  11. ParDo is a generic function that can be used to filter a dataset, do formatting, do type conversions, extract parts of records, and do computations on each element
  12. The FlumeJava paper is available in the public domain
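A small sketch of scribble 4, which really describes `beam.CombineFn` rather than ParDo: a hand-rolled mean that overrides the four functions. The MeanFn name and the data are my own.

```python
import apache_beam as beam

class MeanFn(beam.CombineFn):
    def create_accumulator(self):
        return (0.0, 0)                        # (running sum, count)

    def add_input(self, accumulator, element):
        total, count = accumulator
        return total + element, count + 1

    def merge_accumulators(self, accumulators):
        totals, counts = zip(*accumulators)    # merge partial results from workers
        return sum(totals), sum(counts)

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float("nan")

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create([3.0, 4.0, 5.0])
        | beam.CombineGlobally(MeanFn())
        | beam.Map(print)                      # prints 4.0
    )
```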

Takeaway

I have relearnt a ton by immersing myself in Apache Beam from [2019-12-30 Mon] to [2020-01-01 Wed]