It was on [2019-12-16 Mon] that I spent three hours doing Beam Katas. Over the previous three days, [2019-12-13 Fri] - [2019-12-15 Sun], I had immersed myself in Beam, trying to understand it from JGarg’s course.

In this post, I will spend the next 15 minutes doing active recall of all the stuff I remember:

  1. Beam is a unified model for defining data processing pipelines that can be used to specify integration logic
  2. You can write the logic in Java, Scala or Python and execute it on a runner of your choice, such as Flink, Spark or Dataflow
  3. There is a set of functions, very similar to map and reduce, that can be used to create data pipelines (see the first sketch after this list)
  4. PCollections
  5. Apache, from its humble Hadoop origins, has now grown into a massive ecosystem with YARN, MapReduce, Spark, ZooKeeper, HBase, Cassandra, Oozie, Flume, Pig, Hive, Flink, Storm, Kafka and similar projects
  6. It is interesting to draw parallels to what is available on GCP. For each component in the Hadoop ecosystem, you will find a parallel component in the GCP world and in the AWS world.
  7. A solid understanding of the Hadoop ecosystem will help one understand the various components of the AWS and GCP ecosystems
  8. Some of the components in AWS and GCP are managed services built around Hadoop ecosystem elements
  9. The strange syntax used to define a Beam pipeline - the pipe operator in the Python SDK (see the first sketch after this list)
  10. The PipelineOptions class used to instantiate a Beam pipeline
  11. One can run a Beam pipeline locally on a laptop (via the DirectRunner)
  12. A Beam pipeline is essentially a DAG and needs a runner to execute the operations
  13. Spark vs Flink - the latter is a true stream processing engine, whereas Spark Streaming can be thought of as a batch processing engine at heart that also does stream processing
  14. There are methods that you need to override to come up with custom transformations - process on a DoFn and expand on a PTransform (see the second sketch after this list)
  15. There are also PCollection and PTransform objects
  16. There are DoFn objects - what do they do? I do not recollect at all
  17. Almost everything can be done with DoFn objects
  18. There are lightweight transforms such as FlatMap that can do similar stuff to what DoFn objects can do
  19. Reading from a file and writing to a file are also PTransforms
  20. There are some join operations that can be done (see the third sketch after this list)
  21. There is no way to keep track of column metadata. Unlike Spark, which provides DataFrames, Beam requires the user to manage the structure of the data manually
  22. WordCount is the “hello world” example in Beam
  23. You need to run Beam code on some engine - I think it can be run on Spark!
  24. There are concepts around windowing that can be used to build data pipelines over real-time data (also shown in the third sketch)
  25. Fantastic podcast - Frances Perry on Apache Beam and real-time data processing

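Here is a first, minimal sketch, assuming the Python SDK, that ties several of these items together: the PipelineOptions class, the pipe syntax, reading and writing files as PTransforms, the lightweight Map/FlatMap transforms, and local execution on the DirectRunner. The file paths are placeholders.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# DirectRunner executes the pipeline DAG locally on a laptop
options = PipelineOptions(flags=[], runner='DirectRunner')

with beam.Pipeline(options=options) as p:
    (
        p
        | 'Read' >> beam.io.ReadFromText('input.txt')         # reading is a PTransform
        | 'Split' >> beam.FlatMap(lambda line: line.split())  # lightweight map-like transform
        | 'Pair' >> beam.Map(lambda word: (word, 1))
        | 'Count' >> beam.CombinePerKey(sum)                  # reduce-like transform
        | 'Format' >> beam.Map(lambda kv: '{}: {}'.format(kv[0], kv[1]))
        | 'Write' >> beam.io.WriteToText('counts')            # writing is a PTransform
    )
```
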
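A second sketch, again assuming the Python SDK, of the custom-transformation hooks: a DoFn overrides process() and is applied with ParDo, while a composite PTransform overrides expand(). The class names here are illustrative.

```python
import apache_beam as beam

# Overriding process() on a DoFn gives full control over each element;
# the method yields zero or more outputs per input
class ExtractWordsFn(beam.DoFn):
    def process(self, element):
        for word in element.split():
            yield word

# Overriding expand() on a PTransform builds a composite transform,
# similar to the CountWords composite inside Beam's WordCount example
class CountWords(beam.PTransform):
    def expand(self, pcoll):
        return (
            pcoll
            | 'Extract' >> beam.ParDo(ExtractWordsFn())
            | 'Pair' >> beam.Map(lambda w: (w, 1))
            | 'Count' >> beam.CombinePerKey(sum)
        )
```
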
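And a third sketch of the join and windowing ideas, assuming the Python SDK: CoGroupByKey joins keyed PCollections, and WindowInto with FixedWindows buckets timestamped elements. The data below is made up for illustration.

```python
import apache_beam as beam
from apache_beam import window

with beam.Pipeline() as p:
    emails = p | 'Emails' >> beam.Create([('alice', 'a@x.com'), ('bob', 'b@x.com')])
    phones = p | 'Phones' >> beam.Create([('alice', '111'), ('bob', '222')])

    # join on the common key; each result looks like
    # ('alice', {'emails': ['a@x.com'], 'phones': ['111']})
    joined = {'emails': emails, 'phones': phones} | beam.CoGroupByKey()

    # windowing: attach event timestamps, then group into 60-second fixed windows
    windowed = (
        p
        | 'Events' >> beam.Create([('click', 10.0), ('click', 20.0), ('click', 70.0)])
        | 'Stamp' >> beam.Map(lambda e: window.TimestampedValue(e, e[1]))
        | 'Window' >> beam.WindowInto(window.FixedWindows(60))
    )
```
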
Sadly, this is all I remember about Beam from the three-day immersion I had two weeks ago. My next step would be to go over Beam again.