Spark Functions

2021-03-18

SparkContext.textFile()

Whether you run in local (driver) mode or use ./bin/spark-shell to submit jobs to executors, there are several rules you have to follow.

  1. The file path must exist on both the driver node and every executor node.
  2. Based on rule 1, if the file at that path holds different content on the driver and on the executors, the result of textFile depends on both copies: the driver's copy is used to compute the input splits (its size and line layout), while the actual lines are read from each executor's copy.
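As a minimal sketch of rule 1 (assuming a running spark-shell, where `sc` is the implicit SparkContext, and a hypothetical path `/data/input.txt` that exists on every node):

```scala
// The same file must be present at this path on the driver AND on all
// executors; otherwise tasks fail with FileNotFoundException at read time.
val lines = sc.textFile("/data/input.txt")

// count() forces the read on the executors, so mismatched copies surface here.
println(lines.count())
```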

Dataset transformations

Reference: https://spark.apache.org/docs/3.1.1/rdd-programming-guide.html#transformations

Dataset.createOrReplaceTempView() / Dataset.registerTempTable()

This method creates a lazy view in Spark; the view itself is not cached across the Spark cluster.

If you want to cache it, call spark.table("viewName").cache() explicitly (where spark is the SparkSession).
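A short sketch of the pattern, assuming `spark` is the active SparkSession (as in spark-shell), `people` is an existing Dataset, and "people_view" is a hypothetical view name:

```scala
// Register a lazy temp view; nothing is computed or cached yet.
people.createOrReplaceTempView("people_view")

// Cache explicitly through the session; marks the view's data for caching
// (materialized on the first action that touches it).
spark.table("people_view").cache()

// Subsequent queries against the view can now hit the cache.
val adults = spark.sql("SELECT * FROM people_view WHERE age >= 18")
```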

Dataset.join(anotherDataset, …)

The most general overload of this method takes three parameters: the Dataset to join with, a join expression (or a sequence of column names), and the join type (e.g. "inner", "left_outer").
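A sketch of the three-argument overload, assuming two hypothetical Datasets `orders` and `customers` that share a "customer_id" column:

```scala
val joined = orders.join(
  customers,                                           // 1. the other Dataset
  orders("customer_id") === customers("customer_id"),  // 2. join expression
  "left_outer"                                         // 3. join type
)
```

Note that this overload returns a flattened DataFrame (Dataset[Row]), losing the compile-time types of both sides.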

Difference from ds.joinWith

ds.joinWith belongs to the type-preserving joins: instead of flattening both sides into a single Dataset[Row] (a DataFrame), it returns a Dataset[(T, U)] of typed pairs, so the original object types of both sides are preserved.
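Continuing the hypothetical `orders`/`customers` example, and assuming case classes Order and Customer backing those Datasets, the type-preserving variant looks like:

```scala
// joinWith keeps each side as its own typed object instead of merging
// the columns into one Row.
val pairs: Dataset[(Order, Customer)] = orders.joinWith(
  customers,
  orders("customer_id") === customers("customer_id"),
  "inner"
)

// Each element is a tuple, so both sides stay strongly typed.
pairs.map { case (order, customer) => (customer.name, order.amount) }
```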

Dataset.coalesce() vs Dataset.repartition()

Before discussing the difference between them, note that repartition is an expensive operation: it always performs a full shuffle of the data. coalesce can reduce the number of partitions without a full shuffle by merging existing partitions on the same executor, which minimizes data movement among the executors.
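The contrast in a short sketch, assuming `ds` is an existing Dataset with more partitions than needed:

```scala
// Narrow transformation: merges existing partitions in place,
// avoiding a full shuffle. Only valid for DECREASING the count.
val fewer = ds.coalesce(10)

// Wide transformation: full shuffle across executors; can increase
// or decrease the partition count, and rebalances data evenly.
val rebalanced = ds.repartition(50)

println(fewer.rdd.getNumPartitions)       // 10
println(rebalanced.rdd.getNumPartitions)  // 50
```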

Stack Overflow answer: https://stackoverflow.com/questions/31610971/spark-repartition-vs-coalesce