Apache Spark Interview Questions



41.       What is the coalesce transformation?

Ans: The coalesce transformation changes the number of partitions of an RDD. Whether it triggers a shuffle depends on its second parameter, the boolean shuffle flag (defaults to false). Without a shuffle, coalesce can only reduce the number of partitions.
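
The shuffle-free case can be pictured with a small plain-Python model. This is not Spark code: the coalesce_no_shuffle helper and its contiguous grouping are illustrative assumptions, and Spark's real grouping also takes data locality into account. The point is only that existing partitions are merged whole.

```python
def coalesce_no_shuffle(partitions, num_partitions):
    """Model of coalesce(numPartitions, shuffle=False) on a list of partitions.

    Whole parent partitions are merged into fewer output partitions and no
    record is re-hashed. Asking for more partitions than exist is a no-op.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be positive")
    current = len(partitions)
    if num_partitions >= current:
        return [list(p) for p in partitions]
    merged = [[] for _ in range(num_partitions)]
    for index, part in enumerate(partitions):
        # Map parent partition `index` to a contiguous output group.
        target = index * num_partitions // current
        merged[target].extend(part)
    return merged


parts = [[1, 2], [3], [4, 5], [6]]
fewer = coalesce_no_shuffle(parts, 2)  # four partitions merged into two
same = coalesce_no_shuffle(parts, 8)   # cannot grow without a shuffle
```

In this model, requesting 8 partitions leaves the original 4 in place, which mirrors why increasing the partition count needs shuffle set to true.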


42.       What is the difference between the cache() and persist() methods of an RDD?

Ans: RDDs can be cached (using RDD’s cache() operation) or persisted (using RDD’s persist(newLevel: StorageLevel) operation). The cache() operation is a synonym for persist() with the default storage level, MEMORY_ONLY. Use persist() when you need a different storage level.


43.       Your RDD storage level is defined as MEMORY_ONLY_2. What does _2 mean?

Ans: The _2 suffix in the name means 2 replicas: each partition is replicated on two cluster nodes.


44.       What is Shuffling?

Ans: Shuffling is the process of repartitioning (redistributing) data across partitions. It may move data across JVMs, or even across the network, when the data is redistributed among executors.

Avoid shuffling where you can. Think about ways to leverage existing partitions, and use partial aggregation to reduce the amount of data transferred.


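45.       Does shuffling change the number of partitions?

Ans: No. By default, shuffling does not change the number of partitions, only their content. The count changes only when you ask for it, for example by passing an explicit number of partitions or a partitioner to the shuffle operation.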
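
A plain-Python model of a hash-partitioned shuffle makes this concrete. It is not Spark code: the hash_shuffle helper, the integer keys, and the key-modulo partitioning rule are assumptions chosen for illustration.

```python
def hash_shuffle(partitions, num_partitions=None):
    """Redistribute (key, value) records by key across partitions.

    When num_partitions is omitted, the output keeps the input's partition
    count, so only the contents of the partitions change.
    """
    if num_partitions is None:
        num_partitions = len(partitions)
    shuffled = [[] for _ in range(num_partitions)]
    for part in partitions:
        for key, value in part:
            # Integer keys are assumed, so key % n picks the target partition.
            shuffled[key % num_partitions].append((key, value))
    return shuffled


before = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")], [(5, "e"), (6, "f")]]
after = hash_shuffle(before)
```

Here before and after both hold three partitions, but every record in after sits with the other records whose key maps to the same partition.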


46.       What is the difference between groupByKey and reduceByKey?

Ans: Both operate on the values of each key, but they shuffle different amounts of data. Prefer reduceByKey or combineByKey over groupByKey.

groupByKey shuffles all the data, which is slow.

reduceByKey shuffles only the results of sub-aggregations in each partition of the data.
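
The difference in shuffle volume can be sketched with a plain-Python model. This is not Spark code: the helper names and the sample partitions are made up for illustration. It shows which records would cross the shuffle boundary in each case.

```python
def shuffled_records_group_by_key(partitions):
    """groupByKey sends every (key, value) record through the shuffle."""
    return sum(len(part) for part in partitions)


def shuffled_records_reduce_by_key(partitions, func):
    """reduceByKey first combines values per key inside each partition,
    so only one record per distinct key per partition is shuffled."""
    shuffled = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        shuffled.extend(local.items())
    return shuffled


def merge_after_shuffle(records, func):
    """Final merge of the partial results on the receiving side."""
    result = {}
    for key, value in records:
        result[key] = func(result[key], value) if key in result else value
    return result


partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1)],
]


def add(x, y):
    return x + y


group_count = shuffled_records_group_by_key(partitions)
partials = shuffled_records_reduce_by_key(partitions, add)
totals = merge_after_shuffle(partials, add)
```

For this sample, the groupByKey model moves all 7 records, while the reduceByKey model moves only 4 partial sums and still arrives at the same per-key totals.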


47.       When you call the join operation on two pair RDDs, e.g. (K, V) and (K, W), what is the result?

Ans: When called on datasets of type (K, V) and (K, W), join returns a dataset of (K, (V, W)) pairs containing all pairs of elements for each key [68]. It is an inner join, so keys that appear in only one of the two datasets are dropped.
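
The shape of the result can be checked with a plain-Python model of the join semantics. This is not Spark code: join_pairs and the sample data are hypothetical names used only for illustration.

```python
from collections import defaultdict


def join_pairs(left, right):
    """Model of joining (K, V) pairs with (K, W) pairs into (K, (V, W))."""
    right_by_key = defaultdict(list)
    for key, w in right:
        right_by_key[key].append(w)
    # Emit every (V, W) combination for keys present on both sides.
    return [
        (key, (v, w))
        for key, v in left
        for w in right_by_key.get(key, [])
    ]


left = [("a", 1), ("a", 2), ("b", 3)]
right = [("a", "x"), ("a", "y"), ("c", "z")]
joined = join_pairs(left, right)
```

Key "a" has two values on each side, so it contributes four pairs. Keys "b" and "c" each appear on one side only and contribute nothing.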


48.       What is checkpointing?

Ans: Checkpointing is the process of truncating an RDD's lineage graph by saving the RDD to a reliable distributed file system (such as HDFS) or to a local file system. Reliable RDD checkpointing saves the actual intermediate RDD data to a reliable distributed file system.


You mark an RDD for checkpointing by calling RDD.checkpoint(). The RDD will be saved to a file inside the checkpoint directory (set with SparkContext.setCheckpointDir()), and all references to its parent RDDs will be removed. checkpoint() has to be called before any job has been executed on this RDD.


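49.       What is meant by dependencies in an RDD lineage graph?

Ans: A dependency is the connection between an RDD and its parent RDD(s) that results from applying a transformation. Dependencies are either narrow or wide (shuffle) dependencies.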


50.       Which script do you use to launch a Spark application? Is it spark-shell?

Ans: No. You use the spark-submit script to launch a Spark application, i.e. to submit the application to a Spark deployment environment. spark-shell is the interactive shell for exploratory work.
