41. What is the coalesce transformation?
Ans: The coalesce transformation is used to change the number of partitions. It can trigger RDD shuffling depending on the second boolean shuffle parameter (which defaults to false).
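A minimal sketch in the Scala shell, assuming sc is an active SparkContext and the partition counts are arbitrary:

    val rdd = sc.parallelize(1 to 1000, 8)
    val narrowed = rdd.coalesce(2)                    // no shuffle: partitions are merged locally
    val widened  = rdd.coalesce(16, shuffle = true)   // shuffle = true is needed to grow the partition count
    println(narrowed.getNumPartitions)                // 2
    println(widened.getNumPartitions)                 // 16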
42. What is the difference between the cache() and persist() methods of an RDD?
Ans: RDDs can be cached (using RDD's cache() operation) or persisted (using RDD's persist(newLevel: StorageLevel) operation). The cache() operation is a synonym of persist() that uses the default storage level MEMORY_ONLY.
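A short sketch, assuming sc is an active SparkContext and hdfs:///tmp/logs is a made-up input path:

    import org.apache.spark.storage.StorageLevel

    val lines = sc.textFile("hdfs:///tmp/logs")
    lines.cache()                                   // same as lines.persist(StorageLevel.MEMORY_ONLY)

    val words = lines.flatMap(_.split(" "))
    words.persist(StorageLevel.MEMORY_AND_DISK)     // persist(newLevel) lets you pick any storage level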
43. You have an RDD storage level defined as MEMORY_ONLY_2. What does the _2 mean?
Ans: The _2 in the name denotes 2 replicas, i.e. each persisted partition is stored on two cluster nodes.
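For example (sc is assumed to be an active SparkContext):

    import org.apache.spark.storage.StorageLevel

    val nums = sc.parallelize(1 to 1000)
    nums.persist(StorageLevel.MEMORY_ONLY_2)   // each cached partition is replicated on two executors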
44. What is shuffling?
Ans: Shuffling is the process of repartitioning (redistributing) data across partitions; it may move data across JVMs or even across the network when it is redistributed among executors.
Avoid shuffling at all cost. Think about ways to leverage existing partitions and use partial aggregation to reduce data transfer.
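One way to leverage existing partitions is to co-partition pair RDDs before joining them; the data below is invented for illustration and sc is an active SparkContext:

    import org.apache.spark.HashPartitioner

    val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
    val orders = sc.parallelize(Seq((1, 9.99), (1, 2.25), (2, 4.50)))

    val partitioner = new HashPartitioner(4)
    val usersByKey  = users.partitionBy(partitioner).persist()   // shuffle once, keep the layout
    val ordersByKey = orders.partitionBy(partitioner).persist()

    // Both sides now share the same partitioner, so the join itself
    // does not have to move co-located keys across the network again.
    val joined = usersByKey.join(ordersByKey)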
45. Does shuffling change the number of partitions?
Ans: No. By default, shuffling doesn't change the number of partitions, only their content.
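A quick check in the shell (the partition counts are just an example):

    val pairs = sc.parallelize(1 to 100, 8).map(n => (n % 10, n))
    pairs.groupByKey().getNumPartitions     // typically 8, inherited from the parent (unless spark.default.parallelism overrides it)
    pairs.groupByKey(4).getNumPartitions    // 4, changed only because we asked for it explicitly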
46. What is the difference between groupByKey and reduceByKey?
Ans: Avoid groupByKey and use reduceByKey or combineByKey instead.
groupByKey shuffles all the data, which is slow.
reduceByKey shuffles only the results of sub-aggregations in each partition of the data.
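A word-count sketch that contrasts the two (the input path is hypothetical):

    val words = sc.textFile("hdfs:///tmp/words").flatMap(_.split(" ")).map(w => (w, 1))

    // groupByKey ships every single (word, 1) record across the network, then sums
    val slowCounts = words.groupByKey().mapValues(_.sum)

    // reduceByKey sums inside each partition first and shuffles only the partial sums
    val fastCounts = words.reduceByKey(_ + _)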
47. When you call the join operation on two pair RDDs, e.g. (K, V) and (K, W), what is the result?
Ans: When called on datasets of type (K, V) and (K, W), join returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key [68].
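For example, with invented data:

    val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))       // RDD[(String, Int)]    -- (K, V)
    val right = sc.parallelize(Seq(("a", "x"), ("a", "y")))   // RDD[(String, String)] -- (K, W)

    left.join(right).collect()
    // Array(("a", (1, "x")), ("a", (1, "y")))  -- (K, (V, W)); "b" has no match and is dropped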
48. What is checkpointing?
Ans: Checkpointing is the process of truncating an RDD's lineage graph and saving the RDD to a reliable distributed (e.g. HDFS) or local file system. RDD checkpointing saves the actual intermediate RDD data to a reliable distributed file system.
You mark an RDD for checkpointing by calling RDD.checkpoint(). The RDD will be saved to a file inside the checkpoint directory and all references to its parent RDDs will be removed. This function has to be called before any job has been executed on this RDD.
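A minimal sketch, assuming sc is the SparkContext and the HDFS paths are placeholders:

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")     // a reliable directory for cluster deployments

    val counts = sc.textFile("hdfs:///tmp/words")
      .flatMap(_.split(" "))
      .map(w => (w, 1))
      .reduceByKey(_ + _)

    counts.checkpoint()             // mark before running any job on this RDD
    counts.count()                  // the action materializes the RDD and writes the checkpoint
    println(counts.toDebugString)   // lineage is now truncated at the checkpointed RDD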
49. What do you mean by dependencies in the RDD lineage graph?
Ans: A dependency is a connection between RDDs that results from applying a transformation: the new RDD depends on its parent RDD(s), and the dependency can be narrow or wide (shuffle).
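You can inspect them directly in the shell, e.g.:

    val nums  = sc.parallelize(1 to 10)
    val pairs = nums.map(n => (n % 2, n))
    pairs.dependencies                      // narrow dependency (OneToOneDependency) from map
    pairs.reduceByKey(_ + _).dependencies   // wide dependency (ShuffleDependency) from reduceByKey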
50. Which script do you use to launch a Spark application: spark-shell or spark-submit?
Ans: You use the spark-submit script to launch a Spark application, i.e. to submit the application to a Spark deployment environment.
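A typical invocation looks like this (the class name, master URL, memory setting and jar are placeholders):

    ./bin/spark-submit \
      --class com.example.MyApp \
      --master spark://master-host:7077 \
      --executor-memory 2G \
      my-app.jar arg1 arg2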