21. How do you define an RDD?
Ans: A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. RDDs are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
· Resilient: fault-tolerant, so it can recompute missing or damaged partitions after node failures with the help of the RDD lineage graph.
· Distributed: the data is spread across the nodes of a cluster.
· Dataset: a collection of partitioned data.
22. What does it mean that an RDD is lazily evaluated?
Ans: Lazily evaluated means that the data inside an RDD is not available or transformed until an action is executed that triggers the computation.
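A minimal sketch of lazy evaluation, assuming Spark is on the classpath and a local master ("local[2]") is acceptable; the object name LazyEvaluationSketch and the accumulator name are only illustrative. It counts how often the function passed to map has run before and after an action is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  // Returns (map calls before the action, result of count, map calls after the action).
  def run(): (Long, Long, Long) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-evaluation-sketch")
    val sc = SparkContext.getOrCreate(conf)
    try {
      val mapCalls = sc.longAccumulator("mapCalls")
      // map is a transformation: this line only records the step, nothing runs yet.
      val doubled = sc.parallelize(1 to 10).map { x => mapCalls.add(1L); x * 2 }
      val before = mapCalls.value.longValue()
      // count is an action: it triggers the job that applies the map function.
      val n = doubled.count()
      val after = mapCalls.value.longValue()
      (before, n, after)
    } finally sc.stop()
  }
}
```

Calling LazyEvaluationSketch.run() should show zero map calls before count and ten after it, since the ten elements are only processed once the action runs.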
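23. How would you control the number of partitions of an RDD?
Ans: You can control the number of partitions of an RDD using the repartition or coalesce operations. repartition always shuffles the data and can increase or decrease the number of partitions; coalesce avoids a shuffle by default and is meant for reducing the number of partitions.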
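As a small sketch of both operations (assuming a local master; the object name PartitionCountSketch is hypothetical), the partition count can be read back with getNumPartitions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCountSketch {
  // Returns the partition counts of (original RDD, repartitioned RDD, coalesced RDD).
  def run(): (Int, Int, Int) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("partition-count-sketch")
    val sc = SparkContext.getOrCreate(conf)
    try {
      // Start with an explicit number of partitions.
      val rdd = sc.parallelize(1 to 100, numSlices = 4)
      // repartition performs a full shuffle and may raise or lower the count.
      val more = rdd.repartition(8)
      // coalesce merges existing partitions without a shuffle by default.
      val fewer = rdd.coalesce(2)
      (rdd.getNumPartitions, more.getNumPartitions, fewer.getNumPartitions)
    } finally sc.stop()
  }
}
```

PartitionCountSketch.run() returns (4, 8, 2). The original RDD keeps its four partitions because both operations return a new RDD.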
24. What are the possible operations on an RDD?
Ans: RDDs support two kinds of operations:
· transformations - lazy operations that return another RDD.
· actions - operations that trigger computation and return values.
25. How does an RDD help parallel job processing?
Ans: Spark runs jobs in parallel: an RDD is split into partitions, which are processed and written in parallel. Inside a partition, data is processed sequentially.
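To make the partition-level view concrete, here is a hedged sketch (local master assumed; PartitionParallelismSketch is an illustrative name) that uses mapPartitionsWithIndex. The function is invoked once per partition and walks that partition's elements sequentially through an iterator, while separate partitions can be handled by separate tasks.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionParallelismSketch {
  // Returns one (partition index -> sum of that partition's elements) entry per partition.
  def run(): Map[Int, Int] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("partition-parallelism-sketch")
    val sc = SparkContext.getOrCreate(conf)
    try {
      val rdd = sc.parallelize(1 to 8, numSlices = 4)
      // The function below runs once per partition; `elements` is consumed sequentially.
      rdd.mapPartitionsWithIndex { (index, elements) =>
        Iterator((index, elements.sum))
      }.collect().toMap
    } finally sc.stop()
  }
}
```

The result has one entry per partition (four here), and the per-partition sums add up to the sum of the whole dataset.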
26. What is a transformation?
Ans: A transformation is a lazy operation on an RDD that returns another RDD, like map, flatMap, filter, reduceByKey, join, cogroup, etc. Transformations are not executed immediately, but only when an action is called.
27. How do you define actions?
Ans: An action is an operation that triggers execution of RDD transformations and returns a value to the Spark driver (the user program). Simply put, an action evaluates the RDD lineage graph.
You can think of actions as a valve: until an action is fired, the data to be processed is not even in the pipes, i.e. the transformations. Only actions can materialize the entire processing pipeline with real data.
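28. How can you create an RDD for a text file?
Ans: Use SparkContext.textFile, which reads a text file from HDFS, the local file system, or any other Hadoop-supported file system and returns it as an RDD of its lines.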
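A short sketch of SparkContext.textFile, assuming a local master and a throwaway temporary file created only for the example (the file contents and the object name TextFileSketch are made up for illustration):

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object TextFileSketch {
  // Writes a small temporary file and reads it back as an RDD of lines.
  def run(): List[String] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("text-file-sketch")
    val sc = SparkContext.getOrCreate(conf)
    val path = Files.createTempFile("rdd-text-file", ".txt")
    try {
      Files.write(path, "first line\nsecond line\n".getBytes(StandardCharsets.UTF_8))
      // textFile returns an RDD[String] with one element per line of the file.
      val lines = sc.textFile(path.toUri.toString)
      lines.collect().toList
    } finally {
      sc.stop()
      Files.deleteIfExists(path)
    }
  }
}
```

In a real job the argument would typically be an HDFS or other Hadoop-supported URI rather than a temporary local file.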
29. What are Preferred Locations?
Ans: A preferred location (aka locality preference or placement preference) is a location where a partition should ideally be computed, for example the location of the HDFS block that holds the partition's data.
def getPreferredLocations(split: Partition): Seq[String] specifies placement preferences for a partition in an RDD.
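Placement preferences can be inspected from user code through the public RDD.preferredLocations method. The sketch below is an assumption-laden illustration: it uses the SparkContext.makeRDD overload that attaches location preferences to each element, and the host names host-a.example and host-b.example are invented and do not need to exist, because no job is run.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PreferredLocationsSketch {
  // Returns the preferred locations of each partition, in partition order.
  def run(): List[Seq[String]] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("preferred-locations-sketch")
    val sc = SparkContext.getOrCreate(conf)
    try {
      // This makeRDD overload creates one partition per element and records
      // the given (hypothetical) host names as that partition's preference.
      val rdd = sc.makeRDD[String](Seq(
        ("a", Seq("host-a.example")),
        ("b", Seq("host-b.example"))
      ))
      rdd.partitions.map(p => rdd.preferredLocations(p)).toList
    } finally sc.stop()
  }
}
```

For an RDD read from HDFS, the same call would be expected to return the hosts that store the corresponding blocks; for a plain parallelized collection it returns an empty sequence.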
30. What is an RDD Lineage Graph?
Ans: An RDD lineage graph (aka RDD operator graph) is a graph of all the parent RDDs of an RDD. It is built as a result of applying transformations to the RDD. An RDD lineage graph is hence a graph of the transformations that need to be executed after an action has been called.
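The lineage of an RDD can be printed with toDebugString and walked through dependencies. A minimal sketch, assuming a local master (the object name LineageSketch is illustrative); note that no action is called, so the lineage exists before any data is processed:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  // Returns (textual lineage of the final RDD, number of its direct parent dependencies).
  def run(): (String, Int) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("lineage-sketch")
    val sc = SparkContext.getOrCreate(conf)
    try {
      // Two transformations chained on a parallelized collection.
      val multiplesOfThree = sc.parallelize(1 to 10).map(_ * 2).filter(_ % 3 == 0)
      // toDebugString describes this RDD and its chain of parent RDDs.
      (multiplesOfThree.toDebugString, multiplesOfThree.dependencies.size)
    } finally sc.stop()
  }
}
```

The returned string lists the filter and map steps (both MapPartitionsRDD) on top of the ParallelCollectionRDD they were derived from.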