Apache Spark Interview Questions



51.       Define Spark architecture

Ans: Spark uses a master/worker architecture. A driver talks to a single coordinator, called the master, which manages the workers on which executors run. The driver and each executor run in their own Java processes.


52.       What is the purpose of Driver in Spark Architecture?

Ans: A Spark driver is the process that creates and owns an instance of SparkContext. It runs the main method of your Spark application, in which the SparkContext instance is created.

·         The driver splits a Spark application into tasks and schedules them to run on executors.

·         A driver is where the task scheduler lives and spawns tasks across workers.

·         A driver coordinates workers and overall execution of tasks.


53.       Can you define the purpose of master in Spark architecture?

Ans: A master is a running Spark instance that connects to a cluster manager for resources. The master acquires cluster nodes to run executors.


54.       What are the workers?

Ans: Workers or slaves are running Spark instances where executors live to execute tasks. They are the compute nodes in Spark. A worker receives serialized/marshalled tasks that it runs in a thread pool.
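
The idea of a worker receiving serialized tasks and running them in a thread pool can be sketched in plain Python. This is only a toy model, not Spark code: pickle stands in for Spark's own serializer, and the names run_serialized_task and worker_run are hypothetical.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor

def run_serialized_task(payload):
    """Deserialize one task and run it on its partition of data."""
    func, partition = pickle.loads(payload)
    return func(partition)

def worker_run(payloads, num_threads=2):
    """Run every serialized task on a thread pool; results keep submission order."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(run_serialized_task, payloads))

# Driver side (hypothetical): built-in functions, each paired with one partition.
payloads = [pickle.dumps(task) for task in [(sum, [1, 2, 3]), (max, [4, 5]), (len, [6])]]
results = worker_run(payloads)
print(results)  # [6, 5, 1]
```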


55.       Please explain how workers work when a new job is submitted to them.

Ans: When the SparkContext is created, each worker starts an executor. The executor is a separate Java process (a new JVM), and it loads the application jar into that JVM. The executors then connect back to your driver program, and the driver sends them commands such as foreach, filter and map. As soon as the driver quits, the executors shut down.


56.       Please define executors in detail?

Ans: Executors are distributed agents responsible for executing tasks. Executors provide in-memory storage for RDDs that are cached in Spark applications. When executors are started, they register themselves with the driver and then communicate with it directly to execute tasks.


57.       What is DAGScheduler and how does it work?

Ans: DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling, i.e. after an RDD action has been called it becomes a job that is then transformed into a set of stages that are submitted as TaskSets for execution.


DAGScheduler uses an event queue architecture in which a thread can post DAGSchedulerEvent events, e.g. a new job or stage being submitted, that DAGScheduler reads and executes sequentially.
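
The event queue pattern described above can be sketched as follows. This is a minimal illustration in plain Python, not DAGScheduler's actual implementation: the class ToyEventLoop and the event strings are hypothetical placeholders. Any thread may post an event, but a single loop thread reads and handles them one at a time.

```python
import queue
import threading

class ToyEventLoop:
    """A single thread that handles posted events strictly in arrival order."""

    def __init__(self):
        self._events = queue.Queue()
        self.handled = []  # record of events, in the order they were processed
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def post(self, event):
        """Called by any thread; only enqueues, never processes."""
        self._events.put(event)

    def stop(self):
        self._events.put(None)  # sentinel: tells the loop thread to exit
        self._thread.join()

    def _run(self):
        while True:
            event = self._events.get()
            if event is None:
                break
            self.handled.append(event)  # stand-in for real event handling

loop = ToyEventLoop()
loop.start()
for event in ["job submitted", "stage 0 submitted", "stage 1 submitted"]:
    loop.post(event)
loop.stop()
print(loop.handled)
```

Because only the loop thread touches the scheduler state, the handlers need no locking of their own; that is the usual motivation for this design.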


58.       What is a stage, with regard to Spark job execution?

Ans: A stage is a set of parallel tasks, one per partition of an RDD, that compute partial results of a function executed as part of a Spark job.
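
As a rough illustration of "one task per partition, each computing a partial result", here is a toy model in plain Python. It does not use Spark; make_partitions and run_stage are hypothetical helper names, and the tasks run sequentially here although Spark would run them in parallel.

```python
def make_partitions(data, num_partitions):
    """Split data into num_partitions slices by striding over it."""
    return [data[i::num_partitions] for i in range(num_partitions)]

def run_stage(partitions, func):
    """One task per partition: apply func to each partition independently."""
    return [func(partition) for partition in partitions]

data = list(range(1, 11))               # the numbers 1..10
partitions = make_partitions(data, 3)   # [[1, 4, 7, 10], [2, 5, 8], [3, 6, 9]]
partial_sums = run_stage(partitions, sum)
total = sum(partial_sums)               # combine the partial results
print(partial_sums, total)  # [22, 15, 18] 55
```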


59.       What is a task, with regard to Spark job execution?

Ans: A task is an individual unit of work for executors to run. It is a unit of physical execution (computation) that runs on a single machine and processes one part of your Spark application's data. All tasks in a stage must complete before execution moves on to the next stage.

·         A task can also be considered a computation in a stage on a partition in a given job attempt.

·         A Task belongs to a single stage and operates on a single partition (a part of an RDD).

·         Tasks are spawned one by one for each stage and data partition.
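
The relationship between stages, partitions and tasks can be sketched with a small toy model. It is a simplification under stated assumptions, not Spark's scheduler: the Task class, plan_tasks and run_job are hypothetical, every stage is given the same number of partitions, and stages run strictly one after another.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    stage_id: int     # the single stage this task belongs to
    partition: int    # the single partition it operates on
    attempt: int = 0  # which attempt of the task this is

def plan_tasks(num_stages, num_partitions):
    """Return one list of tasks per stage, with one task per partition."""
    return [[Task(stage, part) for part in range(num_partitions)]
            for stage in range(num_stages)]

def run_job(plan):
    """Run stages in order; a stage starts only after the previous one is done."""
    finished = []
    for stage_tasks in plan:
        for task in stage_tasks:
            finished.append(task)  # stand-in for actually running the task
    return finished

plan = plan_tasks(num_stages=2, num_partitions=3)
order = run_job(plan)
print([(t.stage_id, t.partition) for t in order])
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```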


60.   What is speculative execution of tasks?

Ans: Speculative tasks, or task stragglers, are tasks that run slower than most of the other tasks in a job.


Speculative execution of tasks is a health-check procedure that looks for tasks to be speculated, i.e. tasks that are running slower in a stage than the median of all successfully completed tasks in the taskset. Such slow tasks are re-launched on another worker. Spark does not stop the slow task; it runs a new copy in parallel.
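
The median-based check can be sketched as below. This is a simplified model, not Spark's code: find_speculatable is a hypothetical function, and its multiplier and quantile parameters only mirror the idea behind the spark.speculation.multiplier and spark.speculation.quantile settings, with values chosen here for illustration.

```python
from statistics import median

def find_speculatable(completed, running, total_tasks, multiplier=1.5, quantile=0.75):
    """Return ids of running tasks that look slow enough to re-launch.

    completed: durations of successfully finished tasks in the taskset
    running:   mapping of task id to elapsed time of still-running tasks
    """
    # Only start checking once enough of the taskset has finished.
    if len(completed) < quantile * total_tasks:
        return []
    threshold = multiplier * median(completed)
    return [task_id for task_id, elapsed in running.items() if elapsed > threshold]

completed = [10, 12, 11, 9, 10, 13]  # six finished tasks, median 10.5
running = {6: 14, 7: 40}             # two tasks still running
slow = find_speculatable(completed, running, total_tasks=8)
print(slow)  # [7], because 40 > 1.5 * 10.5 while 14 is not
```

In Spark itself, speculation is off unless spark.speculation is set to true.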
