Spark Interview Questions Part-7

Apache Spark Interview Questions


61.       Which cluster managers can be used with Spark?

Ans:

Apache Mesos, Hadoop YARN, Spark Standalone, and

Spark local: local mode runs on a single node, in a single JVM. The driver and executors run in the same JVM, so the same node is used for execution.

 

 

62.       What is a BlockManager?

Ans: The BlockManager is a key-value store for blocks of data that acts as a cache. It runs on every node in a Spark runtime environment, i.e. on the driver and on each executor. It provides interfaces for putting and retrieving blocks, both locally and remotely, into various stores: memory, disk, and off-heap.

 

The BlockManager manages the storage for most of the data in Spark, i.e. blocks that represent cached RDD partitions, intermediate shuffle data, and broadcast data.

 

63.       What is Data locality / placement?

Ans: Spark relies on data locality (also called data placement, or proximity to the data source), which makes Spark jobs sensitive to where the data is located. It is therefore important to run Spark on a Hadoop YARN cluster if the data comes from HDFS.

 

With HDFS, the Spark driver contacts the NameNode to find the DataNodes (ideally local ones) containing the various blocks of a file or directory, as well as their locations (represented as InputSplits), and then schedules the work to the Spark workers. Spark's compute nodes / workers should run on the storage nodes.

 

64.       What is master URL in local mode?

Ans: You can run Spark in local mode using local, local[n], or the most general local[*].

The URL says how many threads can be used in total:

·         local uses 1 thread only.

·         local[n] uses n threads.

·         local[*] uses as many threads as the number of processors available to the Java virtual machine (it uses Runtime.getRuntime.availableProcessors() to know the number).
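The mapping above can be sketched in plain Python. The helper below (`local_master_threads` is a made-up name, not part of Spark's API) parses a local-mode master URL into the thread count Spark would use; `os.cpu_count()` plays the role of Java's `Runtime.getRuntime.availableProcessors()`:

```python
import os
import re

def local_master_threads(master_url):
    """Illustrative helper (not a Spark API): map a local-mode
    master URL to the number of worker threads Spark would use."""
    if master_url == "local":
        return 1                      # a single thread
    if master_url == "local[*]":
        return os.cpu_count()         # analogue of availableProcessors()
    m = re.fullmatch(r"local\[(\d+)\]", master_url)
    if m:
        return int(m.group(1))        # exactly n threads
    raise ValueError("not a local-mode master URL: " + master_url)

print(local_master_threads("local"))     # 1
print(local_master_threads("local[4]"))  # 4
```

In real code the master URL is passed via spark-submit's --master flag or when building the SparkContext/SparkSession.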

 

65.       Define the components of YARN.

Ans: The YARN components are:

ResourceManager: runs as the master daemon and manages ApplicationMasters and NodeManagers.

ApplicationMaster: a lightweight process that coordinates the execution of an application's tasks and asks the ResourceManager for resource containers for them. It monitors tasks, restarts failed ones, etc. It can run any type of task, whether MapReduce, Giraph, or Spark tasks.

NodeManager: offers resources (memory and CPU) as resource containers.

Container: can run tasks, including ApplicationMasters.

(Note: the NameNode is an HDFS component, not a YARN component.)

 

 

66.       What is a Broadcast Variable?

Ans: Broadcast variables allow the programmer to keep a read-only variable cached on each machine, rather than shipping a copy of it with every task.
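In PySpark the real handle is created with sc.broadcast(value) and read through its .value attribute. The plain-Python sketch below (FakeBroadcast is a made-up stand-in, not Spark's class) only illustrates the idea: one read-only copy shared by all tasks, instead of one copy serialized into every task:

```python
# Conceptual sketch (plain Python, no Spark required): a broadcast
# variable is one read-only copy per machine, instead of one copy
# shipped with every task closure.
lookup = {"a": 1, "b": 2, "c": 3}   # imagine this table is large

class FakeBroadcast:
    """Hypothetical stand-in for PySpark's Broadcast handle."""
    def __init__(self, value):
        self._value = value

    @property
    def value(self):                # read-only access, like bcast.value
        return self._value

bcast = FakeBroadcast(lookup)
tasks = ["a", "b", "c"]             # pretend each key is handled by a task
results = [bcast.value[k] for k in tasks]   # every task reads the same copy
print(results)
```

The key property is that every "task" dereferences the same cached object; nothing is copied per task.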

 

67.       How can you define Spark Accumulators?

Ans: These are similar to counters in the Hadoop MapReduce framework: they give information about the completion of tasks, how much data has been processed, and so on. Tasks can only add to an accumulator; only the driver program can read its aggregated value.
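The write-from-tasks, read-on-driver contract can be sketched in plain Python (FakeAccumulator is a made-up stand-in, not Spark's Accumulator class; the real one comes from sc.accumulator in PySpark):

```python
# Conceptual sketch (plain Python): accumulator semantics.
# Tasks may only *add* to an accumulator; only the driver reads it.
class FakeAccumulator:
    """Hypothetical stand-in for Spark's Accumulator."""
    def __init__(self, initial=0):
        self._value = initial

    def add(self, amount):          # what tasks are allowed to do
        self._value += amount

    @property
    def value(self):                # what only the driver should call
        return self._value

records_processed = FakeAccumulator(0)
for partition in [[1, 2], [3], [4, 5, 6]]:   # pretend: one task per partition
    records_processed.add(len(partition))

print(records_processed.value)  # 6
```

This mirrors the typical use case: counting processed (or malformed) records across all tasks and inspecting the total on the driver after the action completes.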

 

68.       What data sources can Spark process?

Ans:

·         Hadoop File System (HDFS)

·         Cassandra (NoSQL databases)

·         HBase (NoSQL database)

·         S3 (Amazon Web Services storage)

 

69.       What is Apache Parquet format?

        Ans: Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem. Because data is stored column by column, queries can read only the columns they need, and similar values within a column compress well, which reduces both I/O and storage.

 

70.   What is Apache Spark Streaming?

Ans: Spark Streaming processes live data streams. Data can be ingested from many sources such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join, and window.
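Spark Streaming cuts the live stream into micro-batches, and a window operation aggregates over the last several batches (the window length), sliding forward every slide interval. The plain-Python sketch below (`windowed_counts` is a made-up helper, not Spark's API) illustrates that mechanic on a list of micro-batches:

```python
# Conceptual sketch (plain Python): sliding-window aggregation over
# micro-batches, as Spark Streaming's window() operation does.
def windowed_counts(batches, window_len, slide):
    """Hypothetical helper: count events in a window of `window_len`
    batches, advancing the window by `slide` batches each step."""
    out = []
    for end in range(slide - 1, len(batches), slide):
        start = max(0, end - window_len + 1)
        window = [event for batch in batches[start:end + 1] for event in batch]
        out.append(len(window))
    return out

batches = [[1, 2], [3], [4, 5], [6]]         # one inner list per micro-batch
print(windowed_counts(batches, window_len=2, slide=1))
```

In real Spark Streaming code the equivalent would be a windowed transformation on a DStream, with window length and slide interval given as multiples of the batch interval.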
