- Introduction to Apache Spark
- Features of Apache Spark
- Apache Spark Stack
- Introduction to RDDs
- RDD Transformations
- What is good and bad in MapReduce?
- Why use Apache Spark
Module 2: Cloudera QuickStart VM Installation (Hands-on Lab + PDF Download) (Available Length 34 Minutes)
- Include Hadoop
- Include Apache Spark
- Include Hive
- Include Sqoop
- Include Hue
Module 3: Deep Dive into HDFS (Available Length 48 Minutes)
- HDFS Design
- Fundamentals of HDFS (Blocks, NameNode, DataNode, Secondary NameNode)
- Rack Awareness
- Read/Write from HDFS
- HDFS Federation and High Availability (Hadoop 2.x.x)
- HDFS Command Line Interface
Module 4: Spark Shell Hands On Using HDFS (Hands-on Lab + PDF Download) (Available Length 34 Minutes)
- Spark Shell Introduction
- Create file using Hue
- Spark Shell extracting file from HDFS
- Create RDD from HDFS file
Module 5: Programming with RDD Part-1 (Hands-on Lab + PDF Download) (Available Length 28 Minutes)
- Creating new RDD
- Transformations on RDD
- Lineage Graph
- Actions on RDD
- RDD Concepts on Persist and Cache
- Lazy evaluation of RDD
Module 6: Scala/Spark Functional Programming (Hands-on Lab + PDF Download) (Available Length 28 Minutes)
- Using Function Literals
- Anonymous Functions
- Define a function which accepts another function
Module 7: RDD Transformation Programming in Depth (Hands-on Lab + PDF Download) (Available Length 24 Minutes)
- Hands on and core concepts of map() transformation
- Hands on and core concepts of filter() transformation
- Hands on and core concepts of flatMap() transformation
- Compare map and flatMap transformation
Module 8: Apache Spark Actions in Depth (Hands-on Lab + PDF Download) (Available Length 36 Minutes)
- Hands on and core concepts of reduce() action
- Hands on and core concepts of fold() action
- Hands on and core concepts of aggregate() action
- Basics of Accumulator
- Hands on and core concepts of collect() action
- Hands on and core concepts of take() action
- Ordered access of RDD
Module 9: Apache Spark Execution Model (Includes PDF Download) (Available Length 35 Minutes)
- How Spark executes a program
- Concepts of RDD partitioning
- RDD data shuffling and performance issues
Module 10: Apache Spark PairRDD (Includes PDF Download) (Available Length 45 Minutes)
- Core concepts of PairRDD
- Creation of PairRDD
- Aggregation in PairRDD
- Aggregation functions understanding in depth
a) How does reduceByKey() work conceptually?
b) How does foldByKey() work conceptually?
c) How does combineByKey() work conceptually?
Module 11: Spark PairRDD Hands-on Lab (Hands-on Lab + PDF Download) (Available Length 12 Minutes)
- reduceByKey
- foldByKey
- combineByKey
- groupByKey
Module 12 : Spark PairRDD Joining, Zipping and Grouping (Hands-on Lab + PDF Download) (Available Length 30 Minutes)
- reduceByKey versus groupByKey performance issue
- cogroup
- zip
- joining (left, right, inner etc.)
Module 13-A: Understanding Hadoop SequenceFile (Available Length 7 Minutes)
Module 13-B: Creating SequenceFile and Processing using Spark (Hands-on Lab) Part-1 (Hands-on Lab + PDF Download) (Available Length 23 Minutes)
- Creating SequenceFile using TSV file
- Loading Data in Apache Hive
- Processing SequenceFile as an RDD
Module 14 : Spark Shared Variables (PDF Download) (Available Length 27 Minutes)
- Shared Variables: Broadcast Variables (Available Length 14 Minutes)
- Shared Variables: Accumulators (Available Length 13 Minutes)
Module 15 : Spark Accumulator (Hands-on Lab + PDF Download) (Available Length 14 Minutes)
- Word count and Character Count
- Counting Bad records in a file
Module 16 : Spark Broadcast Variable (Hands-on Lab + PDF Download) (Available Length 12 Minutes)
- Joining two CSV files, one as a broadcast lookup table
Module 17 : Spark API : Broadcast Variable, Filter Functions and Saving File to HDFS (Hands-on Lab + PDF Download) (Available Length 13 Minutes)
Module 18 : Spark API : Spark Join, GroupBy and Swap function (Hands-on Lab + PDF Download) (Available Length 12 Minutes)
Module 19 : Spark API : Remove Header from CSV file and Map Each column to Row Data (Hands-on Lab + PDF Download) (Available Length 10 Minutes)
Module 20 : Spark SQL (PDF Download) (Available Length 27 Minutes)
- HiveContext
- SchemaRDD replaced by DataFrame API
- History of SparkSQL
- Catalyst Optimizer
Module 21 : SparkSQL Hands-on Sessions (Hands-on Lab + PDF Download) (Available Length 20 Minutes)
- Hive Configuration
- Create Hive table using Spark
- Load Data in Hive table using Spark
- Create another table using DataFrame
Module 22 : Implementing Business Logic using SparkSQL (Hands-on Lab + PDF Download) (Available Length 25 Minutes)
- Loading CSV file
- Spark Case classes (To create schema for CSV file)
- Convert RDD to DataFrame using the DataFrame API to query data
- Using SQL query on DataFrame
Module 23 : Spark Streaming in Depth Part-1 (PDF Download) (Available Length 26 Minutes)
- Real/Near real time data processing
- Streaming Sources and Sinks
- DStream (Discretized Stream)
- DStream Concepts
- Stock Visualization Example (How Streaming Helps)
Module 24 : Spark Streaming in Depth Part-2 (PDF Download) (Available Length 22 Minutes)
- Execution of Spark Streaming
- Spark Streaming Transformation (Stateless and Stateful)
- Combining multiple DStreams
- Understanding transform() operator
Module 25 : SPARK STREAMING PART-3 STATEFUL (WINDOW) TRANSFORMATIONS (Available Length 20 Minutes)
- Window Transformation
- Window Duration and Sliding Duration
- DStream Operations
- WordCount in DStream
Module 26 : Basics of Machine Learning and Data Science (Available Length : 30 Minutes)
- Basics of ML and Data Science
- Example of Machine Learning
- Supervised and Unsupervised Learning
- Key terminology e.g. features, training and testing
- How to choose the right algorithm
- Common steps of Machine Learning
- Collect data
- Prepare Input data
- Analyze Input data
- Train the algorithm
- Test the algorithm
- Use the Algorithm
Module 27 : SPARK STREAMING: REAL TIME STOCK MARKET DATA PROCESSING (HANDS-ON LAB + PDF Download) (Available Length 21 Minutes)
- Problem Statement
- Data Format
- Writing a streaming script to filter larger-volume data
- Write results back to the HDFS file system
Module 28 : SPARK STREAMING: REAL TIME STOCK MARKET DATA MAVEN APPLICATION (Hands-on Lab + PDF Download) (Available Length 37 Minutes)
- Understanding Maven pom.xml
- Importing Scala Application in Eclipse
- Creating Application JAR file using Eclipse and Maven
- Run Spark Streaming Application
- Process data using Spark Stream Application
Module 29 : SPARK STREAMING & SPARK SQL: REAL TIME MARKET DATA APPLICATION (Hands-on Lab) (Available Length 18 Minutes)
- Create Spark Streaming Application
- Use SparkSQL in Spark Streaming Application
- Querying data
Module 30 : SPARK STREAMING WINDOW FUNCTION & SPARK SQL JOIN: REAL TIME MARKET DATA APPLICATION (Hands-on Lab) (Available Length 7 Minutes)
- Create Spark Streaming Application
- Use SparkSQL in Spark Streaming Application
- Joining data sets with real-time streaming data
- Using Spark Streaming window function to calculate a running sum of trade volume
Module 31 : SPARK ADVANCED : DATA PARTITIONING (PDF Download) (Available Length 26 Minutes)
- What is Partitioning and why?
- Data Partitioning example using Join (Hash Partitioning)
- Understand Partitioning using an Example: getting Recommendations for a Customer
- Understand Partitioning code using Spark-Scala
- Operations which create a Partitioned RDD
- Operations which benefit from Partitioning
- Operations that affect the partitioning
Module 32 : SPARK PAIR RDD FUNCTIONS : In Depth (PDF Download)
- reduceByKey() (Available Length 17 Minutes)
- groupByKey() (Available Length 14 Minutes)
- combineByKey() (Available Length 13 Minutes)
- foldByKey() (Available Length 15 Minutes)
- aggregateByKey() (Available Length 11 Minutes)
- Comparison Between Functions (Available Length 11 Minutes)