Big Data (Hadoop + Spark)

Hadoop and Spark Development Course Content

Course Duration: 40 Hours

Module 1 – Linux Prerequisites for Hadoop

  • Linux Basics

Module 2 – Introduction to Big Data

  • What is Big Data?
  • Sources of Big Data
  • Categories of Big Data
  • Characteristics of Big Data
  • Use cases of Big Data
  • Traditional RDBMS vs Hadoop

Module 3 – Introduction to Hadoop

  • What is Hadoop?
  • History of Hadoop
  • Understanding Hadoop Architecture
  • Fundamentals of HDFS (Blocks, NameNode, DataNode, Secondary NameNode)
  • Block Placement & Rack Awareness
  • HDFS Read/Write
  • Under/Over Replication
  • Types of Scaling (Horizontal/Vertical)
  • Drawbacks of Hadoop 1.x
  • Introduction to Hadoop 2.x
  • High Availability
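
As a rough illustration of the HDFS block and replication topics above, the following Python sketch works through the arithmetic. It assumes a 128 MB block size and a replication factor of 3 (the usual Hadoop 2.x defaults); the function names are hypothetical and are not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 128      # assumption: typical Hadoop 2.x default block size
REPLICATION_FACTOR = 3   # assumption: typical default replication factor


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of HDFS blocks a file of the given size is split into."""
    return math.ceil(file_size_mb / block_size_mb)


def raw_storage_mb(file_size_mb, replication=REPLICATION_FACTOR):
    """Total cluster storage used once every block is replicated.

    The last block only occupies as much space as the data it holds,
    so the total is simply file size times replication factor.
    """
    return file_size_mb * replication


def replication_state(live_replicas, target=REPLICATION_FACTOR):
    """Classify a block by comparing live replicas with the target."""
    if live_replicas < target:
        return "under-replicated"
    if live_replicas > target:
        return "over-replicated"
    return "healthy"


if __name__ == "__main__":
    print(block_count(500))        # blocks for a 500 MB file
    print(raw_storage_mb(500))     # MB consumed across the cluster
    print(replication_state(2))    # one replica lost
```

Under these assumptions, a 500 MB file is stored as three full 128 MB blocks plus one partial block, and each of them is kept on three DataNodes.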

Module 4 – HDFS

  • Understanding Hadoop configuration files
  • Hadoop Components: HDFS, MapReduce
  • Overview of Hadoop Processes
  • Overview of Hadoop Distributed File System
  • The building blocks of Hadoop
  • Hands-On Exercise: Using HDFS commands

Module 5 – MapReduce 1 (MRv1)

  • MapReduce Introduction
  • How MapReduce Works
  • Communication between JobTracker and TaskTracker
  • Anatomy of a MapReduce Job Submission
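
To give a feel for how MapReduce works, here is a minimal single-process sketch of the map, shuffle, and reduce phases using word count. It is plain Python rather than the Hadoop API, and the function names are chosen only for illustration; a real job distributes these phases across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield (word, 1)


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return (key, sum(values))


def word_count(lines):
    """Run the three phases in sequence over an iterable of lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

In MRv1 the JobTracker would schedule many such mappers and reducers as tasks on TaskTracker nodes; this sketch only shows the data flow between the phases.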

Module 6 – MapReduce 2 (YARN)

  • Limitations of the MRv1 Architecture
  • YARN Architecture
  • Node Manager & Resource Manager

Module 7 – Hive

  • Introduction to Apache Hive
  • Architecture of Hive
  • Hive data types
  • Exploring Hive metastore tables
  • Types of Tables in Hive
  • Partitions (Static & Dynamic)
  • Buckets & Sampling
  • Indexes & Views
  • Developing Hive scripts
  • Parameter Substitution
  • Differences between ORDER BY and SORT BY, CLUSTER BY and DISTRIBUTE BY
  • Compression options in Hive
  • File formats (Text file, RCFile, ORC, SequenceFile, Parquet)
  • Optimization Techniques in Hive
  • Creating UDFs
  • Hands-On Exercise
  • Assignment on Hive

Module 8 – Sqoop

  • Introduction to Sqoop & its Architecture
  • Importing data from RDBMS to HDFS
  • Importing data from RDBMS to Hive
  • Exporting data from Hive to RDBMS
  • Handling incremental loads using Sqoop
  • Hands-On Exercise

Module 9 – HBase

  • Introduction to HBase
  • Exploring HBase Master & Region Server
  • Create table
  • List table
  • Disabling table
  • Enabling table
  • Dropping table
  • Hands-On Exercise on HBase

Module 10 – Scala Basics

  • Introduction to Functional Programming
  • Interactive Shell – REPL, Data types, Variables, Expressions, Conditional statements, Loops – For comprehension
  • Pattern Matching in Scala with Match expression
  • Simple Functions and their variants, Tail Recursion, Functions as Objects (Anonymous Functions), Higher Order Functions
  • Scala Collections and the usage of higher order methods on Collections
  • Classes and Objects, Class Constructors in Scala, Case Classes, Abstract and Generic Classes
  • Exception Handling in Scala
  • Traits in Scala, Properties of Traits
  • Magic Apply method
  • Singleton and Companion objects
  • Implicits in Scala – Implicit parameters, def, classes

Module 11 – Getting Started with Spark

  • What is Apache Spark & Why Spark?
  • Spark History
  • Unification in Spark
  • Spark Ecosystem vs Hadoop
  • Spark with Hadoop
  • Introduction to Spark’s Python and Scala Shells
  • Spark Standalone Cluster Architecture and its application flow

Module 12 – Programming with RDDs

  • RDD Basics and its characteristics, Creating RDDs
  • RDD Operations
  • Transformations
  • Actions
  • RDD Types
  • Lazy Evaluation
  • Persistence (Caching)

Module – Advanced Spark Programming

  • Accumulators and Fault Tolerance
  • Broadcast Variables
  • Custom Partitioning
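
The lazy evaluation topic listed for RDDs can be sketched in a few lines of plain Python. The `LazyDataset` class below is a hypothetical stand-in, not the Spark API: transformations only record what should be done, and nothing runs until an action (`collect`) is called.

```python
class LazyDataset:
    """Toy model of lazy evaluation (hypothetical, not Spark's RDD class)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # recorded transformations, not yet run

    def map(self, fn):
        """Transformation: record a map step and return a new dataset."""
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        """Transformation: record a filter step and return a new dataset."""
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        """Action: only now are the recorded steps applied to the data."""
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


if __name__ == "__main__":
    calls = []

    def traced_double(x):
        calls.append(x)  # track when the function is really invoked
        return x * 2

    dataset = LazyDataset([1, 2, 3]).map(traced_double).filter(lambda x: x > 2)
    print(calls)              # still empty: no work has been done yet
    print(dataset.collect())  # the action triggers the computation
```

Spark applies the same idea at cluster scale: it builds up a lineage of transformations and only schedules work when an action is invoked.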

Module 13 – Loading and Saving Your Data

  • Dealing with different file formats (Text, CSV, JSON, etc.)
  • Hadoop Input and Output Formats
  • Connecting to diverse Data Sources (HDFS, Hive, S3, RDBMS, NoSQL, etc.)

Module – Spark SQL

  • Linking with Spark SQL
  • Initializing Spark SQL
  • DataFrames & Caching
  • Case Classes, Inferred Schema
  • Loading and Saving Data
  • Apache Hive
  • Data Sources/Parquet
  • JSON
  • JDBC/ODBC Server
  • Spark SQL User Defined Functions (UDFs)
  • Hive UDFs

Module 14 – Kafka

  • Kafka introduction
  • Kafka architecture
  • Kafka fundamentals
  • Basic Kafka operations

Module 15 – Real Time Concepts

  • 1 Project
  • Roles and Responsibilities
  • Real time interview questions and answers