ACTS, Pune
Suggested Teaching Guidelines for
Big Data Technologies PG-DBDA August
2024
Introduction to Hadoop
o A Brief History of Hadoop
o Evolution of Hadoop
o Introduction to Hadoop and its components
o Comparison with Other Systems
o Hadoop Releases
o Hadoop Distributions and Vendors
Session: 4 & 5
Hadoop Distributed File System (HDFS)
o Distributed File System
o What is HDFS
o Where does HDFS fit in
o Core components of HDFS
o HDFS Daemons
o Hadoop Server Roles: Name Node, Secondary Name Node, and Data
Node
HDFS Architecture
o HDFS Architecture
o Scaling and Rebalancing
o Replication
o Rack Awareness
o Data Pipelining,
o Node Failure Management.
o HDFS High Availability NameNode
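The replication and rack-awareness topics above can be illustrated with a small simulation. This is a minimal sketch of the commonly described default placement for a replication factor of 3 (first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack). The function name `place_replicas`, the rack and node names, and the dictionary-based topology are hypothetical; the sketch assumes at least one other rack has two or more DataNodes, and it does not call any Hadoop API.

```python
import random


def place_replicas(writer, topology, rng=None):
    """Choose three DataNodes for one block.

    topology maps a rack name to the list of DataNode names on that rack.
    All names are illustrative; this is not Hadoop's implementation.
    """
    rng = rng or random.Random(0)
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}

    # Replica 1: the writer's own node, for data locality.
    first = writer

    # Replica 2: a node on a different rack, so losing one rack loses no data.
    # Only racks with at least two nodes are considered, so replica 3 can fit.
    remote_racks = sorted(
        rack for rack, nodes in topology.items()
        if rack != rack_of[writer] and len(nodes) >= 2
    )
    remote_rack = rng.choice(remote_racks)
    second = rng.choice(sorted(topology[remote_rack]))

    # Replica 3: a different node on the same rack as replica 2,
    # which keeps cross-rack write traffic to a single transfer.
    third = rng.choice(sorted(n for n in topology[remote_rack] if n != second))

    return [first, second, third]


cluster = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("dn1", cluster))
```

With only two racks, the second and third replicas always land on the two nodes of the rack that does not hold the writer.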
Lab-Assignment:
o Run the HDFS commands and add a one-line explanation of each
command.
o Execute the provided code using HDFS; step through each run and understand it.
Session: 6
Getting Started: Hadoop Installation
o Hadoop Operation modes
o Setting up a Hadoop Cluster
o Cluster specification
o Single and Multi-Node Cluster Setup on Virtual & Physical Machines,
o Remote Login using Putty/Mac Terminal/Ubuntu Terminal.
o Hadoop Configuration, Security in Hadoop, Administering Hadoop,
o HDFS – Monitoring & Maintenance, Hadoop benchmarks,
o Hadoop in the cloud.
Session: 7
Hadoop Architecture
o Hadoop Architecture,
o Core components of Hadoop,
o Common Hadoop Shell commands.
Session: 8
HDFS Data Storage Process
o HDFS Data storage process,
o Anatomy of writing and reading file in HDFS,
o Handling Read/Write failures
o HDFS user and admin commands,
o HDFS Web Interface.
Session: 9
Getting in touch with Map Reduce Framework
o Hadoop Map Reduce paradigm,
o Map and Reduce tasks,
o Map Reduce Execution Framework,
o Map Reduce Daemons
o Anatomy of a Map Reduce Job run
More Map Reduce Concepts
o Partitioners and Combiners,
o Input Formats (Input Splits and Records, Text Input, Binary Input,
Multiple Inputs),
o Output Formats (Text Output, Binary Output, Multiple Output).
o Distributed Cache
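The roles of the combiner and the partitioner listed above can be sketched in plain Python as a word count. This is an illustrative simulation only, not the Hadoop Java API: the function names (`map_words`, `combine`, `partition`, `run_job`) are hypothetical, and a CRC32 hash stands in for whatever hash function a real partitioner would use.

```python
import zlib
from collections import Counter, defaultdict


def map_words(line):
    """Map step: emit (word, 1) for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def combine(pairs):
    """Combiner: pre-aggregate one mapper's output before the shuffle."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def partition(key, num_reducers):
    """Partitioner: send every occurrence of a key to the same reducer."""
    # CRC32 is used because Python's built-in hash() of str varies per process.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(splits, num_reducers=2):
    """Run map, combine, partition and reduce over a list of input splits."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:  # one simulated mapper per split
        pairs = [pair for line in split for pair in map_words(line)]
        for key, value in combine(pairs):
            buckets[partition(key, num_reducers)][key].append(value)
    # Reduce step: sum the partial counts that arrived at each reducer.
    return [{key: sum(values) for key, values in bucket.items()}
            for bucket in buckets]


print(run_job([["a b a"], ["b c"]]))
```

The combiner reduces how many pairs cross the network, while the partitioner guarantees that each key is summed by exactly one reducer.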
Session: 10
Basics of Map Reduce Programming
o Hadoop Data Types,
o Java and Map Reduce,
o Map Reduce program structure,
o Map-only program, Reduce-only program,
o Use of combiner and partitioner,
o Counters, Schedulers (Job Scheduling),
o Custom Writables, Compression
Lab-Assignment:
o Execute the train data example.
o Execute the train data example using chained methods.
Session: 11
Map Reduce Streaming
o Complex Map Reduce programming,
o Map Reduce streaming,
o Python and Map Reduce,
o Map Reduce on image dataset
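For the streaming topics above, a Python mapper and reducer exchange tab-separated `key<TAB>value` lines, and the framework sorts the mapper output by key before the reducer sees it. The following is a minimal sketch of that contract. In a real streaming job each function would typically live in its own script and read `sys.stdin`; here the functions take any iterable of lines, and `sorted()` stands in for the shuffle, so the example runs without a cluster. The names `mapper` and `reducer` are illustrative.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per word; input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local simulation: sorted() plays the role of the sort between map and reduce.
for result in reducer(sorted(mapper(["b a", "a"]))):
    print(result)
```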
Session: 12
Hadoop ETL
o Hadoop ETL Development,
o ETL Process in Hadoop,
o Discussion of ETL functions,
o Data Extractions,
o Need of ETL tools,
o Advantages of ETL tools.
Lab-Assignment:
o Understand the file formats and read the provided links
Session: 13
Introduction to HBase
o Overview of HBase
o HBase architecture
o Installation
Session: 14 & 15
The HBaseAdmin and HBase Security
o Various Operations on Tables
o HBase general command and shell,
o Java client API for HBase
o Admin API
o CRUD operations
o Client API
o HBase – Scan, Count and Truncate
o HBase Security
Lab-Assignment:
o Run the HBase shell commands
o Run HBase operations using the Java client
Session: 16
The Hive Data Warehouse
o Introduction to Hive,
o Hive architecture and Installation,
o Comparison with Traditional Database,
o Basics of Hive Query Language.
Session: 17
Working with Hive QL
o Datatypes,
o Operators and Functions,
o Hive Tables (Managed Tables and External Tables),
o Partitions and Buckets,
o Storage Formats,
o Importing data,
o Altering and Dropping Tables
Lab-Assignment:
o Create a Hive database and table (internal and external)
o Load the data into the Hive table (using local inpath and HDFS inpath)
Session: 18
Querying with Hive QL
o Querying Data-Sorting,
o Aggregating,
o Map Reduce Scripts,
o Joins and Sub queries,
o Views,
o Map and Reduce side joins to optimize query.
Lab-Assignment:
o Run all the types of joins in Hive
o Execute the data to be partitioned
Session: 19
More on Hive QL
o Data manipulation with Hive,
o UDFs,
o Appending data into existing Hive table,
o custom map/reduce in Hive
o Writing HQL scripts
Session: 20, 21 & 22
o Introduction to Data Warehousing and Data Lakes
o Designing Data warehousing for an ETL Data Pipeline
o Designing Data Lakes for an ETL Data Pipeline
o ETL vs ELT
o Fundamentals of Airflow/Informatica
o Work management with Airflow/ Informatica
o Automating an entire Data Pipeline with Airflow/Informatica
Lab-Assignment:
o Create an Airflow DAG/Informatica workflow for Extract -> Transform -> Load
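The Extract -> Transform -> Load sequence in this lab can be sketched in plain Python before wiring it into a scheduler. This is a minimal sketch under stated assumptions: the sample CSV data, the `weather` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for the warehouse. In an Airflow DAG, each of the three functions would typically become one task, run in this order.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; the Mumbai row has a missing reading.
RAW_CSV = "city,temp_c\nPune,31.5\nMumbai,\nDelhi,38.0\n"


def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with no reading and add a Fahrenheit column."""
    cleaned = []
    for row in rows:
        if not row["temp_c"]:
            continue
        temp_c = float(row["temp_c"])
        cleaned.append((row["city"], temp_c, round(temp_c * 9 / 5 + 32, 1)))
    return cleaned


def load(records, conn):
    """Load: write the records to the target table and return its row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS weather (city TEXT, temp_c REAL, temp_f REAL)"
    )
    conn.executemany("INSERT INTO weather VALUES (?, ?, ?)", records)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM weather").fetchone()[0]


def run_pipeline(text=RAW_CSV):
    """Chain the three steps in ETL order against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    loaded = load(transform(extract(text)), conn)
    return conn, loaded


conn, loaded = run_pipeline()
print(loaded, conn.execute("SELECT * FROM weather ORDER BY city").fetchall())
```

Swapping the order so that raw rows are loaded first and cleaned inside the database would turn the same steps into an ELT pipeline.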
Session: 23, 24 & 25
Apache Spark APIs for large-scale data processing
o Overview, Linking with Spark, Initializing Spark,
o Resilient Distributed Datasets (RDDs), External Datasets
o RDD v/s Data frames v/s Datasets
o Data frame operations
o Spark Structured Streaming
o Passing Functions to Spark, Working with Key-Value Pairs, Shuffle
operations,
o RDD Persistence, Removing Data, Shared Variables, Deploying to a Cluster
Lab-Assignment:
o Run the provided Hadoop Streaming program using Python
Session: 26
o Map Reduce with Spark
o Working with Spark with Hadoop
o Working with Spark without Hadoop and their Differences
Lab Assignment
o Execute all the provided code, stepping through it line by line
o Setup the JDBC configuration and run the Spark JDBC Connectivity
program
o Run the spark integrations using the provided code
Session: 27
o Data preprocessing
o EDA
Session: 28 & 29
o Introduction to Kafka
o Working with Kafka using Spark
o Spark streaming Architecture
o Spark Streaming APIs
o Building Stream Processing Application with Spark
Lab Assignment
o Execute the spark streaming with Kafka
Session: 30
o Setting up Kafka Producer and Consumer
o Kafka Connect API
Session: 31
o Spark SQL
Lab Assignment
o Run the Spark SQL programs, stepping through them line by line
o Run all the Spark SQL programs
o Analyse the election data using Spark and report the findings
Session: 32 & 33
o Spark MLlib
o Predictive Analysis
Lab Assignment:
o Deep Learning with Spark
o Connecting databases with Spark
o Accessing and manipulating the databases
o Demo: Capstone Project
o Create a complex workflow using the Bash operator and a simple workflow
using Python
o Using the Airflow Python operator, read data from your local
drive, ingest the data into HDFS, and run a Spark word count (WC)
ACTS, Pune
Suggested Teaching Guidelines for
Data Visualization - Analysis and Reporting
PG-DBDA August 2024
Duration: 26 Classroom hours + 24 Lab hours
Objective: To introduce students to Data Analytics, Visualization and Reporting
Prerequisites: Knowledge of Database Fundamentals and Big Data
Technologies.
Evaluation method: Theory exam – 40% weightage
Lab exam – 40% weightage
Internal exam – 20% weightage
List of Books / Other training material
Text Book:
1. Communicating Data with Tableau, Ben Jones, O'Reilly, Shroff Publishers &
Distributors, Tableau 8.1.
Reference Book:
1. Mastering Microsoft Power BI: Expert Techniques for Effective Data Analytics
and Business Intelligence, Brett Powell
2. Designing Data Visualizations, Steele, O'Reilly
3. Tableau your data, by Daniel G/ Wiley
4. Graphs Cookbook, Hrishi V. Mittal, Packt Publishing
5. Python Data Visualization Cookbook, Igor Milovanović, Packt Publishing
6. Learning Python Data Visualization, Chad Adams, Packt Publishing
7. Data Visualization with D3.js Cookbook, Nick Qi Zhu, Packt Publishing
8. Getting Started with D3, Mike Dewar, O'Reilly
9. Data Visualization with JavaScript
10. Data Visualization for Dummies
11. High Impact Data Visualization with Power View, Power Map, and Power BI
12. The Visual Organization: Data Visualization, Big Data, and the Quest for Better
Decisions
13. Mastering Tableau 2021, Marleen Meier
Note:
o Tool to be used: Tableau
Session 1 & 2:
o Business Intelligence basics,
o Information gathering,
o Decision making,
o Managing BI,