Big Data Technologies

The document outlines suggested teaching guidelines for a Big Data Technologies course, covering topics such as Hadoop, HDFS, Map Reduce, HBase, Hive, and Spark, along with lab assignments for practical experience. It also includes a section on Data Visualization and Reporting, detailing evaluation methods and recommended textbooks. The course aims to provide a comprehensive understanding of big data technologies and their applications in data analytics and visualization.

Uploaded by: Raghav

ACTS, Pune

Suggested Teaching Guidelines for
Big Data Technologies PG-DBDA August 2024

Introduction to Hadoop
o A Brief History of Hadoop
o Evolution of Hadoop
o Introduction to Hadoop and its components
o Comparison with Other Systems
o Hadoop Releases
o Hadoop Distributions and Vendors

Session: 4 & 5
Hadoop Distributed File System (HDFS)
o Distributed File System
o What is HDFS
o Where does HDFS fit in
o Core components of HDFS
o HDFS Daemons
o Hadoop Server Roles: NameNode, Secondary NameNode, and DataNode
HDFS Architecture
o HDFS Architecture
o Scaling and Rebalancing
o Replication
o Rack Awareness
o Data Pipelining
o Node Failure Management
o HDFS NameNode High Availability

Lab-Assignment:
o Run the HDFS commands and write a one-line explanation of each command.
o Execute the provided HDFS code step by step and understand each step.

Session: 6
Getting Started: Hadoop Installation
o Hadoop Operation modes
o Setting up a Hadoop Cluster
o Cluster specification
o Single and Multi-Node Cluster Setup on Virtual & Physical Machines
o Remote Login using PuTTY/Mac Terminal/Ubuntu Terminal
o Hadoop Configuration, Security in Hadoop, Administering Hadoop
o HDFS – Monitoring & Maintenance, Hadoop benchmarks
o Hadoop in the cloud

Session: 7
Hadoop Architecture
o Hadoop Architecture
o Core components of Hadoop
o Common Hadoop Shell commands

Session: 8
HDFS Data Storage Process
o HDFS Data storage process
o Anatomy of writing and reading a file in HDFS
o Handling Read/Write failures
o HDFS user and admin commands
o HDFS Web Interface

Session: 9
Getting in touch with Map Reduce Framework
o Hadoop Map Reduce paradigm
o Map and Reduce tasks
o Map Reduce Execution Framework
o Map Reduce Daemons
o Anatomy of a Map Reduce Job run
More Map Reduce Concepts
o Partitioners and Combiners
o Input Formats (Input Splits and Records, Text Input, Binary Input, Multiple Inputs)
o Output Formats (Text Output, Binary Output, Multiple Outputs)
o Distributed Cache
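
The map, combine, partition, shuffle, and reduce stages listed above can be illustrated with a small in-memory simulation. This is only a sketch in plain Python, not Hadoop code: the function names (`map_phase`, `combine`, `partition`, `run_job`), the sample input, and the use of a CRC32 hash for partitioning are all assumptions made for illustration.

```python
from collections import defaultdict
import zlib

def map_phase(line):
    """Map task: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1

def combine(pairs):
    """Combiner: pre-aggregate one mapper's output locally to reduce shuffle traffic."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def partition(key, num_reducers):
    """Partitioner: pick the reducer for a key with a deterministic hash (assumed: CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def run_job(splits, num_reducers=2):
    # Shuffle target: one bucket per reducer, with values grouped by key.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:  # one simulated mapper per input split
        mapped = [pair for line in split for pair in map_phase(line)]
        for key, value in combine(mapped):
            buckets[partition(key, num_reducers)][key].append(value)
    # Reduce: sum the grouped values for each key, visiting keys in sorted order.
    result = {}
    for bucket in buckets:
        for key in sorted(bucket):
            result[key] = sum(bucket[key])
    return result

# Hypothetical input: two splits, each a list of text records.
splits = [["big data big"], ["data lake", "big lake"]]
counts = run_job(splits)
print(counts)
```

Because the combiner only pre-sums values that the reducer would sum anyway, removing the `combine` call changes the amount of shuffled data but not the final counts.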

Session: 10
Basics of Map Reduce Programming
o Hadoop Data Types
o Java and Map Reduce
o Map Reduce program structure
o Map-only program, Reduce-only program
o Use of combiner and partitioner
o Counters, Schedulers (Job Scheduling)
o Custom Writables, Compression

Lab-Assignment:
o Execute the train data example.
o Execute the train data example using chained methods.

Session: 11
Map Reduce Streaming
o Complex Map Reduce programming
o Map Reduce streaming
o Python and Map Reduce
o Map Reduce on image dataset
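
A streaming-style word count in Python might look like the sketch below. In Hadoop Streaming, the mapper and reducer are ordinary programs that exchange tab-separated key/value lines over standard input and output, and the reducer receives its input sorted by key. Here the two stages are written as functions over lists of lines, and `sorted()` stands in for the framework's shuffle; the function names and sample records are assumptions for illustration.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would write to stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts for each key; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local simulation of map -> sort (shuffle) -> reduce on hypothetical records.
mapped = list(mapper(["big data", "big lake"]))
reduced = list(reducer(sorted(mapped)))
print(reduced)
```

In an actual streaming job, each function would sit in its own script that reads `sys.stdin` and prints its output lines; that wiring is omitted here to keep the sketch self-contained.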

Session: 12
Hadoop ETL
o Hadoop ETL Development
o ETL Process in Hadoop
o Discussion of ETL functions
o Data Extractions
o Need for ETL tools
o Advantages of ETL tools

Lab-Assignment:
o Understand the file formats and read the provided links

Session: 13
Introduction to HBase
o Overview of HBase
o HBase architecture
o Installation

Session: 14 & 15
The HBaseAdmin and HBase Security
o Various Operations on Tables
o HBase general commands and shell
o Java client API for HBase
o Admin API
o CRUD operations
o Client API
o HBase – Scan, Count and Truncate
o HBase Security

Lab-Assignment:
o Run the HBase shell commands
o Run HBase operations using the Java client

Session: 16
The Hive Data Warehouse
o Introduction to Hive
o Hive architecture and Installation
o Comparison with Traditional Databases
o Basics of Hive Query Language

Session: 17
Working with Hive QL
o Datatypes
o Operators and Functions
o Hive Tables (Managed Tables and External Tables)
o Partitions and Buckets
o Storage Formats
o Importing data
o Altering and Dropping Tables

Lab-Assignment:
o Create a Hive database and tables (internal and external)
o Load data into a Hive table (using a local inpath and an HDFS inpath)

Session: 18
Querying with Hive QL
o Querying Data: Sorting and Aggregating
o Map Reduce Scripts
o Joins and Subqueries
o Views
o Map-side and Reduce-side joins to optimize queries
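
The difference between a map-side and a reduce-side join can be sketched in plain Python. This is an illustration of the idea only, not Hive or Hadoop code: a map-side join keeps the small table in memory so each row of the large table is joined without a shuffle, while a reduce-side join tags rows by source, groups them by join key, and pairs them afterwards. The tables, names, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample data: a small dimension table and a larger fact table.
departments = [(10, "Sales"), (20, "Engineering")]
employees = [("Asha", 10), ("Ravi", 20), ("Meera", 10), ("Tom", 30)]

def map_side_join(big_rows, small_rows):
    """Inner join in the map phase: the small table is held in memory as a lookup."""
    lookup = dict(small_rows)  # assumes the small table fits in memory
    return [(name, lookup[dept_id])
            for name, dept_id in big_rows if dept_id in lookup]

def reduce_side_join(big_rows, small_rows):
    """Inner join in the reduce phase: rows are tagged, grouped by key, then paired."""
    grouped = defaultdict(lambda: {"emp": [], "dept": []})
    for name, dept_id in big_rows:  # map: tag each row with its source table
        grouped[dept_id]["emp"].append(name)
    for dept_id, dept_name in small_rows:
        grouped[dept_id]["dept"].append(dept_name)
    joined = []
    for dept_id in sorted(grouped):  # reduce: pair the rows that share a key
        sides = grouped[dept_id]
        joined += [(e, d) for e in sides["emp"] for d in sides["dept"]]
    return joined

map_result = map_side_join(employees, departments)
reduce_result = reduce_side_join(employees, departments)
print(sorted(map_result) == sorted(reduce_result))
```

Both functions return the same rows; the map-side version avoids grouping every row by key, which is why it is preferred when one table is small enough to hold in memory.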

Lab-Assignment:
o Run all the types of joins in Hive
o Partition the provided data in Hive

Session: 19
More on Hive QL
o Data manipulation with Hive
o UDFs
o Appending data into an existing Hive table
o Custom map/reduce in Hive
o Writing HQL scripts

Session: 20, 21 & 22
o Introduction to Data Warehousing and Data Lakes
o Designing Data Warehousing for an ETL Data Pipeline
o Designing Data Lakes for an ETL Data Pipeline
o ETL vs ELT
o Fundamentals of Airflow/Informatica
o Work management with Airflow/Informatica
o Automating an entire Data Pipeline with Airflow/Informatica

Lab-Assignment:
o Create an Airflow DAG/Informatica workflow for Extract -> Transform -> Load

Session: 23, 24 & 25
Apache Spark APIs for large-scale data processing
o Overview, Linking with Spark, Initializing Spark
o Resilient Distributed Datasets (RDDs), External Datasets
o RDDs vs DataFrames vs Datasets
o DataFrame operations
o Spark Structured Streaming
o Passing Functions to Spark, Working with Key-Value Pairs, Shuffle operations
o RDD Persistence, Removing Data, Shared Variables, Deploying to a Cluster

Lab-Assignment:
o Run the provided Hadoop Streaming program using Python

Session: 26
o Map Reduce with Spark
o Working with Spark with Hadoop
o Working with Spark without Hadoop and their Differences

Lab Assignment
o Execute all the provided code step by step, line by line
o Set up the JDBC configuration and run the Spark JDBC Connectivity program
o Run the Spark integrations using the provided code

Session: 27
o Data preprocessing
o EDA

Session: 28 & 29
o Introduction to Kafka
o Working with Kafka using Spark
o Spark streaming Architecture
o Spark Streaming APIs
o Building Stream Processing Application with Spark

Lab Assignment
o Run Spark Streaming with Kafka

Session: 30
o Setting up Kafka Producer and Consumer
o Kafka Connect API

Session: 31
o Spark SQL

Lab Assignment
o Run the Spark SQL programs step by step, line by line
o Run all the Spark SQL programs
o Analyse the election data using Spark and report the findings

Session: 32 & 33
o Spark MLlib
o Predictive Analysis

Lab Assignment:
o Deep Learning with Spark
o Connecting databases with Spark
o Accessing and manipulating the databases
o Demo: Capstone Project
o Create a complex workflow using the Bash operator and a simple workflow using Python
o Using the Airflow Python operator, create a workflow that reads data from your local drive, ingests it into HDFS, and runs a Spark word count

ACTS, Pune

Suggested Teaching Guidelines for
Data Visualization - Analysis and Reporting
PG-DBDA August 2024

Duration: 26 Classroom hours + 24 Lab hours

Objective: To introduce students to Data Analytics, Visualization and Reporting

Prerequisites: Knowledge of Database Fundamentals and Big Data Technologies.

Evaluation method:
Theory exam – 40% weightage
Lab exam – 40% weightage
Internal exam – 20% weightage

List of Books / Other training material

Text Book:
1. Communicating Data with Tableau, Ben Jones, O'Reilly, Shroff Publishers & Distributors, Tableau 8.1

Reference Book:

1. Mastering Microsoft Power BI: Expert Techniques for Effective Data Analytics and Business Intelligence, Brett Powell
2. Designing Data Visualizations, Steele, O'Reilly
3. Tableau Your Data, Daniel G, Wiley
4. Graphs Cookbook, Hrishi V. Mittal, Packt Publishing
5. Python Data Visualization Cookbook, Igor Milovanović, Packt Publishing
6. Learning Python Data Visualization, Chad Adams, Packt Publishing
7. Data Visualization with D3.js Cookbook, Nick Qi Zhu, Packt Publishing
8. Getting Started with D3, Mike Dewar, O'Reilly
9. Data Visualization with JavaScript
10. Data Visualization for Dummies
11. High Impact Data Visualization with Power View, Power Map, and Power BI
12. The Visual Organization: Data Visualization, Big Data, and the Quest for Better Decisions
13. Mastering Tableau 2021, Marleen Meier

Note:
o Tool to be used: Tableau

Session 1 & 2:
o Business Intelligence basics
o Information gathering
o Decision making
o Managing BI
