Semester III
POSTGRADUATE DIPLOMA IN IT (PGDIT) -BIG DATA
CURRICULUM
Semester -1
                                                                         Hours/week           Marks
Paper No              Title of the Paper                                 L   P   T   Credits  IA    UE    Total
14PGDIT-BDA101        Programming Java, Agile and Raptor                 3   1   1   4        30    70    100
14PGDIT-BDA102        Linux fundamentals and Python                      3   1   -   4        30    70    100
14PGDIT-BDA103        Big data – 1 (Storage and processing - Pig, Hive)  3   2   -   4        30    70    100
14PGDIT-BDA104        Big data – 2 (HBase and time series)               3   2   -   4        30    70    100
14PGDIT-DA105         Learning Lab – 1 (Agile and Raptor)                -   3   -   1        CA=50 -     50
14PGDIT-DA107AL (or)
14PGDIT-DA107BL (or)  Learning Lab – 2 (Big data)                        -   3   -   1        CA=50 -     50
14PGDIT-DA107CL
Total                                                                    12  12  1   18       -     -     500
Total hours/week: 25
Semester -2
                                                                         Hours/week           Marks
Paper No.             Title of the Paper                                 L   P   T   Credits  IA    UE    Total
14PGDIT-BDA201        Advanced Big Data – 1 (SPARK)                      3   2   -   4        30    70    100
14PGDIT-BDA202        Advanced Big Data – 2 (SPARK streaming)            3   2   -   4        30    70    100
14PGDIT-BDA203        Advanced Big Data – 3 (SPARK – Machine Learning)   3   1   -   4        30    70    100
14PGDIT-BT202A (or)   Subject (Security/Social media/Data Lake)          3   1   1   4        30    70    100
14PGDIT-BDA203        Learning Lab – 1a (applied big data)               -   3   -   1        CA=50 -     50
14PGDIT-BDA204L       Learning Lab – 1a (applied big data)               -   3   -   1        CA=50 -     50
Total                                                                    12  12  1   18       -     -     500
Total hours/week: 25
Semester 1 Project work:
Subject Code      Title of the project             Credits  IA   UE (Dissertation + Viva)   Total
14PGDIT-BDA205L   Project – Data Engineering – 1   4        13   50 + 25                    100
14PGDIT-BDA206    Project – Data Engineering – 1   3        12
Final Project work (Second semester):
Subject Code      Title of the project             Credits  IA   UE (Dissertation + Viva)   Total
14PGDIT-BDA203    Project – 1a (applied big data)  4        13   50 + 25                    100
14PGDIT-BDA204L   Project – 1a (applied big data)  3        12
Total credits = 18 + 7 + 18 + 7 = 50
Total hours = 25 + 25 = 50 hours/week
Total marks: 1200
SYLLABI
SEMESTER I
Programming Java, Agile and Raptor
XX Hours
1. Agile Programming 15 Hrs
Roles in Agile - Cross-functional Team - How an Agile Team Plans its Work? - What is a User Story? - Relationship
of User Stories and Tasks - When a Story is Done - What is Acceptance Criteria? - How the Requirements are
Defined?
Twelve Principles of Agile Manifesto - Agile – Characteristics - Iterative/incremental and Ready to Evolve - Face-
to-face Communication - Feedback Loop - User Story - Iteration – Release planning - Who is Involved? -
Prerequisites of Planning - Materials Required - Planning Data - Output
2. Raptor 15 Hrs
Program design and development process
Problem definition
Pseudo-code
Flowcharting
Code modularization
Coding, testing, and debugging
Sequence, selection, and iteration patterns
Array processing
File processing
Values and Variables
Integer Values
Variables and Assignment
Identifiers
Additional Integer Types
Floating-point Types
Constants
Other Numeric Types
Characters
Enumerated Types
Expressions and Arithmetic
Expressions
Mixed Type Expressions
Operator Precedence and Associativity
Comments
Compile-time Errors
Run-time Errors
Logic Errors
Compiler Warnings
Arithmetic Examples
Integer Implementation
Floating-point Implementation
Bitwise Operators
Algorithms
Conditional Execution
Type bool
Boolean Expressions
The Simple if Statement
Compound Statements
The if/else Statement
Compound Boolean Expressions
Nested Conditionals
Multi-way if/else Statements
Iteration
The while Statement
Nested Loops
Abnormal Loop Termination
The break statement
The goto Statement
The continue Statement
Infinite Loops
Iteration Examples
Drawing a Tree
Printing Prime Numbers
Using Functions
Introduction to Using Functions
Standard Math Functions
Maximum and Minimum
clock Function
Character Functions
Random Numbers
Arrays
Static Arrays
Pointers and Arrays
Dynamic Arrays
Copying an Array
Multidimensional Arrays
Command-line Arguments
Vectors vs. Arrays
Prime Generation with a Vector
Custom Objects
Object Basics
Instance Variables
Member Functions
Constructors
Defining a New Numeric Type
Encapsulation
Handling Exceptions
Motivation
Exception Examples
Custom Exceptions
Catching Multiple Exceptions
Exception Mechanics
Using Exceptions
3. Basic Java 10 Hrs
Creating Java Projects
Variables, Datatypes and Operators
Primitive Data Types - The Byte, Short, Int And Long
Primitive Data Types - Float And Double
Primitive Data Types - Char And Boolean
Understanding Strings
Operators In Java And Operator Precedence
Expressions, Statements, Code blocks, Methods and more
Keywords And Expressions
Statements, Whitespace and Indentation (Code Organization)
Code Blocks And The If Then Else Control Statements
Methods In Java
Method Overloading
Control Flow Statements
The switch statement
The for Statement
The while and do while statements
Euler project exercises (basic – 20)
4. Intermediate Java 10 Hrs
OOP Part - Classes, Constructors and Inheritance
Classes
Constructors
Inheritance
Composition
Encapsulation
Polymorphism
Advanced data types
Arrays, Java inbuilt Lists, Autoboxing and Unboxing
Arrays
List and ArrayList
Autoboxing and Unboxing
LinkedList
Inner and Abstract Classes & Interfaces
Java Generics
Naming Conventions
Packages
Scope
Access Modifiers
The static keyword
The final keyword
Java Collections
Binary Search
Collections List Methods
Comparable and Comparator
Maps
Immutable Classes
Sets & HashSet
Sorted Collections
TreeMap and Unmodifiable Maps
Euler project (Intermediate – 20)
5. Advanced Java 10 Hrs
Basic Input & Output including java.util
Exceptions
Stack Trace and Call Stack
Catching and throwing Exceptions
Multi Catch Exceptions
Introduction to I/O
Writing content - FileWriter class and Finally block
FileReader and Closeable
BufferedReader
Load Big Location and Exits Files
Buffered Writer and Challenge
Byte Streams
Reading Binary Data and End of File Exceptions
Object Input Output including Serialization
Random Access File
Java NIO
Separators Temp Files and File Stores
Concurrency and Threads Introduction
Multiple Threads
Synchronisation
Producer and Consumer
Lambda Expressions
Scope and Functional Programming
Regular Expressions
Debugging and Unit Testing
Databases
Creating Databases With JDBC in Java
JDBC Insert, Update, Delete
executeQuery() and using Constants
Result Set Meta Data
Transactions
Inserting Records With JDBC
Handling Updates
Books & References
Text books:
1. Java for Programmers, Deitel and Deitel, Prentice Hall, 2016
Reference Books:
Thinking in Java, Bruce Eckel, Prentice Hall, 2012
Effective Java, 2nd edition, Joshua Bloch, Addison-Wesley, 2008
Linux fundamentals and Python programming
XX Hours
1 : Linux Basics
Introduction
Linux and the Operating System
Graphical Environments and Interfaces
Getting Help
Text Editors
Shells, bash, and the Command Line
System Components
System Administration
Essential Command Line Tools
Command and Tool Details
Users and Groups
Bash Scripting
Files and Filesystems
Linux Intermediate
Filesystem Layout
Linux Filesystems
Compiling, Linking and Libraries
Java Installation and Environment
Python and dependency installation
Building RPM and Debian Packages
2: GIT and version control
Introduction to GIT
Git Installation
Git and Revision Control Systems
Using Git: an Example
Git Concepts and Architecture
Managing Files and the Index
Commits
Branches
Diffs
Merges
Managing Local and Remote Repositories
Using Patches
3 : Basic Python
Overview of Python- Starting with Python
Introduction to installation of Python
Understand Jupyter notebook & Customize Settings
Concept of Packages/Libraries - Important packages(NumPy, SciPy, scikit-learn, Pandas, Matplotlib, etc)
Installing & loading Packages & Name Spaces
Data Types & Data objects/structures (strings, Tuples, Lists, Dictionaries)
List and Dictionary Comprehensions
Variable & Value Labels – Date & Time Values
Basic Operations - Mathematical - string - date
Reading and writing data
Simple plotting
Control flow & conditional statements
Debugging & Code profiling
How to create classes and modules and how to call them
Packages in Python for analytics - NumPy, SciPy, pandas, scikit-learn, statsmodels, nltk etc.
Working with Data in Python
Importing Data from various sources (CSV, TXT, Excel, Access etc.)
Database Input (Connecting to database)
Viewing Data objects - subsetting, methods
Exporting Data to various formats
Important python modules: Pandas, beautifulsoup
Cleansing Data with Python
Data manipulation steps (sorting, filtering, duplicates, merging, appending, subsetting, derived variables,
sampling, data type conversions, renaming, formatting etc.)
Data manipulation tools (operators, functions, packages, control structures, loops, arrays etc.)
Python Built-in Functions (Text, numeric, date, utility functions)
Python User Defined Functions
Stripping out extraneous information
Normalizing data
Formatting data
Important Python modules for data manipulation (Pandas, Numpy, re, math, string, datetime etc)
4: Data Analysis in Python
Introduction exploratory data analysis
Descriptive statistics, Frequency Tables and summarization
Univariate Analysis (Distribution of data & Graphical Analysis)
Bivariate Analysis(Cross Tabs, Distributions & Relationships, Graphical Analysis)
Creating graphs (bar, pie, line chart, histogram, boxplot, scatter, density etc.)
Important packages for exploratory analysis (NumPy arrays, Matplotlib, seaborn, pandas, scipy.stats etc.)
5: Python for Big Data
Introduction to ODBC and data base programming in Python
Introduction to Python streaming for Hadoop
Introduction to PySpark
Sample programs for practice:
Word frequency count and visualization
Sales performance report development
Monte Carlo simulation of stock price
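As an illustration of the first practice program listed above (word frequency count), here is a minimal sketch using only the Python standard library. The function name `word_frequencies`, the tokenizing regular expression and the sample sentence are assumptions made for this example; they are not prescribed by the syllabus, and the visualization step is left out.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=3):
    """Return the top_n most common lowercase words in text.

    Hypothetical helper for illustration; words are runs of letters
    and apostrophes, which is a simplifying assumption.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)


if __name__ == "__main__":
    # Made-up sample sentence, not course data.
    sample = "Big data needs big tools, and big data needs care."
    for word, count in word_frequencies(sample):
        print(f"{word}: {count}")
```

The resulting (word, count) pairs could then be passed to a plotting library such as Matplotlib for the visualization part of the exercise.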
Books & References
Based on Linux Foundation training reference https://training.linuxfoundation.org/linux-courses/development-training, LFD301 - Introduction to Linux, Open Source Development, & GIT
Mastering Linux, Paul S. Wang, CRC Press
Learn Enough Git to Be Dangerous, Michael Hartl, LearnEnough.com (https://www.learnenough.com/git-tutorial)
Python: Journey from Novice to Expert (Module 1 only), Dusty Phillips, Fabrizio Romano, Rick van
Hattem, Packt Publishing
Python: Data Analytics and Visualization, Ashish Kumar, Kirthi Raman, Martin Czygan, Phuong Vo.T.H,
Packt Publishing
Big data – 1 (Storage and processing- Pig, Hive)
1: Basics of big data
Unit I: Course Introduction
Understanding of lab setup details
Prerequisites to run the preconfigured VMWare Virtual image, software & hardware requirements
Login credentials
Logging authorization.
Unit II: Introduction to Big Data
Introduction to Big Data
Definition
Different types of Data
Identifying the demand of Big Data and its use cases.
Unit III: Introduction to HPE IDOL
What is HPE IDOL
HPE IDOL use cases
Examining the complexity of the infrastructure software from a high-level perspective
Understanding the architecture of HPE IDOL
Different components used
Understanding of license server and its validity
HPE IDOL Server configuration
Different types of connectors supported and their uses, based on the respective ports.
2. Configuration and administration
Unit IV: HPE IDOL Administration
Services in the basic architecture and their functions
Understanding the HPE IDOL software through its web-based graphical user interface: navigating the different
tabs and their uses, such as checking status, the synchronous process, adding or creating a database,
adding content, and initializing and indexing documents
What is indexing?
Different indexing options and their uses.
Unit V: HPE IDOL Configuration
Understanding the different sections of the IDOL server configuration file
Start editing the sections as per the requirements.
Configuring the license server
Overview of the file system, configurations file, connector framework server configuration file and the
different uses of the different sections.
Unit VI: Exploring the connectors
Understanding the different types of connectors available in HPE IDOL software
Start working by configuring the File system connectors
Connector Framework server
HTTP connectors fetching and indexing manually From/To XML files.
3. Social media
Unit VII: Social Media Connector Configuration
Different configuration files
Start working on the live web pages/ social media connectors
Understanding the security of the Social Media
Creating the Apps
App Keys
Secret Keys
Retrieving the data and placing in the respective database
Unit Work: Using accounts created on social media websites such as Facebook and Twitter, work with the
Facebook and Twitter social media connectors; an assignment covers the LinkedIn social media connector.
Unit VIII: HPE IDOL Media Servers
Understanding the media server and its configuration
Feasible hardware for the media server configuration.
Understanding how to ingest, analyze and encode the media
Understanding the Media Server architecture, its system requirements and software dependencies.
Introduction to the Video Logger Software
Optical character recognition image server
Speech server configuration.
Unit IX: Retrieval using IDOL Find, end user search Interface
Understanding of search engine and its uses
Setting the conceptual parameters
Understanding the execution of keyword, proximity and Boolean searches
Conceptualizing and deploying find.
Unit X: Action Commands
Understanding different action commands
Action Command Syntax
GET Request
Query actions
Get content actions
List actions
Saving an output
Unit XI: Parametric Search
Advanced search using the functions, and indexing the parametric data
Parametric parameters and their uses
Unit XII: Introduction to Hadoop Big Data
Introduction about Hadoop Big Data
Types of data such as structured, semi-structured and unstructured data, and their uses.
4. Architecture
Unit XIII: Hadoop Architecture
Understanding the prerequisites of hardware and software
Understanding various configurations and services of Hadoop.
Understanding the difference between a regular file system and the Hadoop Distributed File System (HDFS).
Unit XIV: Introduction to MapReduce
Concept of MapReduce
Different roles of the user
Working with the JobTracker and TaskTracker
Flow of MapReduce
Different concepts of MapReduce.
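To make the flow of MapReduce covered in this unit concrete, the sketch below imitates the map, shuffle and reduce phases in plain Python for a word count. It runs in a single process and does not use Hadoop, the JobTracker or the TaskTracker; all function names are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in records:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single sum."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(records):
    """Chain the three phases, as a single-machine stand-in for a job."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    # Made-up input lines for illustration.
    print(word_count(["big data", "big cluster"]))
```

In a real Hadoop job the map and reduce functions run on different nodes and the framework performs the shuffle; only the division of work into the three phases carries over from this sketch.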
5. Advanced concepts
Unit XV: Advanced HDFS and MapReduce
Advanced Hadoop file system
Hadoop-related concepts such as the steps for decommissioning a DataNode, advanced MapReduce
concepts and various joins in MapReduce
Unit XVI: Ecosystem and Its Components
Hadoop ecosystem Structure
Different components of Hadoop ecosystem and the different roles
Unit XVII: Basic Hadoop Administration, troubleshooting and security
Identification of different parameters for performance monitoring
Performance tuning
Configure the security parameters in Hadoop
Unit XVIII: Configuring & Integrating Hadoop using IDOL Connector
Understanding the configuration file of IDOL Connector for Hadoop
Different sections of the configuration file
Specific changes as per the configuration along with password encryption
Setting up a secured communication.
Big Data – 2 (HBase and Time Series) XX Hours
1. Introduction to HBase
o The problem with distributed computing
o Installing HBase
o The role of HBase in the Hadoop ecosystem
o How is HBase different from RDBMS?
o HBase Data Model
o Introducing CRUD operations
o HBase is different from Hive
CRUD operations using the HBase Shell
o 1 - Creating a table for User Notifications
o 2 - Inserting a row
o 3 - Updating a row
o 4 - Retrieving a row
o 5 - Retrieving a range of rows
o 6 - Deleting a row
o 7 - Deleting a table
CRUD operations using the Java API
o 8 - Creating a table with HBaseAdmin
o 9 - Inserting a row using a Put object
o 10 - Inserting a list of Puts
o 11 - Retrieving data - Get and Result objects
o 12 - A list of Gets
o 13 - Deleting a row
o 14 - A list of Deletes
o 15 - Mix and match with batch operations
o 16 - Scanning a range of rows
o 17 - Deleting a table
2. HBase Architecture and advanced operations
o HBase Architecture
Advanced operations - Filters and Counters
o 18 - Filter by Row id - RowFilter
o 19 - Filter by column value - SingleColumnValueFilter
o 20 - Apply multiple conditions - FilterList
o 21 - Retrieve rows within a time range
o 22 - Atomically incrementing a value with Counters
3. MapReduce with HBase
o 23 : A MapReduce task to count Notifications by Type
o 23 continued: Implementing the MapReduce in Java
o Demo : Running a MapReduce task
4. Build a Notification Service
o 24 : Implement a Notification Hierarchy
o 25: Implement a Notifications Manager
5. Time series and OpenTSDB
o 26 : Time series data
o 27: Sources of time series
o 28: OpenTSDB architecture
o 29: Inserting the time series data
o 30: Querying TS data
o 31: Aggregation
o Dashboard for application monitoring
Advanced Big Data – 1 (SPARK) XX Hours
1 : Introduction to Spark
Introduction to Apache Spark
Streaming Data Vs. In Memory Data
Map Reduce Vs. Spark
Modes of Spark
Spark Installation Demo
Overview of Spark on a cluster
Spark Standalone Cluster
Invoking Spark Shell
Creating the Spark Context
Loading a File in Shell
Performing Some Basic Operations on Files in Spark Shell
Caching Overview
Distributed Persistence
Spark Streaming Overview(Example: Streaming Word Count)
2: Data Processing with Spark
Analyze Hive and Spark SQL Architecture
Analyze Spark SQL
Context in Spark SQL
Implement a sample example for Spark SQL
Integrating hive and Spark SQL
Support for JSON and Parquet File Formats
Implement Data Visualization in Spark
Loading of Data
3. Data Processing with Spark using Hive
Hive Queries through Spark
Analyze Hive queries
Implementing sentiment analysis with Hive
Performance Tuning Tips in Spark
Shared Variables: Broadcast Variables & Accumulators
4: Spark graph
Basic graph analysis
GraphFrames API
GraphFrames motif finding
Persisting graph data
GraphFrames ETL
GraphFrames Property Graph analysis
Project – Social network analysis
Books & References
Workbook designed & developed for the PGDBDA
Learning Spark, Matei Zaharia, O’Reilly, 2015
Advanced Big Data – 2 (SPARK – Streaming) XX Hours
1. Introduction to Spark streaming
Architecture and Components of Spark and Spark Streaming
Batch versus real-time data processing
Architecture of Spark
Architecture of Spark Streaming
First Spark Streaming program
2. Processing Distributed Log Files in Real Time
Log files – structure and formats
Spark packaging structure and client APIs
Resilient distributed datasets and discretized streams
Data loading from distributed and varied sources
Load log files
Computing metrics and presentation
3. Applying Transformations to Streaming Data
Understanding and applying transformation functions
Basic transformations
Advanced transformations
Data pipeline for real time computing
Performance tuning
4. Persisting Log Analysis Data
Output operations in Spark Streaming
Integration with Cassandra
Integration with Advanced Spark Libraries
Querying streaming data in real time
Summary
5. Deploying in Production
Spark deployment models
High availability and fault tolerance
Monitoring streaming jobs
Summary
Reference books:
Learning Real-time Processing with Spark Streaming, Sumit Gupta, Packt Publishing, September 2015
Advanced Big Data – 3 (SPARK – Machine Learning) XX Hours
1: Introduction
Machine learning: goals, results, supervised/unsupervised – Spark as a tool for Big Data –
Python as the language of Spark – Spark structures for machine learning – sample use cases –
life cycle of machine learning – data preparation for machine learning – binning – outlier
treatment – missing values – feature selection
2: Linear methods
Linear regression – linear relations – normal distribution – heteroscedasticity - dummy variables
– correlation – regression coefficients - Use case: financial modelling
Logistic Regression – Odds – Log odds – linear relations – assumptions in logistic regression –
L1/L2 regularization - Use case: healthcare prediction
SVM (Support Vector Machines) – risk boundaries – hyperparameters – Vapnik space – SVM
classifier – SVM regression – linear SVM – non-linear SVM – single class SVM - Use case:
anomaly detection
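For the odds and log odds topics listed under Logistic Regression in this section, the following sketch shows how the two quantities relate to a probability. The function names are chosen for this example; the relations themselves (odds = p / (1 - p), log odds = ln(odds), and the sigmoid as the inverse of the log odds) are standard.

```python
import math


def odds(p):
    """Odds in favour of an event with probability p, for 0 < p < 1."""
    return p / (1.0 - p)


def log_odds(p):
    """Logit: natural logarithm of the odds."""
    return math.log(odds(p))


def sigmoid(z):
    """Inverse of the logit: maps a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Illustrative probability only, not course data.
    p = 0.8
    print(odds(p), log_odds(p), sigmoid(log_odds(p)))
```

Logistic regression models the log odds as a linear function of the inputs, which is why the sigmoid appears when converting a fitted score back to a probability.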
3: Non-linear and independent methods
Naive Bayes – probability – priors – conditional probability – posterior probability – discrete
variables – continuous variables - Use case: spam filtering
Decision Trees – selection of variable – impurity measures – entropy – Gini – chi-square –
splitting variables - Use case: Diabetes diagnosis
Random forests – diversity metrics – voting – bagging – boosting – random trees - Use case:
Credit scoring
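The impurity measures named under Decision Trees in this section can be computed directly from the class labels at a node. The sketch below is a minimal illustration, assuming a non-empty list of labels; the helper names and the sample labels are hypothetical, and entropy is measured in bits (log base 2).

```python
import math
from collections import Counter


def class_proportions(labels):
    """Fraction of the node's samples that fall in each class."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def entropy(labels):
    """Shannon entropy in bits; zero when the node is pure."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def gini(labels):
    """Gini impurity; zero when the node is pure."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


if __name__ == "__main__":
    # Made-up labels for a node with an even two-class split.
    node = ["yes", "yes", "no", "no"]
    print(entropy(node), gini(node))
```

A tree-building algorithm would evaluate such a measure before and after each candidate split and prefer the split that reduces impurity the most.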
4: Unsupervised methods
Clustering (K-Means) – distance metrics – Euclidean – City block – power distance – similarity
metrics – dissimilarity metrics – number of clusters – elbow criterion – cluster validity –
applications of clustering - Use case: topic grouping
Principal Component Analysis (PCA) – dimension reduction – covariance matrix – linear
combinations – new variables – assumptions in PCA – non linear PCA - Use case: stock analysis
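The Euclidean and city block distance metrics listed under Clustering (K-Means) in this section are both special cases of the Minkowski distance. The sketch below uses that general form as an assumption for illustration; the syllabus's "power distance" may be defined slightly differently in the course material, and the function names are hypothetical.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def euclidean(a, b):
    """Straight-line distance: Minkowski distance with p = 2."""
    return minkowski(a, b, 2)


def city_block(a, b):
    """Manhattan distance: Minkowski distance with p = 1."""
    return minkowski(a, b, 1)


if __name__ == "__main__":
    # Two made-up points in the plane.
    print(euclidean((0, 0), (3, 4)), city_block((0, 0), (3, 4)))
```

K-Means as usually defined minimizes squared Euclidean distance to the cluster centres; swapping in another metric changes the algorithm's behaviour and is shown here only to compare the metrics themselves.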
5: Advanced applications of big data
Recommendation (Collaborative filtering) – Basket analysis – affinity modeling – item based
recommendation – user based recommendation – slope one recommendation – use case: Amazon-like
product recommendation