Course Code: 20PITE54J
Course Name: BIGDATA FOR MACHINE LEARNING
Course Category: E (Professional Elective)
L T P C: 3 0 2 4
Pre-requisite Courses | Co-requisite Courses | Progressive Courses
Nil | Nil
Course Offering Department: Information Technology
Data Book / Codes/Standards: Nil
Course Learning Rationale (CLR): The purpose of learning this course is to:
CLR-1 : Utilize the Hadoop architecture and its use cases
CLR-2 : Create mapper and reducer functions to build Hadoop applications
CLR-3 : Understand key design considerations for data ingress and egress tools in Hadoop
CLR-4 : Review the MongoDB Aggregation framework
CLR-5 : Infer the different kinds of ecosystem tools in Hadoop
Learning columns: 1 = Level of Thinking (Bloom), 2 = Expected Proficiency (%), 3 = Expected Attainment (%)
Program Learning Outcomes (PLO) 1 to 15 cover (in no particular order): Scientific Reasoning, Reflective Thinking, Life Long Learning, Ethical Reasoning, Critical Thinking, Leadership Skills, Problem Solving, Research Skills, Self-Directed Learning, Multicultural Competence, Community Engagement, Team Work, Disciplinary Knowledge, ICT Skills, Analytical Reasoning
Course Learning Outcomes (CLO): At the end of this course, learners will be able to:
Columns: CLO | Level of Thinking (Bloom) | Expected Proficiency (%) | Expected Attainment (%) | PLO 1 to 15
CLO-1 : Understand Hadoop architecture and its business implications | 1 | 80 | 70 | L H - H L - - - L L - H - - -
CLO-2 : Build reliable, scalable distributed systems with Apache Hadoop | 1 | 85 | 75 | M H M M H - - - M L - H - - -
CLO-3 : Import and export data into the Hadoop Distributed File System | 2 | 75 | 70 | M H H H M - - - M L - H - - -
CLO-4 : Interpret MongoDB design goals and set up a MongoDB environment | 2 | 85 | 80 | M H M H M - - - M L - H - - -
CLO-5 : Develop Big Data solutions using Hadoop ecosystem tools | 3 | 85 | 75 | H H M H H - - - M L - H - - -
Duration (hour) | 15 | 15 | 15 | 15 | 15
S-1 SLO-1 | Basics of data and what is Big Data; applications of Big Data | Blocks and replication management, HDFS architecture | Data ingestion into Big Data; what is data ingestion? | Intro to PyMongo; install PyMongo, the Python driver | PySpark ML: preprocess data
S-1 SLO-2 | Big Data requirement for traditional | Data and distributed storage (HDFS) | Sources of data which can be ingested into the environment | Steps to connect to MongoDB | Model training
S-2 SLO-1 | Data warehousing and BI space, Big Data solutions | HDFS Federation | Sqoop introduction, need for Sqoop | PyMongo basic operations | Hyperparameter training and AutoML
S-2 SLO-2 | What is a distributed file system | What is Name node and Data node, Name node high availability | Where can we use Sqoop, import and export syntaxes in Sqoop | Perform basic Create, Retrieve, Update and Delete (CRUD) operations using PyMongo | Inference of model
S-3 SLO-1 | Characteristics of Big Data and dimensions of scalability | Component failures and recoveries | Incremental imports in Sqoop | One end-to-end tutorial showing installation, data loading, processing | Deploy the model
S-3 SLO-2 | Applications of Big Data | Basic Hadoop shell commands implementation | Importing data into Hive using Sqoop, case study on Sqoop | Introduction to Spark | Serve the model
S-4 to S-5 (SLO-1, SLO-2) | Tutorial 1: Programs in MapReduce | Tutorial 4: Hadoop command hands-on | Tutorial 7: Case study | Tutorial 10: PyMongo hands-on | Tutorial 13: Hands-on PySpark and various examples on Spark
S-6 SLO-1 | Historical concepts of Hadoop; where Hadoop is used | Features of Hadoop 2.0 | Flume, introduction to ingesting data into Big Data platforms using Flume | Spark architecture | Model inference
S-6 SLO-2 | Apache Hadoop: introduction to Hadoop | The HDFS sink | Application of data ingestion | PySpark and Databricks | Deployment of the model
S-7 SLO-1 | Distributed computing environment, what Hadoop is and why it is important | Partitioning and interceptors | Introduction to Flume, need for Flume | Case study | Export the model
S-7 SLO-2 | Hadoop comparison with traditional systems | Different file formats used | Flume architecture: event, source, channel and sink | Introduction to Spark SQL | Kafka, data streaming
S-8 SLO-1 | Data and types of data | Anatomy of file write | Demo: data ingestion using Flume | Basics of Spark SQL as an ETL tool | What is Kafka and its architecture?
S-8 SLO-2 | Structured, unstructured, semi-structured and quasi-structured data | Anatomy of file read | Case study | Case study on Spark SQL performance tuning | Connect to KSQL or SQL or Python for analytics
S-9 to S-10 (SLO-1, SLO-2) | Tutorial 2: HDFS commands | Tutorial 5: HDFS commands (reading and loading files) | Tutorial 8: Using Sqoop and Flume | Tutorial 11: Spark SQL | Tutorial 14: Implementing Spark MLlib examples
S-11 SLO-1 | HDFS design system | Intro to Hive, Hive architecture | Introduction to MongoDB, understanding the ecosystem of MongoDB | Case study | Twitter -> Kafka -> Spark streaming -> Analytics
S-11 SLO-2 | Different HDFS shell commands | Query submission in Hive | Limitations of RDBMS | PySpark and Azure Databricks (Free) | Case study
SRM Institute of Science and Technology - Academic Curricula – (M.Tech Regulations 2020) 45
S-12 SLO-1 | File formats supported | Hive basic operations | Why NoSQL? Business use cases of NoSQL | PySpark ML basics | Example using Twitter data - MongoDB - Kafka - PySpark/ADB
S-12 SLO-2 | Hadoop main components with a diagram | Creating a table and loading data from HDFS | Why choose MongoDB and its advantages? Explore MongoDB collections and documents | PySpark ML: walk-through and pricing details | Twitter API (access, token)
S-13 SLO-1 | HDFS overview and design | Internal and external table | Create a free hosted MongoDB database using MongoDB Atlas; working with MongoDB | PySpark ML: instance setup and stopping | Using MongoDB and examples of MongoDB
S-13 SLO-2 | MapReduce: Python-based program | HQL bucketing and partitioning in Hive, case study on Hive | Case study on MongoDB: hands-on | PySpark ML: load the data | Implementing PyMongo, analytics, case study
S-14 to S-15 (SLO-1, SLO-2) | Tutorial 3: Implementing HDFS shell commands and Python-based MapReduce programs | Tutorial 6: Hive commands | Tutorial 9: MongoDB | Tutorial 12: Spark MLlib examples | Tutorial 15: Streaming using Kafka
Learning Resources
1. Big Data Analytics, Wiley & SAS Business Series
2. Simon, P., & Dexter, S. (2018). Too big to ignore: The business case for big data.
3. Baesens, B. (2014). Analytics in a big data world: The essential guide to data science and its applications.
4. Manoochehri, M. (2014). Data just right: Introduction to large-scale data & analytics.
Continuous Learning Assessment (CLA) (60% weightage) and Final Examination (40% weightage)
Bloom’s Level of Thinking | CLA-1 (20%) Theory | CLA-1 (20%) Practice | CLA-2 (25%) Theory | CLA-2 (25%) Practice | #CLA-3 (15%) | Final Examination Theory | Final Examination Practice
Level 1 (Remember, Understand) | 20% | 20% | 15% | 15% | 20% | 15% | 10%
Level 2 (Apply, Analyze) | 20% | 20% | 15% | 15% | 40% | 20% | 20%
Level 3 (Evaluate, Create) | 10% | 10% | 20% | 20% | 40% | 15% | 20%
Total | 100% (CLA-1) | 100% (CLA-2) | 100% (#CLA-3) | 100% (Final Examination)
#CLA-3 will be a Self-Learning Component and is generally a combination from among one or more of these options:
Assignments Surprise Tests Seminars Multiple Choice Quizzes
Tech. Talks Field Visits Self-Study NPTEL/MOOC/Swayam
Mini-Projects Case-Study Group Activities Online Certifications
Presentations Debates Conference Papers Group Discussions
Course Designers
Experts from Industry: Ms Leena Shibu, Data Scientist, Great Learning
Experts from Higher Technical Institutions:
Internal Experts: Dr. N. Arunachalam, SRMIST