BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI
WORK INTEGRATED LEARNING PROGRAMMES
Digital
Part A: Content Design
Course Title STREAM PROCESSING AND ANALYTICS
Course No(s) DSECL ZC556
Credit Units 5
Credit Model
Content Authors SURYA PRAKASH G
Course Description
Data now moves at a very rapid pace, which has created the need for scalable systems capable of
processing and analyzing this fast, streaming data. This course introduces students to
the architecture of streaming data processing systems. It also enables students to understand
the complete end-to-end solution for cost-effective analysis and visualization of streaming data with the
help of various open-source solutions available in this space, and to learn
the implementation and application of the algorithms and data structures required for streaming
applications. Advanced streaming applications such as Streaming SQL and Streaming Machine Learning will
be discussed in detail.
Course Objectives
No
CO1 To introduce the applications of streaming data systems
CO2 To introduce the architecture of streaming data systems
CO3 To introduce the algorithmic techniques used in streaming data systems
CO4 To present a survey of tools and techniques required for streaming data analytics
Text Book(s)
T1 Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data, Byron Ellis, 2014, Wiley
http://www-di.inf.puc-rio.br/~endler/courses/RT-Analytics/transp/Books/Real-Time%20Analytics%20Techniques%20to%20Analyze.pdf
T2 Streaming Data: Understanding The Real-Time Pipeline, Andrew G. Psaltis, 2017,
Manning Publications
Reference Book(s) & other resources
R1 Big Data – Principles and best practices of scalable real-time data systems,
Nathan Marz, James Warren, 2017, Manning Publications
R2 Designing Data-Intensive Applications, Martin Kleppmann, O’Reilly
Learning Outcomes:
No Learning Outcomes
LO1 Understand the components of streaming data systems with their capabilities and
characteristics
LO2 Learn the relevant architecture and best practices for processing and analysis of
streaming data
LO3 Gain knowledge about the development of systems for data aggregation, delivery
and storage using open-source tools
LO4 Become familiar with advanced streaming applications like Streaming SQL and
Streaming Machine Learning
Part B: Learning Plan
Academic Term
Course Title STREAM PROCESSING AND ANALYTICS
Course No
Lead Instructor
Glossary of Terms
Module M Module is a standalone quantum of designed content. A typical course is
delivered using a string of modules. M2 means module 2.
Contact Hour CH Contact Hour (CH) stands for an hour-long live session with students
conducted either in a physical classroom or enabled through
technology. In this model of instruction, instructor-led sessions will
be for 32 CH.
Recorded Lecture RL RL stands for Recorded Lecture or Recorded Lesson. It is presented to the
student through an online portal. A given RL unfolds as a sequence of
video segments interleaved with exercises.
Lab Exercises LE Lab exercises associated with various modules
Self-Study SS Specific content assigned for self-study
Homework HW Specific problems/design/lab exercises assigned as homework
Modular Structure
No. Title of the Module
M1 Scalable Streaming Data Systems
M2 Streaming Data Systems Architecture
M3 Streaming Data Frameworks
M4 Streaming Analytics
M5 Advanced Streaming Applications
Detailed Lecture Plan
M1: Scalable Streaming Data Systems
Session 1 to 3 / Contact Hour 1 - 6
Time Type Description/Plan Reference
Session 1 CH1 • Thinking about Data Systems R1 Ch1
• Reliable, Scalable and Maintainable Data Applications
• Properties of Data R2 Ch2
CH2 • Scaling with the traditional databases R2 Ch1
• Big Data Systems
• Desired properties of Big Data Systems
Session 2 CH3 • Data Model for Big Data R2 Ch2
• Generalized Big Data System Architecture Class Notes
CH4 • Real time systems T2 Ch1
• Difference between Batch Processing and Stream Processing Class Notes
• Difference between real time and streaming systems
Session 3 CH5 • Streaming Data Applications Class Notes
• Databases and Streams R1 Ch11
• Usage patterns of Streaming Data Class Notes
CH6 • Sources of Streaming Data T1 Ch1
• Complex Event Processing Systems Class Notes
Post CH SS • Explore more on the non-functional requirements of Data-Intensive
Applications
✓ Non-functional Requirements for Real World Big Data Systems
✓ IBM Big Data & Analytics RA_V1
• Explore more on the differences between the batch processing and
streaming data applications
✓ Batch vs Real time data processing
• Identify the use cases of Complex Event Processing Systems
✓ What is stream processing?
✓ complex-event-processing
M2: Streaming Data Systems Architecture
Session 4 to 8 / Contact Hour 7 - 16
Time Type Description/Plan Reference
Session 4 CH7 • Generalized Streaming Data Architecture T2 Ch 1
CH8 • Lambda Architecture T2 Ch 2
• Kappa Architecture Class Notes
• Streaming Data System Components T1 Ch2
• Features of Real time Architecture
• A real time architecture checklist
Session 5-6 CH9 • Service Configuration and Coordination Systems T1 Ch2
• Maintaining the state
CH 10 • Apache ZooKeeper T1 Ch3
• Data Flow Manager T1 Ch4
• Managing distributed data flows with Apache Kafka Kafka Docs
CH 11 • Kafka Fundamentals Overview T1 Ch4
CH 12 • Use-Cases and applications T1 Ch4
• Architecture Kafka Docs
• Kafka Topics, Producer and Consumer Using CLI
• Programming Kafka
• Simple Kafka Producer
• Simple Kafka Consumer
• Producer, Consumer Configuration
• Producer, Consumer Execution
• Kafka Consumer Groups
Session 7-8 CH13 • Streaming Data Processor Concepts T1 Ch 5
• Timing Concepts T2 Ch 5
CH14 • Windowing T2 Ch5
• Joins R1 Ch11
CH15 • Storage for Streaming Data T1 Ch6
• NoSQL storage Systems
CH16 • Choosing a Storage technology T1 Ch7
• Delivery of Streaming Metrics
Post CH SS • Explore in detail the issues with Lambda Architecture
✓ questioning-the-lambda-architecture
✓ a-brief-introduction-to-two-data-processing-architectures
• Explore the Java APIs exposed by following systems
✓ Apache ZooKeeper
✓ Apache Kafka
• Explore the data models of NoSQL data systems
✓ MongoDB
✓ Cassandra
Self study on other frameworks
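As a companion to the Windowing topic listed under CH14, the following is a minimal sketch of a tumbling (fixed, non-overlapping) event-time window count in plain Python. The event format, the 10-second window size, and the function name tumbling_window_counts are illustrative assumptions and are not taken from the prescribed texts; frameworks such as Spark or Kafka expose their own windowing APIs.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size):
    """Count events per key in fixed, non-overlapping event-time windows.

    events: iterable of (timestamp_seconds, key) pairs (hypothetical format).
    window_size: window length in seconds.
    Returns {window_start: {key: count}} ordered by window start.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for timestamp, key in events:
        # Each event belongs to exactly one window, identified by its start time.
        window_start = (timestamp // window_size) * window_size
        windows[window_start][key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


# Hypothetical sample stream: (event time in seconds, event type).
sample_events = [(1, "click"), (4, "view"), (9, "click"), (12, "click"), (27, "view")]
result = tumbling_window_counts(sample_events, window_size=10)
print(result)
```

This sketch processes a finite list; a real stream processor would also have to decide when a window is complete and how to treat late-arriving events, which is where the Timing Concepts of CH13 come in.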
M3: Streaming Data Frameworks
Session 9 to 12 / Contact Hour 17 - 24
Time Type Description/Plan Reference
Session 9 CH 17 • Key features of Streaming Data Frameworks Class Notes
• Survey of Streaming Data Systems
CH 18 • Apache Spark Streaming Spark Streaming Guide
SELF Exploration/Assignment on the following
• Apache Flink Flink Docs
• Apache Samza Samza Docs
• Apache Kafka Streaming Kafka Streaming Guide
• Apache Storm Storm Docs
Session 10 CH 19 • Apache Spark Streaming Spark Streaming Guide
CH 20 • Spark Streaming fundamentals
• Motivation
• Difference between Spark Streaming API and Spark API
• Architecture
• Components of Spark Engine
• Spark Application Architecture
• Fault Tolerance
• Comparison with Traditional Streaming Systems
Session 11 CH 21
CH 22 • Spark + Kafka integration Spark Streaming Guide
Session 12 CH 23 • Structured Streaming Structured Streaming Docs
CH 24 • Developing applications on the Databricks platform Class Notes
Post CH SS • Compare the different streaming data platforms and
identify the use cases for which they are suitable
• Implement the streaming data pipeline using the Kafka Streaming library Kafka Streaming Guide
• Implement a streaming data application with Spark Streaming Spark Streaming Guide
M4: Streaming Analytics
Session 13 to 14 / Contact Hour 25 - 28
Time Type Description/Plan Reference
Session 13 CH 25 • Exact Aggregation of Streaming Data T1 Ch 8
CH 26 • Time Series Analysis T1 Ch 8
Session 14 CH 27 • Registers and Hash Functions T1 Ch 10
• The Bloom Filter
CH 28 • Distinct Value Sketches T1 Ch 10
• The Count-Min Sketch
Post CH SS • Study illustrations for Streaming data concepts Class Notes
• Explore algorithms for aggregation of streaming data
• Explore more about the streaming data processing
algorithms for exact results
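To accompany the Bloom Filter and Count-Min Sketch topics of Session 14, here is a minimal, unoptimized sketch of both structures in plain Python. The class names, the default sizes, and the salted SHA-256 hashing scheme are assumptions chosen for readability; they are not the implementations given in T1, and production code would normally use faster non-cryptographic hash functions and compact bit arrays.

```python
import hashlib


def _bucket_indexes(item, count, modulus):
    """Yield `count` bucket indexes for `item`, one per salted SHA-256 digest."""
    for seed in range(count):
        digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % modulus


class BloomFilter:
    """Approximate set membership: false positives possible, no false negatives."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def add(self, item):
        for index in _bucket_indexes(item, self.num_hashes, self.size):
            self.bits[index] = True

    def might_contain(self, item):
        # True only if every bit selected by the hashes has been set.
        return all(self.bits[i]
                   for i in _bucket_indexes(item, self.num_hashes, self.size))


class CountMinSketch:
    """Approximate frequency counts: estimates never fall below the true count."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def add(self, item, count=1):
        # One counter per row is incremented, chosen by that row's hash.
        for row, col in enumerate(_bucket_indexes(item, self.depth, self.width)):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate counters, so the minimum is the best estimate.
        return min(self.table[row][col]
                   for row, col in enumerate(_bucket_indexes(item, self.depth, self.width)))


# Hypothetical sample stream of item identifiers.
stream = ["a", "b", "a", "c", "a", "b"]
bloom = BloomFilter()
sketch = CountMinSketch()
for element in stream:
    bloom.add(element)
    sketch.add(element)

print(bloom.might_contain("a"))   # True: added items are always reported
print(sketch.estimate("a") >= 3)  # True: estimates are upper bounds on true counts
```

Both structures use memory that is fixed in advance rather than growing with the stream, and they pay for that with one-sided error: the Bloom filter may report an item it never saw, and the Count-Min Sketch may overestimate a count.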
M5: Advanced Streaming Applications
Session 15 to 16 / Contact Hour 29 - 32
Time Type Description/Plan Reference
Session 15 CH29 • Necessity of Streaming SQL Streaming SQL Blog
• Streaming SQL: Windows
• Streaming SQL: Joins
• Streaming SQL: Patterns
CH30 • Streaming SQL for Apache Kafka Kafka Streaming SQL
• KSQL
Session 16 CH 31 • Streaming Analytics with Cloud Kinesis Docs
• AWS Kinesis
CH 32 • Data Streams Databricks docs
• Data Firehose
• Data Analytics
• AWS IoT / Streaming Analytics Service Azure Docs
• Channels, Pipelines
• Data stores & data sets
• Streaming ML Frameworks Class notes
Post CH SS • Get familiarized with Streaming SQL tools
✓ Kafka Streaming SQL
• Build and deploy machine learning models using Spark
structured streaming
✓ structured-streaming-ml
Evaluation Scheme:
Legend: EC = Evaluation Component; AN = Afternoon Session; FN = Forenoon Session
No Name Type Duration Weight Day, Date, Session, Time
EC-1 Assignment-1 Take home 10 days 10% TBA
Assignment-2 Take home 15 days 15% TBA
Quiz-1 Online 1 day 5% TBA
EC-2 Mid-Semester Exam Closed Book 2 hours 30% TBA
EC-3 Comprehensive Exam Open Book 3 hours 40% TBA
Notes:
Syllabus for Mid-Semester Test (Closed Book): Topics in Session Nos. 1 to 8 (contact hours 1 to 16)
Syllabus for Comprehensive Exam (Open Book): All topics
Important links and information:
Elearn portal: https://elearn.bits-pilani.ac.in
Students are expected to visit the Elearn portal on a regular basis and stay up to date with the
latest announcements and deadlines.
Contact sessions: Students should attend the online lectures as per the schedule provided on the
Elearn portal.
Evaluation Guidelines:
1. EC-1 consists of either two Assignments or three Quizzes. Students will attempt them
through the course pages on the Elearn portal. Announcements will be made on the
portal, in a timely manner.
2. For Closed Book tests: No books or reference material of any kind will be permitted.
3. For Open Book exams: Use of books and any printed / written reference material (filed
or bound) is permitted. However, loose sheets of paper will not be allowed. Use of
calculators is permitted in all exams. Laptops/Mobiles of any kind are not allowed.
Exchange of any material is not allowed.
4. If a student is unable to appear for the Regular Test/Exam due to genuine exigencies,
the student should follow the procedure to apply for the Make-Up Test/Exam which
will be made available on the Elearn portal. The Make-Up Test/Exam will be conducted
only at selected exam centres on the dates to be announced later.
It shall be the responsibility of the individual student to be regular in maintaining the self-study
schedule as given in the course handout, attend the online lectures, and take all the prescribed
evaluation components such as Assignment/Quiz, Mid-Semester Test and Comprehensive
Exam according to the evaluation scheme provided in the handout.