StreamProcessingAndAnalytics Handout

This course introduces students to streaming data processing systems and analytics. It covers topics like streaming data architectures, frameworks, analytics techniques and applications. Students will learn about components of streaming systems, architectures like Lambda and Kappa, frameworks like Kafka and tools for processing, analyzing and storing streaming data.



BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI

WORK INTEGRATED LEARNING PROGRAMMES


Digital
Part A: Content Design

Course Title     STREAM PROCESSING AND ANALYTICS
Course No(s)     DSECL ZC556
Credit Units     5
Credit Model
Content Authors  SURYA PRAKASH G

Course Description

Data is moving at a very rapid pace, which has created the need for scalable systems capable of
processing and analyzing this fast, streaming data. This course introduces students to
the architecture of streaming data processing systems. It also enables students to understand
a complete end-to-end solution for cost-effective analysis and visualization of streaming data with the
help of the various open source solutions available in this space. In addition, students learn
the implementation and application of the algorithms and data structures required for streaming
applications. Advanced streaming applications such as Streaming SQL and Streaming Machine Learning
are discussed at length.

Course Objectives

No

CO1 To introduce the applications of streaming data systems

CO2 To introduce the architecture of streaming data systems

CO3 To introduce the algorithmic techniques used in streaming data systems

CO4 To present a survey of tools and techniques required for streaming data analytics

Text Book(s)

T1 Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data, Byron Ellis, 2014, Wiley
http://www-di.inf.puc-rio.br/~endler/courses/RT-Analytics/transp/Books/Real-Time%20Analytics%20Techniques%20to%20Analyze.pdf

T2 Streaming Data: Understanding the Real-Time Pipeline, Andrew G. Psaltis, 2017, Manning Publications

Reference Book(s) & other resources

R1 Big Data – Principles and best practices of scalable real-time data systems, Nathan Marz, James Warren, 2017, Manning Publications

R2 Designing Data-Intensive Applications, Martin Kleppmann, O’Reilly

Learning Outcomes:

No Learning Outcomes

LO1 Understand the components of streaming data systems, with their capabilities and
characteristics

LO2 Learn the relevant architectures and best practices for processing and analysis of
streaming data

LO3 Gain knowledge of developing systems for data aggregation, delivery
and storage using open source tools

LO4 Become familiar with advanced streaming applications such as Streaming SQL and
Streaming Machine Learning

Part B: Learning Plan

Academic Term
Course Title STREAM PROCESSING AND ANALYTICS
Course No
Lead Instructor

Glossary of Terms

Module            M   A module is a standalone quantum of designed content. A typical course is
                      delivered using a string of modules. M2 means module 2.

Contact Hour      CH  A Contact Hour (CH) is an hour-long live session with students,
                      conducted either in a physical classroom or enabled through
                      technology. In this model of instruction, instructor-led sessions will
                      be for 32 CH.

Recorded Lecture  RL  RL stands for Recorded Lecture or Recorded Lesson. It is presented to the
                      student through an online portal. A given RL unfolds as a sequence of
                      video segments interleaved with exercises.

Lab Exercises     LE  Lab exercises associated with various modules

Self-Study        SS  Specific content assigned for self-study

Homework          HW  Specific problems/design/lab exercises assigned as homework

Modular Structure

No. Title of the Module


M1 Scalable Streaming Data Systems
M2 Streaming Data Systems Architecture
M3 Streaming Data Frameworks
M4 Streaming Analytics
M5 Advanced Streaming Applications

Detailed Lecture Plan

M1: Scalable Streaming Data Systems

Session 1 to 3 / Contact Hour 1 - 6

Time         Type  Description/Plan                                              Reference

Session 1    CH1   • Thinking about Data Systems                                 R1 Ch1
                   • Reliable, Scalable and Maintainable Data Applications
                   • Properties of Data                                          R2 Ch2

             CH2   • Scaling with the traditional databases                      R2 Ch1
                   • Big Data Systems
                   • Desired properties of Big Data Systems

Session 2    CH3   • Data Model for Big Data                                     R2 Ch2
                   • Generalized Big Data System Architecture                    Class Notes

             CH4   • Real time systems                                           T2 Ch1
                   • Difference between Batch processing and Stream Processing   Class Notes
                   • Difference between real time and streaming systems

Session 3    CH5   • Streaming Data Applications                                 Class Notes
                   • Databases and Streams                                       R1 Ch11
                   • Usage patterns of Streaming Data                            Class Notes

             CH6   • Sources of Streaming Data                                   T1 Ch1
                   • Complex Event Processing Systems                            Class Notes

Post CH      SS    • Explore more on the non-functional requirements of Data Intensive
                     Applications
                     ✓ Non-functional Requirements for Real World Big Data Systems
                     ✓ IBM Big Data & Analytics RA_V1
                   • Explore more on the differences between batch processing and
                     streaming data applications
                     ✓ Batch vs Real time data processing
                   • Identify the use cases of Complex Event Processing Systems
                     ✓ What is stream processing?
                     ✓ complex-event-processing

M2: Streaming Data Systems Architecture

Session 4 to 8 / Contact Hour 7 - 16

Time         Type  Description/Plan                                              Reference

Session 4    CH7   • Generalized Streaming Data Architecture                     T2 Ch1
             CH8   • Lambda Architecture                                         T2 Ch2
                   • Kappa Architecture                                          Class Notes
                   • Streaming Data System Components                            T1 Ch2
                   • Features of Real time Architecture
                   • A real time architecture checklist

Session 5-6  CH9   • Service Configuration and Coordination Systems              T1 Ch2
                   • Maintaining the state
             CH10  • Apache ZooKeeper                                            T1 Ch3
                   • Data Flow Manager                                           T1 Ch4
                   • Managing distributed data flows with Apache Kafka           Kafka Docs

             CH11  • Kafka Fundamentals Overview                                 T1 Ch4

             CH12  • Use-Cases and applications                                  T1 Ch4
                   • Architecture                                                Kafka Docs
                   • Kafka Topics, Producer and Consumer Using CLI
                   • Programming Kafka
                   • Simple Kafka Producer
                   • Simple Kafka Consumer
                   • Producer, Consumer Configuration
                   • Producer, Consumer Execution
                   • Kafka Consumer Groups

Session 7-8  CH13  • Streaming Data Processor Concepts                           T1 Ch5
                   • Timing Concepts                                             T2 Ch5

             CH14  • Windowing                                                   T2 Ch5
                   • Joins                                                       R1 Ch11
             CH15  • Storage for Streaming Data                                  T1 Ch6
                   • NoSQL storage Systems

             CH16  • Choosing a Storage technology                               T1 Ch7
                   • Delivery of Streaming Metrics

Post CH      SS    • Explore in detail the issues with Lambda Architecture
                     ✓ questioning-the-lambda-architecture
                     ✓ a-brief-introduction-to-two-data-processing-architectures
                   • Explore the Java APIs exposed by the following systems
                     ✓ Apache ZooKeeper
                     ✓ Apache Kafka
                   • Explore the data models of NoSQL data systems
                     ✓ MongoDB
                     ✓ Cassandra
                   Self study on other frameworks

M3: Streaming Data Frameworks

Session 9 to 12 / Contact Hour 17 - 24

Time         Type  Description/Plan                                              Reference

Session 9    CH17  • Key features of Streaming Data Frameworks                   Class Notes
                   • Survey of Streaming Data Systems

             CH18  • Apache Spark Streaming                                      Spark Streaming Guide
                   SELF Exploration/Assignment on the following
                   • Apache Flink                                                Flink Docs
                   • Apache Samza                                                Samza Docs
                   • Apache Kafka Streaming                                      Kafka Streaming Guide
                   • Apache Storm                                                Storm Docs

Session 10   CH19  • Apache Spark Streaming                                      Spark Streaming Guide

             CH20  • Spark Streaming fundamentals
                   • Motivation
                   • Difference between Spark Streaming API and Spark API
                   • Architecture
                   • Components of Spark Engine
                   • Spark Application Architecture
                   • Fault Tolerance
                   • Comparison with Traditional Streaming Systems

Session 11   CH21                                                                Spark Streaming Guide
             CH22  • Spark + Kafka integration

Session 12   CH23  • Structured Streaming                                        Structured Streaming Docs
                   • Developing application in Databricks platform
             CH24                                                                Class Notes

Post CH      SS    • Compare the different streaming data platforms and
                     identify the use cases for which they are suitable
                   • Implement the streaming data pipeline using the Kafka       Kafka Streaming Guide
                     Streaming library
                   • Implement a streaming data application with Spark           Spark Streaming Guide
                     Streaming

M4: Streaming Analytics

Session 13 to 14 / Contact Hour 25 - 28

Time         Type  Description/Plan                                              Reference

Session 13   CH25  • Exact Aggregation of Streaming Data                         T1 Ch8
                   • Time Series Analysis
             CH26                                                                T1 Ch8

Session 14   CH27  • Registers and Hash Functions                                T1 Ch10
                   • The Bloom Filter

             CH28  • Distinct Value Sketches                                     T1 Ch10
                   • The Count-Min Sketch

Post CH      SS    • Study illustrations for Streaming data concepts             Class Notes
                   • Explore algorithms for aggregation of streaming data
                   • Explore more about the streaming data processing
                     algorithms for exact results

M5: Advanced Streaming Applications

Session 15 to 16 / Contact Hour 29 - 32

Time         Type  Description/Plan                                              Reference

Session 15   CH29  • Necessity of Streaming SQL                                  Streaming SQL Blog
                   • Streaming SQL: Windows
                   • Streaming SQL: Joins
                   • Streaming SQL: Patterns
             CH30  • Streaming SQL for Apache Kafka                              Kafka Streaming SQL
                   • KSQL

Session 16   CH31  • Streaming Analytics with Cloud                              Kinesis Docs
                   • AWS Kinesis
             CH32  • Data Streams                                                Databricks docs
                   • Data Firehose
                   • Data Analytics
                   • AWS IoT / Streaming Analytics Service                       Azure Docs
                   • Channels, Pipelines
                   • Data stores & data sets
                   • Streaming ML Frameworks                                     Class notes

Post CH      SS    • Get familiarized with Streaming SQL tools
                     ✓ Kafka Streaming SQL
                   • Build and deploy machine learning models using Spark
                     Structured Streaming
                     ✓ structured-streaming-ml

Evaluation Scheme:

Legend: EC = Evaluation Component; AN = Afternoon Session; FN = Forenoon Session


No    Name                Type         Duration  Weight  Day, Date, Session, Time
EC-1  Assignment-1        Take home    10 days   10%     TBA
      Assignment-2        Take home    15 days   15%     TBA
      Quiz-1              Online       1 day     5%      TBA
EC-2  Mid-Semester Exam   Closed Book  2 hours   30%     TBA
EC-3  Comprehensive Exam  Open Book    3 hours   40%     TBA

Notes:
Syllabus for Mid-Semester Test (Closed Book): Topics in Session Nos. 1 to 8 (contact hours 1 to 16)
Syllabus for Comprehensive Exam (Open Book): All topics

Important links and information:


Elearn portal: https://elearn.bits-pilani.ac.in
Students are expected to visit the Elearn portal on a regular basis and stay up to date with the
latest announcements and deadlines.
Contact sessions: Students should attend the online lectures as per the schedule provided on the
Elearn portal.
Evaluation Guidelines:
1. EC-1 consists of either two Assignments or three Quizzes. Students will attempt them
through the course pages on the Elearn portal. Announcements will be made on the
portal, in a timely manner.
2. For Closed Book tests: No books or reference material of any kind will be permitted.
3. For Open Book exams: Use of books and any printed / written reference material (filed
or bound) is permitted. However, loose sheets of paper will not be allowed. Use of
calculators is permitted in all exams. Laptops/Mobiles of any kind are not allowed.
Exchange of any material is not allowed.
4. If a student is unable to appear for the Regular Test/Exam due to genuine exigencies,
the student should follow the procedure to apply for the Make-Up Test/Exam which
will be made available on the Elearn portal. The Make-Up Test/Exam will be conducted
only at selected exam centres on the dates to be announced later.

It shall be the responsibility of the individual student to be regular in maintaining the self study
schedule as given in the course handout, attend the online lectures, and take all the prescribed
evaluation components such as Assignment/Quiz, Mid-Semester Test and Comprehensive
Exam according to the evaluation scheme provided in the handout.
