Big Data processing
Rabaa Youssef
Batch or Stream processing?
• One of the fundamental questions you need to ask when
planning your data architecture is whether to use batch or
stream processing.
Do you process data as it arrives, in real time or near-real time?
Do you wait for data to accumulate before running your job?
Batch processing
• An efficient way of processing large volumes of data
• Data is processed when:
• A group of transactions has been collected over a period of time (scheduled batch
processing: batch ETL jobs are typically run on a set schedule)
• The amount of data reaches a certain threshold
1. Collect data. 2. Load and process. 3. Produce batch results.
• Hadoop uses batch data processing (MapReduce paradigm; a minimal Java sketch follows below)
You wait to do everything at once, relying on the ability of your
system to handle it all.
• Batch processing ≠ Parallel processing (it could be sequential)
• Hadoop batch processing is parallel.
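To make the MapReduce paradigm concrete, here is a minimal word-count sketch against the standard Hadoop Java MapReduce API. It is an illustrative sketch rather than a job from the course; the input and output paths are placeholders passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token of every input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts collected for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}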
Batch processing
A batch processing system:
• Has access to all the data.
• Might compute something big and complex.
• Is generally concerned with throughput rather than
the latency of individual components of the computation.
• Has latency measured in minutes or more.
Pros & Cons of Batch Processing
Advantages of Batch Processing
• Ideal for processing large volumes of data/transactions; more
efficient than processing each item individually.
• Processing can run during less-busy times or at a designated time.
• Cost efficiency.
Disadvantages of Batch Processing
• There is a time delay between the collection of data and getting the
result of the batch process => latency
• The master file is not always kept up to date.
• Hadoop MR: no in-memory/cache management (disk access
slows down the process)
Hadoop MapReduce
Questions
• Is the MapReduce paradigm able to process all kinds of
big data use cases?
• What about data streams?
• What if high latency is not acceptable?
When to use Batch processing?
To generalize, you should lean towards batch processing when:
• Data freshness is not a mission-critical issue
• You are working with large datasets and are running a
complex algorithm that requires access to the entire batch –
e.g., sorting/averaging the entire dataset, training an ML
model
• You need a global treatment of the entire dataset, e.g., decision
making in BI solutions (creating reports requires an overview of
the entire dataset, or at least a significant amount of data)
• You get access to the data in batches rather than in streams
• When you are joining tables in relational databases
Stream Processing
• Instead of processing the entire dataset once it is fully stored,
we process data as soon as it arrives in the storage layer
• Often very close to the time at which it was generated (although
this is not always the case).
• Could be asynchronous: no requirement regarding real-time results
availability
Stream Processing
• Usually involves a relatively simple transformation or
calculation to guarantee live results and avoid congestion
• Data stream processing: each event/data portion is
processed with no need for global visibility
Low latency, but lower throughput
• Example: filtering tweets and counting how many times
our product was mentioned: stream processing rather than batch
(see the sketch below)
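As a hedged sketch of the tweet-filtering example, here is what it could look like with Kafka Streams (one of the stream processors mentioned later). The topic names, broker address, and product keyword are illustrative assumptions.

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class ProductMentionCount {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "product-mention-count");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

    StreamsBuilder builder = new StreamsBuilder();

    // Each record is one tweet; keep only those mentioning our product,
    // then count them as they arrive (no global pass over the data).
    KStream<String, String> tweets = builder.stream("tweets"); // hypothetical input topic
    KTable<String, Long> mentions = tweets
        .filter((key, text) -> text != null && text.toLowerCase().contains("ourproduct"))
        .groupBy((key, text) -> "ourproduct")
        .count();

    mentions.toStream().to("product-mentions",
        Produced.with(Serdes.String(), Serdes.Long()));

    KafkaStreams streams = new KafkaStreams(builder.build(), props);
    streams.start();
    Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
  }
}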
When to use stream processing?
• While stream processing and real-time processing are not
necessarily synonymous, we would use stream processing
when we need to analyze or serve data as close as possible to
when we get hold of it.
• Data generated in a continuous stream and arriving at high velocity
• Sub-second latency is crucial
Potential data loss: queues are needed to temporarily store
incoming data while it waits to be processed => consider a data
ingestion system (MOM/broker)
Examples: Storm (Twitter), Flink, Kafka Streams, Samza (LinkedIn)
Stream Processing solutions/architectures
Real-Time Processing
• Involves continuous input, process, and output of data.
• Processes in a short period of time.
• Every transaction/processing result is directly reflected in
the master file, so that it will always be up-to-date
=> synchronous APIs
• For tasks like fraud detection, real-time processing is very
useful. By processing transaction data, we can detect and
flag fraud in real time and stop fraudulent transactions
before they take place.
Real-Time Processing
• Computes a function of one data element, or processes
a small window of recent data.
• Computes something relatively simple
• When there is a need to compute in near real time, within
seconds at most, we go for real-time processing.
• In real-time processing, computations are generally
independent (no aggregation, for example); see the sketch below
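To illustrate "computing a function of one data element" with independent, per-event decisions (as in the fraud-detection use case), here is a minimal, hypothetical sketch; the Transaction record, the rule threshold, and the alert output are assumptions made for illustration.

import java.util.List;

public class FraudCheck {

  // Hypothetical transaction event: just the fields needed for the rule
  record Transaction(String id, String country, double amount) {}

  // A stateless check: the decision depends only on this one event,
  // no aggregation over other events is needed
  static boolean looksFraudulent(Transaction tx) {
    return tx.amount() > 10_000.0 || "BLOCKED".equals(tx.country());
  }

  // In a real system this loop would consume events from a broker (Kafka, MQ, ...)
  public static void main(String[] args) {
    List<Transaction> incoming = List.of(
        new Transaction("t1", "FR", 120.0),
        new Transaction("t2", "BLOCKED", 50.0),
        new Transaction("t3", "TN", 25_000.0));

    for (Transaction tx : incoming) {
      if (looksFraudulent(tx)) {
        System.out.println("ALERT: block transaction " + tx.id()); // e.g. call a synchronous API
      }
    }
  }
}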
Pros and Cons of Real-Time Processing
Advantages of Real-Time Processing
• No significant delay in response.
• Information is always up to date, enabling the
organization to take immediate action or respond to an
event/issue in the shortest possible span of time.
• Gain insights from the updated data, detect patterns
of either opportunities or threats.
Micro-batch Processing
• Latency/Throughput tradeoff
Incoming tasks to be executed are grouped into small
batches to achieve some of the performance advantage
of batch processing (high throughput) without
increasing the completion latency of each task too much.
Typically applied in systems where the amount of
incoming tasks is variable: the system grabs whatever
incoming tasks have been received and executes them as
a batch. This process is repeated.
Micro-batch Processing
A variant of batching that attempts to strike a better compromise
between latency and throughput than plain batching does.
How: wait for a short time interval (milliseconds or more) OR until a
batch-size threshold is reached, letting tasks pile up before processing
them: the batch cycle (see the sketch below).
The execution engine's visibility is limited to the current batch cycle.
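The batch cycle can be pictured as a loop that piles up tasks until either a size threshold or a time interval is reached, then processes them together. A minimal Java sketch, with illustrative thresholds and a placeholder process step:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class MicroBatcher {
  private static final int MAX_BATCH_SIZE = 100;   // assumed size threshold
  private static final long MAX_WAIT_MS = 200;     // assumed batch interval

  private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

  public void submit(String task) {
    queue.add(task);
  }

  // One batch cycle: pile up tasks until the size or time threshold is hit,
  // then hand the whole batch to the (batch-oriented) processing step.
  public void runForever() throws InterruptedException {
    while (true) {
      List<String> batch = new ArrayList<>(MAX_BATCH_SIZE);
      long deadline = System.currentTimeMillis() + MAX_WAIT_MS;
      while (batch.size() < MAX_BATCH_SIZE) {
        long remaining = deadline - System.currentTimeMillis();
        if (remaining <= 0) break;                          // interval elapsed
        String task = queue.poll(remaining, TimeUnit.MILLISECONDS);
        if (task == null) break;                            // no new task before the deadline
        batch.add(task);
      }
      if (!batch.isEmpty()) {
        process(batch);   // high throughput: one call handles many tasks
      }
    }
  }

  private void process(List<String> batch) {
    System.out.println("processing batch of " + batch.size() + " tasks");
  }
}

Spark Streaming follows the same idea: the input stream is cut into small batches at a fixed batch interval, and each batch is processed by a regular (batch) Spark job.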
Micro-batch Processing
• Examples : Spark Streaming, Storm-Trident.
• Does not feel like natural streaming: processing is
intermittent, depending on the batch cycle and on
the required processing.
Interactive Processing/Applications
• Who interacts with whom?
The system interacts with a user (or another system)
=> synchronous
• Which types of processing can be interactive?
Batch processing? No, because of its latency.
Streaming? Yes, but it requires in-memory
processing (e.g., Spark) for lower latency.
Batch processing
Scripting languages
How to Analyze Large Data Sets in Hadoop
• Although the Hadoop framework is implemented in
Java, MapReduce applications do not need to be
written in Java
• To abstract the complexities of the Hadoop programming
model, a few application development languages
have emerged that build on top of Hadoop:
• Pig
• Hive
• Jaql
Pig, Hive, Jaql – Similarities
• Reduced program size over Java
• Applications are translated to map
and reduce jobs behind the scenes
• Extension points for extending
existing functionality
• Interoperability with other
languages
• Not designed for random
reads/writes or low-latency queries
Pig, Hive, Jaql – Differences
Characteristic              Pig          Jaql                    Hive                BigSQL
Developed by                Yahoo!       IBM                     Facebook            IBM
Language                    Pig Latin    Jaql                    HiveQL              ANSI SQL
Type of language            Data flow    Data flow               SQL                 SQL
Data structures supported   Complex      JSON, semi-structured   Mostly structured   Mostly structured
Schema                      Optional     Optional                Mandatory           Mandatory
Pig
• The Pig platform is able to handle many kinds of data,
hence the name
• Pig Latin is a data flow language
• Two components:
• Language Pig Latin
• Runtime environment
• Two execution modes:
• Local: good for testing and prototyping (pig -x local)
• Distributed (MapReduce): needs access to a Hadoop cluster and HDFS (pig -x mapreduce)
• This is the default mode
Pig
• Three steps in a typical Pig program:
• LOAD
• Load data from HDFS
• TRANSFORM
• Translated to a set of map and reduce tasks
• Relational operators: FILTER, FOREACH, GROUP, UNION, etc.
• DUMP or STORE
• Display results on the screen or store them in a file
• Pig data types:
• Simple types:
• int, long, float, double, chararray, bytearray, boolean
• Complex types:
• tuple: ordered set of fields, e.g. (John,18)
• bag: collection of tuples, e.g. {(John,18), (Mary,29)}
• map: set of key/value pairs, e.g. [name#John, phone#1234567]
Pig
• Example: wordcount.pig
lines = LOAD './all_web_pages' AS (line:chararray);
-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- create a group for each word
word_groups = GROUP words BY word;
-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(words) AS count, group;
-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO './word_count_result';
How to run wordcount.pig?
– Local mode: bin/pig -x local wordcount.pig
– Distributed mode (MapReduce):
hadoop dfs -copyFromLocal all_web_pages input/all_web_pages
bin/pig -x mapreduce wordcount.pig
Hive
• Hive : developed by Facebook
• What is Hive?
• Data warehouse infrastructure built on top of Hadoop
• Provides an SQL-like language called HiveQL
• Allows SQL developers and business analysts to leverage existing SQL skills
• Offers built-in UDFs and indexing
• What is Hive not?
• Not designed for low-latency queries, unlike RDBMS such as DB2 and
Netezza
• Not schema-on-write: schema-on-read instead
• Not for OLTP
• Not fully SQL compliant; it only understands a limited set of commands
Big Difference: Schema on Run
Regular database: schema on load. Raw data passes through the
schema filter before being stored (pre-filtered data), and queries
read from that pre-filtered storage to produce output.
Big Data (Hadoop): schema on run. Raw data is stored unfiltered,
and the schema filter is applied only when the data is read, at
processing time, to produce output.
Hive
Benefits of schema on read:
• Flexibility in defining how your data is interpreted at load time
– This gives you the ability to evolve your "schema" as time goes on
– This allows you to have different versions of your "schema"
– This allows the original source data format to change without having to
consolidate to one data format
• You get to keep your original data
• You can load your data before you know what to do with it
• Gives you flexibility in being able to store unstructured, unclean,
and/or unorganized data
Hive
Downsides of schema on read:
• Less efficient because you have to reparse and reinterpret the
data every time (this can be expensive with formats like XML)
• Data is not self-documenting (i.e., you can't look at a schema to
figure out what the data is)
• More error prone and your analytics have to account for dirty
data
Hive
Components
• Shell : interface for users to submit queries and other operations
to the system.
• Driver : implements the notion of session handles and provides
execute and fetch APIs modeled on the JDBC/ODBC interface (see the
JDBC sketch below)
• Compiler : performs semantic analysis on the different query blocks and
query expressions and eventually generates an execution plan
• Engine : executes the execution plan. The plan is a DAG of stages.
• Metastore : stores all the structure information of tables and
partitions in the warehouse (column and column type
information, serializers and deserializers necessary to read and
write data, and the corresponding HDFS files where the data is stored).
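Because the Driver exposes execute/fetch APIs modeled on JDBC/ODBC, a client program can also query Hive through the standard Hive JDBC driver (HiveServer2). A minimal sketch; the host, port, credentials, and the table (taken from the movie-ratings example later in this section) are assumptions for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // Load the Hive JDBC driver (auto-loaded by recent JDBC versions)
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // HiveServer2 JDBC URL: host, port, and database are placeholders
    String url = "jdbc:hive2://localhost:10000/default";

    try (Connection conn = DriverManager.getConnection(url, "hive", ""); // assumed credentials
         Statement stmt = conn.createStatement()) {
      // The query is compiled into an execution plan (a DAG of stages) and run by the engine
      ResultSet rs = stmt.executeQuery(
          "SELECT movieid, COUNT(*) FROM movie_ratings GROUP BY movieid");
      while (rs.next()) {
        System.out.println(rs.getInt(1) + "\t" + rs.getLong(2));
      }
    }
  }
}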
Hive
Data models:
• Tables: analogous to tables in an RDBMS, composed of columns
• Partitions: for optimizing data access, e.g. range-partition tables by date
• Buckets: data in each partition may in turn be divided into buckets
based on the hash of a column in the table
Mapping to HDFS:
• Table -> directory
• Partitions -> sub-directories
• Buckets -> files
Hive
• Example: movie ratings analysis
-- create a table with tab-delimited text file format
hive> CREATE TABLE movie_ratings (
userid INT,
movieid INT,
rating INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
-- load data
hive> LOAD DATA INPATH 'hdfs://node/movie_data' OVERWRITE INTO
TABLE movie_ratings;
-- count ratings per movie and per rating value
hive> SELECT movieid, rating, COUNT(rating)
FROM movie_ratings
GROUP BY movieid, rating;