Introduction to University Politehnica of Bucharest – Computer Science Department
Big Data
Introduction to Big Data
2021-2022
Prof.dr.ing. Florin Pop
florin.pop@cs.pub.ro
Dr.ing. Cătălin Negru
catalin.negru@cs.pub.ro
Dr.ing. Bogdan Mocanu
mocanu.bogdan.costel@gmail.ro
SCPD • SSA • ABD • SEM
Introduction to Big Data 1
The Human Face of Big Data
• Big Data will impact every part of your life: https://www.youtube.com/watch?v=0Q3sRSUYmys
• What Exactly Is Big Data? https://www.forbes.com/video/4857597029001
Gartner Tempers The Expectations Of Big Data
The 2017 Big Data Landscape
Data & AI Landscape 2020
Understanding Big Data
2.5 quintillion bytes of data each day
A dataset whose volume, velocity, variety and complexity are beyond the ability of commonly used tools to capture, store, manage, process and analyze it is termed BIG DATA.
Complex situations?
• Data distribution
– The large data set is split into smaller chunks (blocks) and distributed over N nodes or machines.
– Distributed File System.
• Parallel processing
– The distributed data gains the power of the N servers on which it resides, and they work in parallel on processing and analysis.
– MapReduce (Google), Hadoop, Spark, Flink, etc.
• Fault tolerance
– Each block (or chunk) of data is replicated more than once.
• Commodity hardware
– No specialized hardware with special RAID is needed as a data container.
• Flexibility and Scalability
– More rack space can be added to the cluster as the demand for space increases.
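The ideas above (splitting data into chunks, processing them in parallel, replicating each chunk on several nodes) can be sketched in a few lines of Python. This is a toy single-machine model, not a distributed system; all names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, n_chunks):
    """Data distribution: split a dataset into roughly equal chunks."""
    k, m = divmod(len(data), n_chunks)
    return [data[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]
            for i in range(n_chunks)]

def replicate(chunks, n_nodes, factor=3):
    """Fault tolerance: place each chunk on `factor` distinct nodes."""
    return {i: [(i + r) % n_nodes for r in range(factor)]
            for i in range(len(chunks))}

def process(chunk):
    """The per-chunk work each node performs (here: a partial sum)."""
    return sum(chunk)

data = list(range(1000))
chunks = split(data, 4)
placement = replicate(chunks, n_nodes=5)

# Parallel processing: each chunk is handled by a separate worker.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process, chunks))

total = sum(partials)  # combine the partial results
assert total == sum(data)
```

Frameworks like Hadoop and Spark do the splitting, placement and worker scheduling automatically, across real machines.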
A brief history of Data Science
Big Data Timeline v2.0
Illustration by Héizel Vázquez
Big Data Research Challenges
• Heterogeneity and incompleteness
– machine analysis algorithms expect homogeneous data;
– even after data cleaning and error correction, some incompleteness and some errors in data
are likely to remain.
• Scale
– managing large and rapidly increasing volumes of data;
– clock speeds have largely stalled and processors are being built with increasing numbers of
cores;
– parallelism across nodes in a cluster;
– move towards cloud computing;
– large clusters require new ways of determining how to run and execute data processing jobs.
• Timeliness
– acquisition rate challenge and a timeliness challenge;
– many situations in which the result of the analysis is required immediately;
– ex: a full analysis of a user’s purchase history is not likely to be feasible in real-time.
• Privacy
– important to rethink security for information sharing in Big Data use cases;
– we do not understand what it means to share data.
• Human collaboration
– analytics for Big Data will be designed to have a human in the loop;
– crowd-sourcing;
– issues of uncertainty and error => participatory-sensing;
– the extra challenge here is the inherent uncertainty of the data collection devices.
Major Benefits of Big Data
• Answer more questions, more completely (powerful big data
business intelligence platform)
• Faster and better decision making
• Become confident in your accurate data (solving the
problem of inaccurate, incomplete view of the data)
• Empower a new generation of employees (data scientists)
• Cost reduction and revenue generation
Big Remark
We must support and encourage
fundamental research towards addressing all scientific and
technical challenges
if we want to achieve the promised benefits of
Big Data.
Introduction to Big Data.
Distributed platforms
Brief introduction
Datacenters, Cloud computing, HPC, Fog and
Edge computing; Mobile computing; VANET (V2X).
Distributed Applications (practical use cases):
Smart Cities; eHealth; eGovernment; Scientific
Applications
Datacenters
• Processing Big Data, the Google Way
Photos: GOOGLE. Left: twin banks of servers at the data center in Douglas County, Ga., which use energy-thrifty blue LEDs as monitor lights. Right: server racks at the data center in Mayes County, Okla.; each server has four switches connected by a different colored cable, and Google keeps the colors the same throughout its data centers so it knows which one to replace in case of failure.
Cloud Computing
• Definition by IBM:
– Cloud computing, often referred to as simply “the cloud,” is the delivery
of on-demand computing resources—everything from applications to data
centers—over the internet on a pay-for-use basis.
• Cloud Computing Services:
– Software as a Service (SaaS):
• applications are supplied by the cloud provider;
• applications are accessible from various client devices (e.g. web browser, API);
• Examples: GMAIL, Dropbox, Office 365, Amazon Web Services
– Platform as a Service (PaaS)
• deploy applications onto the Cloud infrastructure
• programming languages, tools and libraries are supported by the provider;
• control over the deployed applications
– Infrastructure as a Service (IaaS)
• provision of fundamental computing resources: VMs (CPU, storage, network);
• resources are distributed and support dynamic scaling;
• deploy and run arbitrary software, which can include operating systems and middleware;
• control over operating systems, storage, and deployed applications
• User does not manage or control the underlying Cloud infrastructure.
Cloud Computing - Deployment models
• Private
– Single-tenant architecture
– On-premises hardware
– Direct control of infrastructure
– Ex: IBM, Red Hat, Microsoft, OpenStack,
ownCloud
• Public
– Multi-tenant architecture
– Pay-as-you-go pricing model
– Ex: AWS, Microsoft Azure, Google Cloud
Platform
• Hybrid
– Cloud bursting capabilities
– Benefit of both public and private environments
– Ex: combination of both public and private
providers
HPC
• High Performance Computing most generally refers to the
practice of aggregating computing power in a way that
delivers much higher performance than one could get out
of a typical desktop computer or workstation in order to
solve large problems in science, engineering, or business.
Fog and Edge Computing
Mobile computing
VANET (V2X) – Vehicular ad hoc networks
• Vehicle-to-Vehicle (V2V) among vehicles
– A wireless ad hoc network on the roads, suited for short range vehicular networks;
• Vehicle-to-Infrastructure (V2I) between vehicles and Road-side-Units
– Long range vehicular networks;
• Vehicle-to-X (V2X) mixed approach
– Vehicle-to-everything
Data Science and the Values of
Big Data
The 8V’s: Volume, Velocity, Variety, Variability,
Visualization, Value, Veracity, Vicissitude;
Use cases and current challenges
Starting V
Volume
• Big Data Volume refers to the large amounts of generated
data:
– 40 ZettaBytes by 2020;
• 300 times more than in 2005;
– ~2.5 quintillion bytes (2.3 trillion GB) of data created daily;
– 100 Terabytes stored per company;
– 6 billion cell phones;
– 7 billion people.
• Volume challenges:
– Storage, analysis and processing;
– Even the smallest optimization represents a step ahead in getting
the meaningful data in time and at a low cost;
Velocity
• Velocity refers to the speed of generating data (the speed at
which the data is flowing):
– New York Stock Exchange captures 1TB of trade information
during each trading session;
– Modern cars have about 100 sensors;
– ~18.9 billion network connections
• about 2.5 connections per person;
– Social media messages are
spread in seconds;
• Velocity challenges:
– Response time of delivery services (for financial markets it is real-time);
– Speed of manipulating and analyzing complex data;
– The Big Data system must be able to adjust and deal with the high
loads of data at peak times;
Variety
• Variety refers to the diversity of data:
– 150 Exabytes in healthcare (2011);
– ~30 billion pieces of content shared on Facebook;
– By 2014 ~420 mil. wearable, wireless
health monitors;
– 4 billion hours of videos watched on YouTube per month;
– 400 million tweets sent per day by 200 mil. active users;
• Variety challenges:
– Big Data systems should handle complex data: relational data, raw data,
semi-structured or unstructured data;
– Almost 80% of the world’s data is now unorganized;
• traditional DB can’t be used anymore for storage and management
– different types of data must be processed and mapped to dedicated
resources;
Variability
• Variability refers to data whose
meaning is constantly changing;
• For example, words don’t have static
definitions, and their meaning varies with context:
– “Delicious muesli from the
@imaginarycafe- what a great
way to start the day!”
– “Greatly disappointed that my
local Imaginary Cafe have
stopped stocking BLTs.”
– “Had to wait in line for 45
minutes at the Imaginary Cafe
today. Great, well there’s my
lunchbreak gone…”
– “great” on its own is not a sufficient
signifier of positive sentiment;
• Variability challenges:
– ‘Understand’ context and decode the
precise meaning of words
Visualization
• Visualization refers to presentation of data in a pictorial or
graphical format:
– Analytics presented visually:
• to understand difficult concepts;
• to identify new patterns;
– Interactive visualization:
• charts;
• graphs;
• ex: https://d3js.org/, https://developers.google.com/chart/ ,
http://www.fusioncharts.com/,
• Visualization challenges:
– Develop new techniques for data visualization;
– Data exploration;
– Inconsistencies in the data structure;
Value
• Value refers to the value that can be extracted from datasets:
– it does not matter how big or complex the data volume is;
– it is useless unless we are able to extract the meaningful information;
– storing and processing meaningless data is a waste of money and time, and makes obtaining the relevant information harder;
• Traditional approach to Big Data: systems for storing big data, but getting insights can be difficult:
– time-intensive, manual query and analysis;
– modern dashboarding tools;
• Modern approach to Big Data:
– Machine brains: specialized algorithms that can learn to identify patterns, correlations and indicators through training and repetition;
• Big Data Value challenges:
– developing tools for getting value from Big Data;
Veracity
• Veracity refers to the trustworthiness of the data (uncertainty of data):
– 1 in 3 business leaders don’t trust the information they use to make
decisions;
– 27% of respondents in one survey were unsure how much of their data
was inaccurate;
– Poor data quality costs the US economy about $1.3 trillion/year;
• Causes of veracity issues:
– Rumors, spammers, collection errors,
entry errors, system errors;
• Veracity challenges:
– the volume and complexity of data
are increasing;
– quality and accuracy are becoming
less controllable;
Vicissitude
• Vicissitude refers to the challenge of scaling complex Big Data workflows;
BTWorld: A Large-scale Experiment in Time-Based
Analytics
+ more Vs of Big Data
+ Volatility
+ Veridicity
+ Vision
+…
10 Cool Big Data Projects
• #1. To find exactly what we look for in the internet
• #2. To ride through a city without traffic jams
• #3. To save rare animals, catching poachers
• #4. To make our cities green
• #5. To understand why Indian cuisine is unique
• #6. To fight malaria epidemics in Africa
• #7. To grow ideal Christmas trees
• #8. To understand that our languages are filled with happiness
• #9. To make sport shows even more interesting
• #10. To improve job conditions
• #11. To enhance relationships
https://www.kaspersky.com/blog/cool-big-data-projects/8186/
Distributed Model for Scalable Batch
Computing
Google MapReduce. MapReduce Patterns,
Algorithms and use cases
Philosophy to Scale for Big Data
Divide Work
Combine Results
Distributed processing is non-trivial
• How to assign tasks to different workers in an efficient way?
• What happens if tasks fail?
• How do workers exchange results?
• How to synchronize distributed tasks allocated to different
workers?
• Big Data storage is challenging
– Data Volumes are massive
– Reliability of Storing PBs of data is challenging
– All kinds of failures: Disk/Hardware/Network Failures
– The probability of failures simply increases with the number of
machines …
Google MapReduce
• MapReduce is:
– a programming model;
– implementation for processing and generating big data sets;
• Released in 2004 by Google (Jeffrey Dean and Sanjay
Ghemawat)
• Perform simple computations while hiding the details of:
– Parallelization;
– Data distribution;
– Load balancing;
– I/O scheduling;
– Fault tolerance;
• Distribute data over multiple nodes
– Each node receives a small portion of data
[Figure: a large dataset is split into chunks, processed in parallel by MAP tasks, and combined by REDUCE into the results.]
Google MapReduce (2)
• MapReduce programming model:
– Input & Output: each a set of key/value pairs
• The programmer must specify two functions:
– map(in_key, in_value) → list(out_key, intermediate_value)
• Processes an input key/value pair;
• Produces a set of intermediate pairs;
– reduce(out_key, list(intermediate_value)) → list(out_value)
• Combines all intermediate values for a particular key;
• Produces a set of merged output values;
map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // output_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
        result += ParseInt(v);
    Emit(AsString(result));
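The word-count pseudocode above can be mirrored in a short plain-Python simulation of the model (no distributed runtime; the shuffle step, grouping intermediate values by key, is done by the framework in a real system):

```python
from collections import defaultdict

def map_fn(doc_name, doc_contents):
    # map(in_key, in_value) -> list of (out_key, intermediate_value)
    return [(word, 1) for word in doc_contents.split()]

def shuffle(pairs):
    # Group intermediate values by key, as the runtime would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(word, counts):
    # reduce(out_key, list(intermediate_value)) -> out_value
    return sum(counts)

docs = {"d1": "big data is big", "d2": "data is data"}
intermediate = [pair for name, text in docs.items()
                for pair in map_fn(name, text)]
result = {word: reduce_fn(word, values)
          for word, values in shuffle(intermediate).items()}
# result == {"big": 2, "data": 3, "is": 2}
```

In a real cluster, map_fn runs on many nodes at once and reduce_fn runs per key partition; the logic is unchanged.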
MapReduce Architecture
MapReduce Data Flow
MapReduce Patterns
• Counting and Summing - log analysis, data querying:
– Calculate a total number of occurrences of each word in a large set of
documents;
– Process a log file :
• each record contains a response time;
• it is required to calculate an average response time.
• Collating – ETL, building of inverted indexes:
– Save all items that have the same value of a function into one file
• Filtering (“Grepping”), Parsing, and Validation - log analysis, data
querying, ETL, data validation:
– Collect all records that meet some condition;
– Transform each record (independently from other records) into another
representation
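The counting/summing and filtering patterns above can be sketched in-memory in Python (hypothetical log records; in MapReduce the map phase would emit (key, time) pairs and the reduce phase would average per key):

```python
from collections import defaultdict

# Hypothetical log records: (request, response time in ms).
log = [("GET /a", 120), ("GET /b", 80), ("GET /a", 100), ("GET /c", 60)]

# Filtering ("grepping"): keep only records that meet some condition.
slow = [rec for rec in log if rec[1] > 90]

# Counting and summing: group times by request key (the shuffle step),
# then reduce each group to an average response time.
groups = defaultdict(list)
for url, t in log:
    groups[url].append(t)
avg = {url: sum(ts) / len(ts) for url, ts in groups.items()}
# avg == {"GET /a": 110.0, "GET /b": 80.0, "GET /c": 60.0}
```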
MapReduce Patterns (2)
• Distributed Task Execution – physics and engineering simulations,
numerical analysis, performance testing:
– A large computational problem is divided into multiple parts;
– Results from all parts are combined to obtain the final result.
• Sorting - ETL, data analysis:
– Sort large set of records by some rule;
– Process records in a certain order.
• Iterative Message Passing (Graph Processing) - graph analysis,
web indexing:
– Calculate a state of each entity in a network on the basis of properties of
the other entities in its neighborhood;
– State can represent :
• distance to other nodes;
• indication that there is a neighbor with certain properties;
• characteristic of neighborhood density;
• etc.
MapReduce Patterns (3)
• Distinct Values (Unique Items Counting) – log analysis, unique users
counting
– Set of records that contain fields F and G;
– Count the total number of unique values of field F for each subset of records that
have the same G (grouped by G).
• Cross-Correlation – text analysis, market analysis:
– A set of tuples of items;
– For each possible pair of items, calculate the number of tuples in which they co-occur.
• Relational MapReduce Patterns:
– Selection
– Projection
– Union
– Intersection
– Difference
– GroupBy and Aggregation
– Joining
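The Distinct Values pattern described above is typically run as two MapReduce phases; the same two phases can be sketched in-memory in Python (hypothetical records with F = user, G = country):

```python
from collections import defaultdict

# Hypothetical records as (F, G) pairs.
records = [("u1", "RO"), ("u2", "RO"), ("u1", "RO"),
           ("u3", "DE"), ("u3", "DE")]

# Phase 1: map emits ((G, F), 1); reducing on the composite key
# deduplicates the F values within each group G.
unique_pairs = {(g, f) for f, g in records}

# Phase 2: map emits (G, 1) per unique pair; the reducer sums per G.
distinct_f_per_g = defaultdict(int)
for g, _f in unique_pairs:
    distinct_f_per_g[g] += 1
# distinct_f_per_g == {"RO": 2, "DE": 1}
```

Splitting the work into two phases matters at scale: no single reducer ever has to hold all distinct F values for a group in memory at once.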
Algorithms and use cases
• PageRank and Mapper-Side Data Aggregation
– Used by Google to calculate relevance of a web page as a function of
authoritativeness (PageRank) of pages that have links to this page.
– the algorithm is complex, but at its core it is just a propagation of weights between nodes:
• each node calculates its weight as a mean of the incoming weights:
class N
    State is PageRank
    method getMessage(object N)
        return N.State / N.OutgoingRelations.size()
    method calculateState(state s, data [d1, d2,...])
        return sum([d1, d2,...])

class Mapper
    method Initialize
        H = new AssociativeArray
    method Map(id n, object N)
        p = N.PageRank / N.OutgoingRelations.size()
        Emit(id n, object N)
        for all id m in N.OutgoingRelations do
            H{m} = H{m} + p
    method Close
        for all id n in H do
            Emit(id n, value H{n})

class Reducer
    method Reduce(id m, [s1, s2,...])
        M = null
        p = 0
        for all s in [s1, s2,...] do
            if IsObject(s) then
                M = s
            else
                p = p + s
        M.PageRank = p
        Emit(id m, item M)
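The mapper/reducer pseudocode above boils down to one round of weight propagation. A plain-Python sketch of a single iteration over a hypothetical three-node graph (no damping factor, and with the mapper-side AssociativeArray aggregation folded away for brevity):

```python
from collections import defaultdict

# Hypothetical graph: node -> outgoing links, plus current ranks.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = {"a": 1.0, "b": 1.0, "c": 1.0}

# Map phase: every node splits its rank evenly among its out-links.
messages = defaultdict(list)
for n, out in graph.items():
    share = rank[n] / len(out)
    for m in out:
        messages[m].append(share)

# Reduce phase: a node's new rank is the sum of its incoming shares.
new_rank = {n: sum(messages[n]) for n in graph}
# new_rank == {"a": 1.0, "b": 0.5, "c": 1.5}
```

The mapper-side AssociativeArray in the pseudocode exists to pre-sum the shares emitted by one mapper, cutting the volume of intermediate data shuffled to the reducers.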
Word Count Example
• Mapper
– Input: value: lines of text of input
– Output: key: word, value: 1
• Reducer
– Input: key: word, value: set of counts
– Output: key: word, value: sum
• Launching program
– Defines this job
– Submits job to cluster
Owen O’Malley (Yahoo!)
Word Count Mapper
public static class Map extends MapReduceBase implements
    Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
Word Count Reducer
public static class Reduce extends MapReduceBase implements
    Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Putting it all together
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
Apache Hadoop
Programming environment and use cases
* Based on slides by Dhruba Borthakur, Owen O’Malley (Yahoo!)
The history of an elephant
• Hadoop is an open-source implementation based on Google
File System (GFS) and MapReduce from Google
• Hadoop was created by Doug Cutting and Mike Cafarella in
2005
• Hadoop is an Apache open source project from 2006
• Why Hadoop?
– Need to process huge datasets on large clusters of computers
– Very expensive to build reliability into each application
– Nodes fail every day
• Failure is expected, rather than exceptional
• The number of nodes in a cluster is not constant
– Need a common infrastructure
• Efficient, reliable, easy to use
Apache Hadoop Basic Modules
[Diagram: Hadoop's basic modules, a computation layer and a storage layer.]
Distributed File System
• Single Namespace for entire cluster
• Data Coherency
– Write-once-read-many access model
– Client can only append to existing files
• Files are broken up into blocks
– Typically 64MB block size
– Each block replicated on multiple Data Nodes
• Intelligent Client
– Client can find location of blocks
– Client accesses data directly from Data Node
Hadoop HDFS
• Hadoop Distributed File System (based on the Google File System (GFS)
paper, 2003)
– Serves as the distributed file system for most tools in the Hadoop
ecosystem
– Scalability for large data sets
– Reliability to cope with hardware failures
• HDFS good for:
– Large files
– Streaming data
• Not good for:
– Lots of small files
– Random access to files
– Low latency access
Design of HDFS
• Master-Slave design
– Master Node
• Single NameNode for managing metadata
– Slave Nodes
• Multiple DataNodes for storing data
– Other
• Secondary NameNode as a backup
• HDFS files are divided into blocks
– The block is the basic unit of read/write
– Default size is 64MB, could be larger (128MB)
– Hence HDFS is good for storing large files
• HDFS blocks are replicated multiple times
– One block is stored at multiple locations, also at different racks (usually 3 times)
– This makes HDFS storage fault tolerant and faster to read
[Figure: a client talks to the NameNode (with a Secondary NameNode as backup) and to the DataNodes; the NameNode keeps the metadata (name, location, directory), the DataNodes provide storage for blocks of data, and heartbeats, commands and data flow between them.]
HDFS Architecture
HDFS Inside: Blocks
• Reasons:
– A file can be larger than a single disk;
– A block is of fixed size, easy to manage and manipulate;
– Easy to replicate and do more fine-grained load balancing.
• The HDFS block size is by default 64 MB (larger than a regular file
system block)
– Minimizes overhead: disk seek time is almost constant;
– Example: seek time 10 ms, file transfer rate 100 MB/s, overhead
(seek time / block transfer time) 1%: what is the block size?
– 100 MB (HDFS uses 128 MB in practice)
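The arithmetic behind the example: overhead = seek_time / (block_size / transfer_rate), so solving for the block size that keeps overhead at 1% gives transfer_rate × seek_time / overhead:

```python
seek_time = 0.010        # 10 ms per seek
transfer_rate = 100e6    # 100 MB/s, i.e. 100e6 bytes per second
overhead = 0.01          # target: seek time is 1% of transfer time

# overhead = seek_time / (block_size / transfer_rate)
block_size = seek_time * transfer_rate / overhead
# block_size == 100e6 bytes, i.e. 100 MB
```

A bigger block amortizes the fixed seek cost over more transferred bytes, which is why HDFS blocks are so much larger than local file system blocks.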
NameNode Metadata
• Metadata in Memory
– The entire metadata is in main memory
– No demand paging of metadata
• Types of metadata
– List of files
– List of Blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time, replication factor
• A Transaction Log
– Records file creations, file deletions etc.
DataNode
• A Block Server
– Stores data in the local file system (e.g. ext3)
– Stores metadata of a block (e.g. CRC)
– Serves data and metadata to Clients
• Block Report
– Periodically sends a report of all existing blocks to the NameNode
• Facilitates Pipelining of Data
– Forwards data to other specified DataNodes
Block Placement and Heartbeats
• Current Strategy
– One replica on local node
– Second replica on a remote rack
– Third replica on same remote rack
– Additional replicas are randomly placed
• Clients read from nearest replicas
• Would like to make this policy pluggable
• DataNodes send heartbeats to the NameNode
– Once every 3 seconds
• NameNode uses heartbeats to detect DataNode failure
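Heartbeat-based failure detection can be sketched in a few lines (toy timestamps; the 3-second heartbeat interval is from the slide, while the "dead after 10 missed heartbeats" threshold and the DataNode names are assumptions for the sketch):

```python
HEARTBEAT_INTERVAL = 3.0              # seconds, per the slide
DEAD_AFTER = 10 * HEARTBEAT_INTERVAL  # assumed threshold for the sketch

# Last heartbeat time (seconds) recorded per DataNode.
last_heartbeat = {"dn1": 105.0, "dn2": 128.0, "dn3": 95.0}

def dead_datanodes(now):
    """DataNodes whose last heartbeat is older than the threshold."""
    return [dn for dn, t in last_heartbeat.items() if now - t > DEAD_AFTER]
```

For example, dead_datanodes(130.0) flags only "dn3" (35 s of silence); once a DataNode is declared dead, the NameNode re-replicates the blocks it held.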
Fault tolerance support in HDFS
• Replication of blocks for fault tolerance
[Figure: a file's blocks B1–B4, each replicated three times across the cluster's nodes.]
HDFS Inside: Name Node
• The Name Node keeps a snapshot of the FS and an edit log that records changes to the FS:

Filename | Replication factor | Block IDs
File 1   | 3                  | [1, 2, 3]
File 2   | 2                  | [4, 5, 6]
File 3   | 1                  | [7, 8]

[Figure: the blocks are spread over the Data Nodes so that each block appears as many times as its replication factor.]
HDFS Inside: Read
• Prevent the NN from being the bottleneck of the cluster
• Allow HDFS to scale to a large number of concurrent clients
• Spread the data traffic across the cluster
1. Client connects to the NN to read data
2. NN tells the client where to find the data blocks
3. Client reads the blocks directly from the data nodes (without going through the NN)
4. In case of node failures, the client connects to another node that serves the missing block
HDFS Inside: Write
Trade-offs: reliability, write bandwidth and read bandwidth.
1. Client connects to the NN to write data
2. NN tells the client which data nodes to write to
3. Client writes the blocks directly to the data nodes with the desired replication factor
4. In case of node failures, the NN will detect it and replicate the missing blocks
Example of HDFS shell commands
Create a directory in HDFS
• hadoop fs -mkdir /user/bigdata/dir1
List the content of a directory
• hadoop fs -ls /user/bigdata
Upload and download a file in HDFS
• hadoop fs -put /home/bigdata/file.txt /user/bigdata/datadir/
• hadoop fs -get /user/bigdata/datadir/file.txt /home/
Look at the content of a file
• hadoop fs -cat /user/bigdata/datadir/book.txt
Many more commands, similar to Unix
Who uses Hadoop?
• Amazon/A9
• Facebook
• Google
• New York Times
• Veoh
• Yahoo!
• …. many more
Hadoop Cluster at Yahoo! (Credit: Yahoo)
A real world example of New York Times
• Goal: Make entire archive of articles available online: 11
million, from 1851
• Task: Translate 4 TB TIFF images to PDF files
• Solution: Used Amazon Elastic Compute Cloud (EC2) and
Simple Storage Service (S3)
• Time: < 24 hours
• Costs: ~ $240
Hadoop Ecosystem
• Pig: High-level language for data analysis (Yahoo! Research)
• Hive: SQL-like Query language and Metastore (Facebook)
• HBase: Table storage for semi-structured data (Modeled on
Google’s Bigtable)
• Zookeeper: Coordinating distributed applications
• Mahout: Machine learning
In-Memory Cluster Computing
Apache Spark
Overview of LAMBDA Architecture
• The Lambda Architecture has three major components:
– Batch layer provides the following functionality
• managing the master dataset, an immutable, append-only set of raw data
• pre-computing arbitrary query functions, called batch views.
– Serving layer
• indexes the batch views so that they can be queried ad hoc with low latency.
– Speed layer
• accommodates all requests that are subject to low-latency requirements.
• uses fast and incremental algorithms
• deals with recent data only.
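A query in the Lambda Architecture merges the precomputed batch view with the speed layer's real-time view. A minimal sketch (illustrative only; the view contents are assumed):

```python
# Toy Lambda Architecture query: combine a precomputed batch view
# (counts up to the last batch run) with a real-time view covering
# only the data that arrived since then.
batch_view    = {"page_a": 100, "page_b": 40}   # from the batch layer
realtime_view = {"page_a": 3,   "page_c": 7}    # from the speed layer

def query(key):
    # The serving layer answers by merging both views.
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

print(query("page_a"))  # 103: batch count plus recent increments
print(query("page_c"))  # 7: seen only by the speed layer so far
```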
Apache Spark
• Spark is a general-purpose computing framework for iterative
tasks
• API is provided for Java, Scala and Python
• The model is based on MapReduce
– enhanced with new operations and an engine that supports
execution graphs
• Tools include Spark SQL, MLlib for machine learning, GraphX
for graph processing and Spark Streaming
• Unifies batch, streaming,
interactive computing
• Making it easy to build
sophisticated applications
Key idea in Spark
• Resilient distributed datasets (RDDs)
– Immutable collections of objects across a cluster
– Built with parallel transformations (map, filter, …)
– Automatically rebuilt when failure is detected
– Allow persistence to be controlled (in-memory operation)
• Transformations on RDDs
– Lazy operations that build RDDs from other RDDs
– Each transformation creates a new RDD
• Actions on RDDs
– Count, collect, save
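The distinction between lazy transformations and actions can be imitated with plain Python generators (a conceptual sketch, not the Spark API):

```python
# Transformations (map/filter over a generator) only describe the
# computation; nothing runs until an action (sum) consumes the pipeline.
data = range(1, 6)                            # pretend this is an RDD of 1..5

squares = (x * x for x in data)               # "transformation": lazy
evens   = (x for x in squares if x % 2 == 0)  # another lazy transformation

result = sum(evens)                           # "action": triggers execution
print(result)                                 # 4 + 16 = 20
```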
Spark overview
YARN
MESOS Architecture
Apache Mesos is an open source cluster manager that handles workloads in a
distributed environment through dynamic resource sharing and isolation.
Task Scheduler
• Supports general task graphs
• Pipelines functions where possible
• Cache-aware data reuse & locality
• Partitioning-aware to avoid shuffles
• MESOS provides resource
allocation (offer resources to
framework, accept/reject by
framework scheduler)
Implementing Spark Algorithms
• Broadcast everything
– Master broadcasts data and initial models
– At each iteration updated models are broadcast by master (driver
program)
– Does not scale well due to communication overhead
• Data parallel
– Worker loads data
– Master broadcasts initial models
– At each iteration updated models are broadcast by master
– Works for large datasets, because data is available to workers
• Fully parallel
– Workers load data and they instantiate the models
– At each iteration, models are shared via join between workers
– Much better scalability
Machine Learning and Spark
• Spark RDDs support efficient data sharing
• In-memory caching increases performance
– Reported to have performance of up to 100 times faster than
Hadoop in memory or 10 times faster on disk
• High-level programming interface for complex algorithms
MLBase and MLlib
• MLBase has been designed for simplifying the development of
machine learning pipelines:
– MLlib is a machine learning library
– MLI (ML Developer API) is an API for machine learning
development that aims to abstract low-level details from the
developers
– MLOpt is a declarative layer that aims to automate the machine
learning pipeline
• The idea is that the system searches feature extractors and models
best fit for the ML task
Graph-Parallel Systems
• Graph-based computation depends only on the neighbors of a
particular vertex
– “Think like a Vertex.” – Pregel (SIGMOD 2010)
• Systems with specialized APIs to simplify graph processing
– Pregel from Google
• Push abstraction: Vertex programs interact by sending messages
• Receive msgs, process, send msgs
• GraphLab
– Pull abstraction: Vertex programs access adjacent vertices and
edges
– Foreach (j in neighbours) calculate pagerank total for j
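The pull-style vertex program above can be written out for PageRank on a tiny hard-coded graph (an illustration of the abstraction, not GraphLab code):

```python
# Pull-style PageRank: each vertex recomputes its rank from the
# ranks of its in-neighbours, as in the GraphLab abstraction.
DAMPING = 0.85
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # vertex -> out-links

# Invert edges so each vertex can "pull" from its in-neighbours.
in_links = {v: [] for v in graph}
for src, dsts in graph.items():
    for dst in dsts:
        in_links[dst].append(src)

rank = {v: 1.0 / len(graph) for v in graph}
for _ in range(20):                    # fixed number of iterations
    rank = {v: (1 - DAMPING) / len(graph)
               + DAMPING * sum(rank[u] / len(graph[u]) for u in in_links[v])
            for v in graph}

print({v: round(r, 3) for v, r in rank.items()})
```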
GraphX
• Separation of system support for each view (table, graph) involves
expensive data movement and duplication
• GraphX makes tables and graphs views of the same physical data
• The views have
their own optimized
semantics
– Table operators inherited from Spark
– Graph operators expressed in relational algebra
Performance Gains for PageRank
Revisited
• Spark is reported to be 4x faster than Hadoop
• Graphlab is 16x faster than Spark
• GraphX is roughly 3x slower than Graphlab
• GraphX is reported to compare favourably to Graphlab with
pipelines (raw -> hyperlink -> pagerank -> top 20)
• Graph structure can be exploited for significant performance
gains
Spark Streaming
• Spark extension for accepting and processing high-throughput
live data streams
• Data is accepted from various sources
– Kafka, Flume, TCP sockets, Twitter, …
• Machine learning algorithms and graph processing algorithms can be
applied for the streams
• Similar systems
– Twitter (Storm),
Google (MillWheel),
Yahoo! (S4)
Stream Processing: Discretized Streams
• Streaming computation: a series of very small deterministic batch jobs
– Live stream is divided into batches of x seconds
– Each batch of data is an RDD and RDD operations can be used
– Results are also returned in batches
– Batch sizes as low as 0.5 seconds result in approximately one-second latency
• Can combine streaming and batch processing
Click stream processing example
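The micro-batch idea can be sketched in plain Python (a conceptual illustration, not the DStream API; interval size and events are made up):

```python
# Discretized streams: chop a timestamped event stream into
# fixed-size micro-batches and run ordinary batch logic on each one.
BATCH_SECONDS = 1

events = [(0.2, "click"), (0.7, "view"), (1.1, "click"), (1.9, "click")]

batches = {}
for t, ev in events:
    batches.setdefault(int(t // BATCH_SECONDS), []).append(ev)

# Per-batch "RDD operation": count clicks in each interval.
counts = {b: sum(1 for e in evs if e == "click") for b, evs in batches.items()}
print(counts)   # {0: 1, 1: 2}
```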
Event vs. Processing Time
• There is a difference between event time (te) and processing time (tp)
• Events arrive out-of-order even during normal operation
• Events may arrive arbitrarily late
• Apply a grace period before processing events
• Allow arbitrary update windows of metrics
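The grace-period idea can be sketched as follows: a window is only finalized once processing time has passed the window's end plus an allowed lateness (the window and grace values are illustrative):

```python
# Grace period for late events: keep a window open for GRACE seconds
# of processing time after its event-time end before finalizing it.
WINDOW, GRACE = 10, 2

def window_of(event_time):
    return event_time // WINDOW          # window id for an event

def is_closed(window_id, processing_time):
    window_end = (window_id + 1) * WINDOW
    return processing_time >= window_end + GRACE

# An event with event time 9 belongs to window 0 (covering [0, 10)).
# At processing time 11 the window is still open (grace not expired)...
print(is_closed(window_of(9), 11))   # False
# ...but at processing time 12 it is finalized.
print(is_closed(window_of(9), 12))   # True
```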
K – Architecture
• A distributed data operating system is emerging
– Supported by YARN and MESOS
• Various data services on top of this (Hadoop and Spark)
• Some services are being integrated (for example in Spark) for better
coherence and performance
• Important points
– Data format (row/column, block size)
– Network topology and data/code placement
– Algorithm structure and coordination
– Scheduling and resource management
• Big Data Frameworks are evolving
– Spark represents unification of streaming, machine learning and graphs
• Big Data pipeline management is at an early stage
– How to achieve better mapping between cluster resources, scheduling, and
pipelines?
Frameworks for real-time data stream
analytics
Apache Storm
Apache Storm Architecture
• Nimbus:
– Distributes code around the cluster
– Assigns tasks to machines (supervisors)
– Failure monitoring
– Is fail-fast and stateless
• Supervisor:
– Listens for work assigned to its machine
– Starts and stops worker processes
• Topology: processes messages forever
– A running topology consists of many worker processes spread across
many machines (graph of computation)
• Zookeeper:
– Handles cluster coordination
– Monitors worker node status
– Helps the Supervisors interact with the Nimbus
Apache Storm
• Simple programming model
– Topology: a computation graph that runs forever
• a node in a topology contains processing logic
• links between nodes indicate how data should be passed around
between nodes
– Spouts: sources of streams
– Bolts: consume input streams
• Programming-language agnostic
– (Clojure, Java, Ruby, Python by default)
• Fault-tolerant
• Horizontally scalable
– Ex: 1,000,000 messages per second on a 10-node cluster
• Guaranteed message processing
• Fast: uses the ZeroMQ message queue
• Local mode: easy unit testing
Storm Streams and Spouts
• Stream
– Unbounded sequence of tuples (Storm's data model)
• <key, value(s)> pairs, e.g. <"UIUC", 5>
• Spouts
– Source of streams: e.g. the Twitter firehose API, a stream of
tweets, or some crawler
– Have interfaces that you implement to run application-specific
logic
Storm Bolts
• Bolts:
– Process one or more input streams and produce new streams
• Functions
– Filter, join, apply/transform, etc.
– Parallelize to make it fast: multiple tasks constitute a bolt
Storm Topology & Grouping
• Topology:
– Graph of computation; can have cycles
– Network of spouts and bolts
– Spouts and bolts execute as many tasks across the cluster
• Grouping
– How are tuples sent between the components/tasks?
• Shuffle grouping
– Distributes the stream "randomly" to the bolt's tasks
• Fields grouping
– Groups a stream by a subset of its fields
• All grouping
– All tasks of the bolt receive all input tuples; useful for joins
• Global grouping
– The entire stream goes to a single one of the bolt's tasks
– Picks the task with the lowest id
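Fields grouping can be illustrated with a simple hash: tuples with the same key are always routed to the same task (a sketch of the idea, not Storm's internal code):

```python
# Fields grouping: pick the destination task by hashing the grouping
# field, so tuples with equal keys always reach the same bolt task.
NUM_TASKS = 4

def fields_grouping(key):
    # A stable, deterministic hash keeps the mapping consistent.
    return sum(key.encode()) % NUM_TASKS

tuples = [("UIUC", 5), ("MIT", 3), ("UIUC", 2)]
routed = [(fields_grouping(k), (k, v)) for k, v in tuples]
print(routed)   # both "UIUC" tuples land on the same task id
```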
Zookeeper
• Open-source server for highly reliable distributed coordination
• Acts as a replicated synchronization service with eventual
consistency
• Features:
– Robust
– Persistent data replicated across multiple nodes
– A master node for writes
– Concurrent reads
– Comprises a tree of znodes: entities roughly representing
file-system nodes
– Use it only for saving small configuration data
Storm Message Processing
• Guaranteed message processing:
– A message is "fully processed" when the tuple tree has been
exhausted and every message in the tree has been processed
– A tuple is considered failed when its tree of messages fails to be
fully processed within a specified timeout.
Fault Tolerance APIs
• Emit(tuple, output)
– Emits an output tuple, perhaps anchored on an input tuple (the first
argument)
• Ack(tuple)
– Acknowledges that you (the bolt) finished processing a tuple
• Fail(tuple)
– Immediately fails the spout tuple at the root of the tuple tree if there
is an exception from the database, etc.
• You must remember to ack/fail each tuple
– Each tuple consumes memory; failing to do so results in memory
leaks
Storm Fault Tolerance
• Anchoring
– Specifies a link in the tuple tree
• (anchors an output to one or more input tuples)
– Done at the time of emitting a new tuple
– Enables replaying one or more tuples
• "Acker" tasks
– Track the DAG of tuples for every spout
– Every tuple (spout/bolt) is given a random 64-bit id
– Every tuple knows the ids of all spout tuples for which it exists in the tree
• How?
– Every individual tuple must be acked
– If not, the task will run out of memory!
– Filter bolts ack at the end of execution
– Join/aggregation bolts use multi-ack
Failure Handling
• A tuple isn't acked because the task died: Spout tuple ids at the root of the
trees for the failed tuple will time out and be replayed.
• Acker task dies: All the spout tuples the acker was tracking will time out and be
replayed.
• Spout task dies: The source that the spout talks to is responsible for replaying the
messages.
• For example, queues like Kestrel and RabbitMQ will place all pending messages back on
the queue when a client disconnects.
• Major breakthrough: the tracking algorithm
– Storm uses mod hashing to map a spout tuple id to an acker task
• The acker task stores a map from a spout tuple id to a pair of values:
• the task id that created the spout tuple
• a 64-bit number, the "ack val"
– the XOR of all tuple ids that have been created/acked in the tree
– The tuple tree is completed when ack val = 0
• Configuring reliability
– Set Config.TOPOLOGY_ACKERS to 0, or
– emit tuples as unanchored tuples
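The XOR trick can be demonstrated directly: every tuple id is XORed into the ack val once when created and once when acked, so the value returns to zero exactly when the whole tree is done (the ids below are hypothetical):

```python
# Storm-style ack tracking: XOR each tuple id into ack_val on creation
# and again on ack; ack_val == 0 <=> the tree is fully processed.
ack_val = 0
ids = [0x9f3a, 0x41c7, 0x77aa]   # hypothetical tuple ids (64-bit in Storm)

for i in ids:                    # tuples created: XOR their ids in
    ack_val ^= i
print(ack_val == 0)              # False: tree not fully processed yet

for i in ids:                    # tuples acked: XOR the same ids again
    ack_val ^= i
print(ack_val == 0)              # True: every created tuple was acked
```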
Failure Scenario
• Let's take a scenario:
– Count aggregation of your stream
– Store running count in database. Increment count after processing tuple.
– Failure!
• Design:
– Tuples are processed as small batches.
– Each batch of tuples is given a unique id called the "transaction id" (txid).
– If the batch is replayed, it is given the exact same txid.
– State updates are ordered among batches.
• Example: processing a batch (["man"], ["man"], ["dog"]) with txid = 3
– Database state before: man => [count=3, txid=1], dog => [count=4, txid=3], apple => [count=10, txid=2]
– If the stored txid is the same as the current one: SKIP the update (strong ordering guarantees it was already applied)
– If they're different: increment the count and store the new txid
– Database state after: man => [count=5, txid=3], dog => [count=4, txid=3], apple => [count=10, txid=2]
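The transactional-state design can be sketched as an idempotent update keyed by txid (an illustration resembling Storm Trident's transactional state, not its actual API):

```python
# Idempotent count updates: store (count, txid) per key and skip a
# batch whose txid was already applied, so replays cannot double-count.
state = {"man": (3, 1), "dog": (4, 3), "apple": (10, 2)}

def apply_batch(state, words, txid):
    for w in set(words):
        count, last_txid = state.get(w, (0, None))
        if last_txid == txid:
            continue                 # batch already applied: skip
        count += words.count(w)      # add occurrences from this batch
        state[w] = (count, txid)

batch = ["man", "man", "dog"]
apply_batch(state, batch, txid=3)    # "dog" already has txid=3 -> skipped
print(state["man"])   # (5, 3)
apply_batch(state, batch, txid=3)    # full replay: no change
print(state["man"])   # (5, 3)
```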
Storm Future Improvements
• Lax security policies
• Performance and scalability improvements
– Presently, SLAs that require processing more than a million
records per second are met with just 20 nodes
• High Availability (HA) Nimbus
– Though Nimbus is presently not a single point of failure, its loss
does degrade functionality
• Enhanced tooling and language support
Machine Learning for Big Data
The Flink model
Based on http://linc.ucy.ac.cy/file/Talks/talks/DeepAnalysiswithApacheFlink_2nd_cloud_workshop.pdf
Data Processing and Machine Learning Methods
• Data processing (third trend)
– Traditional ETL (extract, transform, load)
– Data stores (HBase, …)
– Tools for processing of streaming, multimedia & batch data
• Machine Learning (fourth trend)
– Classification
– Regression
– Clustering
– Collaborative filtering
Working at the intersection of these four trends (Big Datasets, Distributed
Computing, Data Processing/ETL, Machine Learning) is very exciting and
challenging and requires new ways to store and process Big Data.
Apache Flink
• Massively parallel data-flow engine with unified batch and stream
processing
– Batch and Stream APIs on top of a streaming engine
• Performance and ease of use
– Exploits in-memory processing and pipelining, language-embedded
logical APIs
• A runtime that "just works" without tuning
– custom memory management inside the JVM
• Predictable and dependable execution
– Bird’s-eye view of what runs and how, and what failed and why
Flink System Stack
• Rich set of operators
– Map, Reduce, Join, CoGroup, Union, Iterate, Delta Iterate, Filter,
FlatMap, GroupReduce, Project, Aggregate, Distinct, Vertex-Update,
Accumulators
Flink K-Means Clustering
• Cluster analysis in data mining
• Partitions n observations into k clusters
• Assigns each point to the cluster with the smallest (Euclidean) distance
• Input:
– Set of observations X = {x1, x2, …, xn}
– Number of clusters k
– Convergence criterion ξ
• Result:
– Set of clusters C = {c1, c2, …, ck}
Init: select k data points as initial cluster centroids C = {c1, c2, …, ck}
Compute:
do
  foreach x in X
    assign x to the cluster with the closest centroid
  recompute the centroid of each cluster
while ξ is not reached (or a fixed number of iterations)
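The pseudocode above can be run directly in plain Python (a minimal 1-D sketch with made-up points; real Flink code follows on the next slide):

```python
# Minimal k-means on 1-D points, following the pseudocode above.
def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = {c: [] for c in centroids}
        for x in points:
            nearest = min(centroids, key=lambda c: abs(x - c))
            clusters[nearest].append(x)
        # Update step: recompute each centroid as its cluster's mean.
        centroids = [sum(cl) / len(cl) for cl in clusters.values() if cl]
    return sorted(centroids)

pts = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]
print(kmeans(pts, centroids=[0.0, 5.0]))   # converges near [1.0, 9.0]
```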
K-Means in Flink
// initialize
// points: n observations, centroids: initial k centroids
val cntrds = centroids.iterate(10) { currCntrds =>
  val newCntrds = points
    .map(findNearestCntrd).withBroadcastSet(currCntrds, "cntrds")
    .map( (c, p) => (c, p, 1L) )
    .groupBy(0).reduce( (x, y) => (x._1, x._2 + y._2, x._3 + y._3) )
    .map( x => Centroid(x._1, x._2 / x._3) )
  newCntrds
}

{1, 3, 5, 7}.reduce { (x, y) => x + y }
Flink machine learning library
val featureExtractor = HashingFT()
val factorizer = ALS()
val pipeline = featureExtractor.chain(factorizer)
val clickstreamDS =
  env.readCsvFile[(String, String, Int)](clickStreamData)
val parameters = ParameterMap()
  .add(HashingFT.NumFeatures, 1000000)
  .add(ALS.Iterations, 10)
  .add(ALS.NumFactors, 50)
  .add(ALS.Lambda, 1.5)
val factorization = pipeline.fit(clickstreamDS, parameters)
Currently available algorithms: classification, logistic regression, clustering,
recommendation (ALS – alternating least squares)
Ethical Issues on Big Data
Rules and Regulations, Data Economy, Business
Models
Big Data Ethics
• “Big Data” Revolution comparable with Industrial Revolution
– all kinds of human activities and decisions are beginning to be
influenced by big data predictions
• dating,
• shopping,
• medicine,
• education,
• voting,
• law enforcement, terrorism prevention, and cybersecurity
• Privacy
– Privacy as Information Rules
– Shared Private Information Can Remain Confidential
– Transparency
• Identity
Introduction to Big Data 106
Introduction to University Politehnica of Bucharest – Computer Science Department
Big Data
Big Data Ethics Principles
• Beneficial
– it should deliver value to all concerned parties
• Progressive
– deliver better and more valuable results
– minimize data usage to promote more sustainable and less risky
analysis
• Sustainable
– provide value that is sustainable over a reasonable time frame
• Data sustainability
• Algorithmic sustainability
• Device- and/or manufacturer-based sustainability
• Respectful
• Fair
Key issues for Ethics in Big Data
• Privacy
• Transparency
• Confidentiality
• Identity protection
GDPR- General Data Protection Regulation
• A piece of legislation passed by the European Union in April of 2016
• Purpose was to give consumers better control of their personal data as it’s
collected by businesses
• It sets limitations on what companies can do with that data and how long they
can keep it.
• Companies outside of the EU must still comply with the GDPR if they want to do
business with customers in the EU
• Companies may collect only the minimum amount of customer data
needed to conduct business with them.
• Furthermore, businesses will be held accountable for the use and storage of
that data.
• Businesses must designate a Data Protection Officer (DPO) whose function is
to oversee details such as GDPR compliance and data security strategy.
• If there’s a breach, the DPO must report to affected individuals and the
regulators within 72 hours of the incident.
• Penalties for non-compliance are tough. A business that violates GDPR laws
will pay fines of up to four percent of annual global turnover, or 20 million Euros,
whichever is greater.
How GDPR Will Affect Data Collection
• One way GDPR legislation will affect data collection is that it will lead
to an increased reliance on real-time analytics
• Real-time analytics takes data that has just been collected and puts it
to immediate use and analysis.
• With collected data getting an immediate turnaround, there is no need
to keep said data around for any great length of time, which is one
of the issues that the GDPR seeks to address.
• Social media, an avenue that many businesses use for the purpose of
building customer loyalty and increasing engagement, will also be
affected.
• Furthermore, all of that Big Data being collected will not only have to
be stored securely, but will also need to be retrievable by customers
who want to remove it or switch it over to another vendor.
• Businesses of all sizes will need to come to terms with the idea that
customers will gain greater control over their own personal data.
Case Studies
• Trump election
• Target pregnancy ad
• Brexit campaign
• Ted Cruz campaign surge
• UK Tory election campaign
• Obama campaign
• Air France 447 crash
• Compas system
Case studies - Trump election
– Michal Kosinski in 2008 was accepted by Cambridge University to
do his PhD at the Psychometrics Centre
– Kosinski joined fellow student David Stillwell about a year after
Stillwell had launched a little Facebook application
“MyPersonality” app.
• enabled users to fill out different psychometric questionnaires,
including a handful of psychological questions from the Big Five
personality questionnaire
• Based on the evaluation, users received a "personality profile"—
individual Big Five values—and could opt-in to share their Facebook
profile data with the researchers
• Kosinski had expected a few dozen college friends to fill in the
questionnaire, but before long, hundreds, thousands, then millions of
people had revealed their innermost convictions.
• Suddenly, the two doctoral candidates owned the largest dataset
combining psychometric scores with Facebook profiles ever to be
collected
Case studies - Trump election(1)
– The approach of Kosinski and his colleagues:
• First, they provided test subjects with a questionnaire in the form of an online quiz.
• From their responses, the psychologists calculated the personal Big Five values of
respondents
– openness (how open you are to new experiences?), conscientiousness (how much
of a perfectionist are you?), extroversion (how sociable are you?), agreeableness
(how considerate and cooperative you are?) and neuroticism (are you easily upset?)
– we can make a relatively accurate assessment of the kind of person in front of us. This includes
their needs and fears, and how they are likely to behave
• Kosinski's team then compared the results with all sorts of other online data
from the subjects: what they "liked," shared or posted on Facebook, or what
gender, age, place of residence they specified, for example.
• Remarkably reliable deductions could be drawn from simple online actions:
– men who "liked" the cosmetics brand MAC were slightly more likely to be gay;
– one of the best indicators for heterosexuality was "liking" Wu-Tang Clan
– Followers of Lady Gaga were most probably extroverts, while those who "liked"
philosophy tended to be introverts
• While each piece of such information is too weak to produce a reliable
prediction, when tens, hundreds, or thousands of individual data points are
combined, the resulting predictions become really accurate
Case studies - Trump election(2)
– Kosinski and his team tirelessly refined their models.
– In 2012, Kosinski proved that on the basis of an average of 68 Facebook "likes" by
a user, it was possible to predict their skin color (with 95 percent accuracy), their
sexual orientation (88 percent accuracy), and their affiliation to the Democratic or
Republican party (85 percent).
– But it didn't stop there. Intelligence, religious affiliation, as well as alcohol, cigarette
and drug use, could all be determined.
– From the data it was even possible to deduce whether someone's parents were
divorced.
– The strength of their modeling was illustrated by how well it could predict a
subject's answers.
– Kosinski continued to work on the models incessantly: before long, he was able to
evaluate a person better than the average work colleague, merely on the basis of
ten Facebook "likes."
• Seventy "likes" were enough to outdo what a person's friends knew,
• 150 what their parents knew, and
• 300 "likes" what their partner knew.
• More "likes" could even surpass what a person thought they knew about themselves. On
the day that Kosinski published these findings, he received two phone calls. The
threat of a lawsuit and a job offer. Both from Facebook.
Case studies - Trump election (3)
• Only weeks later Facebook "likes" became private by default.
• Before that, the default setting was that anyone on the internet
could see your "likes."
• But this was no obstacle to data collectors: while Kosinski
always asked for the consent of Facebook users, many apps
and online quizzes today require access to private data as a
precondition for taking personality tests.
• Anybody who wants to evaluate themselves based on their
Facebook "likes" can do so on Kosinski's website, and then
compare their results to those of a classic Ocean
questionnaire, like that of the Cambridge Psychometrics
Center.
Case studies - Trump election (4)
• But it was not just about "likes" or even Facebook: Kosinski
and his team could now ascribe Big Five values based purely
on how many profile pictures a person has on Facebook, or
how many contacts they have (a good indicator of
extraversion).
• But we also reveal something about ourselves even when
we're not online.
• For example, the motion sensor on our phone reveals how
quickly we move and how far we travel (this correlates with
emotional instability).
• Our smartphone, Kosinski concluded, is a vast
psychological questionnaire that we are constantly filling
out, both consciously and unconsciously.
• Thank you!