Introduction to University Politehnica of Bucharest – Computer Science Department
Big Data
Introduction to Big Data
2021-2022
Prof.dr.ing. Florin Pop
florin.pop@cs.pub.ro
Dr.ing. Cătălin Negru
catalin.negru@cs.pub.ro
Dr.ing. Bogdan Mocanu
mocanu.bogdan.costel@gmail.ro
SCPD • SSA • ABD • SEM
Introduction to Big Data 1
The Human Face of Big Data
• Big Data will impact every part of your life: https://www.youtube.com/watch?v=0Q3sRSUYmys
• What Exactly Is Big Data? https://www.forbes.com/video/4857597029001
Gartner Tempers The Expectations Of Big Data
The 2017 Big Data Landscape
Data & AI Landscape 2020
Understanding Big Data
2.5 quintillion bytes of data each day
A dataset whose volume, velocity, variety and complexity are beyond the ability of commonly used tools to capture, store, manage, process and analyze it is termed BIG DATA.
Complex situations?
• Data distribution
– The large data set is split into smaller chunks (blocks) and distributed over N nodes or machines.
– Distributed File System.
• Parallel processing
– The distributed data gains the power of the N servers on which it resides, and they work in parallel on processing and analysis.
– MapReduce (Google), Hadoop, Spark, Flink, etc.
• Fault tolerance
– Each block (or chunk) of data is replicated more than once.
• Commodity hardware
– No specialized hardware with special RAID is needed as a data container.
• Flexibility and Scalability
– More rack space can be added to the cluster as the demand for space increases.
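The ideas above (splitting data into chunks, processing them in parallel, replicating each chunk on several nodes) can be sketched in a few lines of Python. This is a toy single-machine model, not a distributed system; all names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, n_chunks):
    """Data distribution: split a dataset into roughly equal chunks."""
    k, m = divmod(len(data), n_chunks)
    return [data[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]
            for i in range(n_chunks)]

def replicate(chunks, n_nodes, factor=3):
    """Fault tolerance: place each chunk on `factor` distinct nodes."""
    return {i: [(i + r) % n_nodes for r in range(factor)]
            for i in range(len(chunks))}

def process(chunk):
    """The per-chunk work each node performs (here: a partial sum)."""
    return sum(chunk)

data = list(range(1000))
chunks = split(data, 4)
placement = replicate(chunks, n_nodes=5)

# Parallel processing: each chunk is handled by a separate worker.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process, chunks))

total = sum(partials)  # combine the partial results
assert total == sum(data)
```

Frameworks like Hadoop and Spark do the splitting, placement and worker scheduling automatically, across real machines.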
A brief history of Data Science
Big Data Timeline v2.0
Illustration by Héizel Vázquez
Big Data Research Challenges
• Heterogeneity and incompleteness
– machine analysis algorithms expect homogeneous data;
– even after data cleaning and error correction, some incompleteness and some errors in data
are likely to remain.
• Scale
– managing large and rapidly increasing volumes of data;
– clock speeds have largely stalled and processors are being built with increasing numbers of
cores;
– parallelism across nodes in a cluster;
– move towards cloud computing;
– large clusters require new ways of determining how to run and execute data processing jobs.
• Timeliness
– acquisition rate challenge and a timeliness challenge;
– many situations in which the result of the analysis is required immediately;
– ex: a full analysis of a user’s purchase history is not likely to be feasible in real-time.
• Privacy
– important to rethink security for information sharing in Big Data use cases;
– we do not understand what it means to share data.
• Human collaboration
– analytics for Big Data will be designed to have a human in the loop;
– crowd-sourcing;
– issues of uncertainty and error => participatory-sensing;
– the extra challenge here is the inherent uncertainty of the data collection devices.
Major Benefits of Big Data
• Answer more questions, more completely (powerful big data
business intelligence platform)
• Faster and better decision making
• Become confident in your accurate data (solving the
problem of inaccurate, incomplete view of the data)
• Empower a new generation of employees (data scientists)
• Cost reduction and revenue generation
Big Remark
We must support and encourage
fundamental research towards addressing all scientific and
technical challenges
if we want to achieve the promised benefits of
Big Data.
Introduction to Big Data.
Distributed platforms
Brief introduction
Datacenters, Cloud computing, HPC, Fog and
Edge computing; Mobile computing; VANET (V2X).
Distributed Applications (practical use cases):
Smart Cities; eHealth; eGovernment; Scientific
Applications
Datacenters
• Processing Big Data, the Google Way
Photos: GOOGLE. Left: twin banks of servers at the data center in Douglas County, Ga., which use energy-thrifty blue LEDs as monitor lights. Right: server racks at the data center in Mayes County, Okla.; each server has four switches connected by a different colored cable, and Google keeps the colors the same throughout its data centers so it knows which one to replace in case of failure.
Cloud Computing
• Definition by IBM:
– Cloud computing, often referred to as simply “the cloud,” is the delivery
of on-demand computing resources—everything from applications to data
centers—over the internet on a pay-for-use basis.
• Cloud Computing Services:
– Software as a Service (SaaS):
• applications are supplied by the cloud provider;
• applications are accessible from various client devices (e.g. web browser, API);
• Examples: GMAIL, Dropbox, Office 365, Amazon Web Services
– Platform as a Service (PaaS)
• deploy applications onto the Cloud infrastructure
• programming languages, tools and libraries are supported by the provider;
• control over the deployed applications
– Infrastructure as a Service (IaaS)
• provision of fundamental computing resources: VMs (CPU, storage, network);
• resources are distributed and support dynamic scaling;
• deploy and run arbitrary software, which can include operating systems and middleware;
• control over operating systems, storage, and deployed applications
• User does not manage or control the underlying Cloud infrastructure.
Cloud Computing - Deployment models
• Private
– Single-tenant architecture
– On-premises hardware
– Direct control of infrastructure
– Ex: IBM, Red Hat, Microsoft, OpenStack,
ownCloud
• Public
– Multi-tenant architecture
– Pay-as-you-go pricing model
– Ex: AWS, Microsoft Azure, Google Cloud
Platform
• Hybrid
– Cloud bursting capabilities
– Benefit of both public and private environments
– Ex: combination of both public and private
providers
HPC
• High Performance Computing most generally refers to the
practice of aggregating computing power in a way that
delivers much higher performance than one could get out
of a typical desktop computer or workstation in order to
solve large problems in science, engineering, or business.
Fog and Edge Computing
Mobile computing
VANET (V2X) – Vehicular ad hoc networks
• Vehicle-to-Vehicle (V2V) among vehicles
– A wireless ad hoc network on the roads, suited for short range vehicular networks;
• Vehicle-to-Infrastructure (V2I) between vehicles and Road-side-Units
– Long range vehicular networks;
• Vehicle-to-X (V2X) mixed approach
– Vehicle-to-everything
Data Science and the Values of
Big Data
The 8V’s: Volume, Velocity, Variety, Variability,
Visualization, Value, Veracity, Vicissitude;
Use cases and current challenges
Starting V
Volume
• Big Data Volume refers to the large amounts of generated
data:
– 40 ZettaBytes by 2020;
• 300 times more than in 2005;
– ~2.5 quintillion bytes (2.3 trillion GB) of data created daily;
– 100 Terabytes stored per company;
– 6 billion cell phones;
– 7 billion people.
• Volume challenges:
– Storage, analysis and processing;
– Even the smallest optimization represents a step ahead in getting
the meaningful data in time and at a low cost;
Velocity
• Velocity refers to the speed of generating data (the speed at
which the data is flowing):
– New York Stock Exchange captures 1TB of trade information
during each trading session;
– Modern cars have about 100 sensors;
– ~18.9 billion network connections
• about 2.5 connections per person;
– Social media messages are
spread in seconds;
• Velocity challenges:
– Response time of delivery services (for financial markets it is real-time);
– Speed of manipulating and analyzing complex data;
– The Big Data system must be able to adjust and deal with the high
loads of data at peak times;
Variety
• Variety refers to the diversity of data:
– 150 Exabytes in healthcare (2011);
– ~30 billion pieces of content shared on Facebook;
– By 2014 ~420 mil. wearable, wireless
health monitors;
– 4 billion hours of videos watched on YouTube per month;
– 400 million tweets sent per day by 200 mil. active users;
• Variety challenges:
– Big Data systems should handle complex data: relational data, raw data,
semi-structured or unstructured data;
– Almost 80% of the world’s data is now unorganized;
• traditional DB can’t be used anymore for storage and management
– different types of data must be processed and mapped to dedicated
resources;
Variability
• Variability refers to data whose
meaning is constantly changing;
• For example, words don’t have static
definitions, and their meaning varies with context:
– “Delicious muesli from the
@imaginarycafe- what a great
way to start the day!”
– “Greatly disappointed that my
local Imaginary Cafe have
stopped stocking BLTs.”
– “Had to wait in line for 45
minutes at the Imaginary Cafe
today. Great, well there’s my
lunchbreak gone…”
– “great” on its own is not a sufficient
signifier of positive sentiment;
• Variability challenges:
– ‘Understand’ context and decode the
precise meaning of words
Visualization
• Visualization refers to presentation of data in a pictorial or
graphical format:
– Analytics presented visually:
• to understand difficult concepts;
• to identify new patterns;
– Interactive visualization:
• charts;
• graphs;
• ex: https://d3js.org/, https://developers.google.com/chart/ ,
http://www.fusioncharts.com/,
• Visualization challenges:
– Develop new techniques for data visualization;
– Data exploration;
– Inconsistencies in the data structure;
Value
• Value refers to the value that can be extracted from datasets:
– it does not matter how big or complex the data volume is;
– it is useless unless we are able to extract the meaningful information;
– storing and processing meaningless data is a waste of money and time, and makes obtaining the relevant information harder;
• Traditional approach to Big Data: systems for storing big data, but getting insights can be difficult:
– time-intensive, manual query and analysis;
– modern dashboarding tools;
• Modern approach to Big Data:
– Machine brains: specialized algorithms that can learn to identify patterns, correlations and indicators through training and repetition;
• Big Data Value challenges:
– developing tools for getting value from Big Data;
Veracity
• Veracity refers to the trustworthiness of the data (uncertainty of data):
– 1 in 3 business leaders don’t trust the information they use to make
decisions;
– 27% of respondents in one survey were unsure how much of their data
was inaccurate;
– Poor data quality costs the US economy about $1.3 trillion/year;
• Causes of veracity issues:
– Rumors, spammers, collection errors,
entry errors, system errors;
• Veracity challenges:
– the volume and complexity of data
are increasing;
– quality and accuracy are becoming
less controllable;
Vicissitude
• Vicissitude refers to the challenge of scaling complex Big Data workflows;
BTWorld: A Large-scale Experiment in Time-Based
Analytics
+ more Vs of Big Data
+ Volatility
+ Veridicity
+ Vision
+…
10 Cool Big Data Projects
• #1. To find exactly what we look for in the internet
• #2. To ride through a city without traffic jams
• #3. To save rare animals, catching poachers
• #4. To make our cities green
• #5. To understand why Indian cuisine is unique
• #6. To fight malaria epidemics in Africa
• #7. To grow ideal Christmas trees
• #8. To understand that our languages are filled with happiness
• #9. To make sport shows even more interesting
• #10. To improve job conditions
• #11. To enhance relationships
https://www.kaspersky.com/blog/cool-big-data-projects/8186/
Distributed Model for Scalable Batch
Computing
Google MapReduce. MapReduce Patterns,
Algorithms and use cases
Philosophy to Scale for Big Data
Divide Work
Combine Results
Distributed processing is non-trivial
• How to assign tasks to different workers in an efficient way?
• What happens if tasks fail?
• How do workers exchange results?
• How to synchronize distributed tasks allocated to different
workers?
• Big Data storage is challenging
– Data Volumes are massive
– Reliability of Storing PBs of data is challenging
– All kinds of failures: Disk/Hardware/Network Failures
– The probability of failures simply increases with the number of
machines …
Google MapReduce
• MapReduce is:
– a programming model;
– implementation for processing and generating big data sets;
• Released in 2004 by Google (Jeffrey Dean and Sanjay
Ghemawat)
• Perform simple computations while hiding the details of:
– Parallelization;
– Data distribution;
– Load balancing;
– I/O scheduling;
– Fault tolerance;
• Distribute data over multiple nodes
– Each node receives a small portion of data
[Figure: a large dataset is split into chunks, processed in parallel by MAP tasks, and combined by REDUCE into the results.]
Google MapReduce (2)
• MapReduce programming model:
– Input & Output: each a set of key/value pairs
• The programmer must specify two functions:
– map(in_key, in_value) → list(out_key, intermediate_value)
• Processes an input key/value pair;
• Produces a set of intermediate pairs;
– reduce(out_key, list(intermediate_value)) → list(out_value)
• Combines all intermediate values for a particular key;
• Produces a set of merged output values;
map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // output_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
        result += ParseInt(v);
    Emit(AsString(result));
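The word-count pseudocode above can be mirrored in a short plain-Python simulation of the model (no distributed runtime; the shuffle step, grouping intermediate values by key, is done by the framework in a real system):

```python
from collections import defaultdict

def map_fn(doc_name, doc_contents):
    # map(in_key, in_value) -> list of (out_key, intermediate_value)
    return [(word, 1) for word in doc_contents.split()]

def shuffle(pairs):
    # Group intermediate values by key, as the runtime would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(word, counts):
    # reduce(out_key, list(intermediate_value)) -> out_value
    return sum(counts)

docs = {"d1": "big data is big", "d2": "data is data"}
intermediate = [pair for name, text in docs.items()
                for pair in map_fn(name, text)]
result = {word: reduce_fn(word, values)
          for word, values in shuffle(intermediate).items()}
# result == {"big": 2, "data": 3, "is": 2}
```

In a real cluster, map_fn runs on many nodes at once and reduce_fn runs per key partition; the logic is unchanged.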
MapReduce Architecture
MapReduce Data Flow
MapReduce Patterns
• Counting and Summing - log analysis, data querying:
– Calculate a total number of occurrences of each word in a large set of
documents;
– Process a log file :
• each record contains a response time;
• it is required to calculate an average response time.
• Collating – ETL, building of inverted indexes:
– Save all items that have the same value of a function into one file
• Filtering (“Grepping”), Parsing, and Validation - log analysis, data
querying, ETL, data validation:
– Collect all records that meet some condition;
– Transform each record (independently from other records) into another
representation
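The counting/summing and filtering patterns above can be sketched in-memory in Python (hypothetical log records; in MapReduce the map phase would emit (key, time) pairs and the reduce phase would average per key):

```python
from collections import defaultdict

# Hypothetical log records: (request, response time in ms).
log = [("GET /a", 120), ("GET /b", 80), ("GET /a", 100), ("GET /c", 60)]

# Filtering ("grepping"): keep only records that meet some condition.
slow = [rec for rec in log if rec[1] > 90]

# Counting and summing: group times by request key (the shuffle step),
# then reduce each group to an average response time.
groups = defaultdict(list)
for url, t in log:
    groups[url].append(t)
avg = {url: sum(ts) / len(ts) for url, ts in groups.items()}
# avg == {"GET /a": 110.0, "GET /b": 80.0, "GET /c": 60.0}
```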
MapReduce Patterns (2)
• Distributed Task Execution – physics and engineering simulations,
numerical analysis, performance testing:
– A large computational problem is divided into multiple parts;
– Results from all parts are combined to obtain the final result.
• Sorting - ETL, data analysis:
– Sort large set of records by some rule;
– Process records in a certain order.
• Iterative Message Passing (Graph Processing) - graph analysis,
web indexing:
– Calculate a state of each entity in a network on the basis of properties of
the other entities in its neighborhood;
– State can represent :
• distance to other nodes;
• indication that there is a neighbor with certain properties;
• characteristic of neighborhood density;
• etc.
MapReduce Patterns (3)
• Distinct Values (Unique Items Counting) – log analysis, unique users
counting
– Set of records that contain fields F and G;
– Count the total number of unique values of field F for each subset of records that
have the same G (grouped by G).
• Cross-Correlation – text analysis, market analysis:
– A set of tuples of items;
– For each possible pair of items, calculate the number of tuples in which they co-occur.
• Relational MapReduce Patterns:
– Selection
– Projection
– Union
– Intersection
– Difference
– GroupBy and Aggregation
– Joining
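The Distinct Values pattern described above is typically run as two MapReduce phases; the same two phases can be sketched in-memory in Python (hypothetical records with F = user, G = country):

```python
from collections import defaultdict

# Hypothetical records as (F, G) pairs.
records = [("u1", "RO"), ("u2", "RO"), ("u1", "RO"),
           ("u3", "DE"), ("u3", "DE")]

# Phase 1: map emits ((G, F), 1); reducing on the composite key
# deduplicates the F values within each group G.
unique_pairs = {(g, f) for f, g in records}

# Phase 2: map emits (G, 1) per unique pair; the reducer sums per G.
distinct_f_per_g = defaultdict(int)
for g, _f in unique_pairs:
    distinct_f_per_g[g] += 1
# distinct_f_per_g == {"RO": 2, "DE": 1}
```

Splitting the work into two phases matters at scale: no single reducer ever has to hold all distinct F values for a group in memory at once.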
Algorithms and use cases
• PageRank and Mapper-Side Data Aggregation
– Used by Google to calculate relevance of a web page as a function of
authoritativeness (PageRank) of pages that have links to this page.
– the algorithm is complex, but at its core it is just a propagation of weights between nodes:
• each node calculates its weight as a mean of the incoming weights:
class N
    State is PageRank
    method getMessage(object N)
        return N.State / N.OutgoingRelations.size()
    method calculateState(state s, data [d1, d2,...])
        return sum([d1, d2,...])

class Mapper
    method Initialize
        H = new AssociativeArray
    method Map(id n, object N)
        p = N.PageRank / N.OutgoingRelations.size()
        Emit(id n, object N)
        for all id m in N.OutgoingRelations do
            H{m} = H{m} + p
    method Close
        for all id n in H do
            Emit(id n, value H{n})

class Reducer
    method Reduce(id m, [s1, s2,...])
        M = null
        p = 0
        for all s in [s1, s2,...] do
            if IsObject(s) then
                M = s
            else
                p = p + s
        M.PageRank = p
        Emit(id m, item M)
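The mapper/reducer pseudocode above boils down to one round of weight propagation. A plain-Python sketch of a single iteration over a hypothetical three-node graph (no damping factor, and with the mapper-side AssociativeArray aggregation folded away for brevity):

```python
from collections import defaultdict

# Hypothetical graph: node -> outgoing links, plus current ranks.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = {"a": 1.0, "b": 1.0, "c": 1.0}

# Map phase: every node splits its rank evenly among its out-links.
messages = defaultdict(list)
for n, out in graph.items():
    share = rank[n] / len(out)
    for m in out:
        messages[m].append(share)

# Reduce phase: a node's new rank is the sum of its incoming shares.
new_rank = {n: sum(messages[n]) for n in graph}
# new_rank == {"a": 1.0, "b": 0.5, "c": 1.5}
```

The mapper-side AssociativeArray in the pseudocode exists to pre-sum the shares emitted by one mapper, cutting the volume of intermediate data shuffled to the reducers.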
Word Count Example
• Mapper
– Input: value: lines of text of input
– Output: key: word, value: 1
• Reducer
– Input: key: word, value: set of counts
– Output: key: word, value: sum
• Launching program
– Defines this job
– Submits job to cluster
Owen O’Malley (Yahoo!)
Word Count Mapper
public static class Map extends MapReduceBase implements
    Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
Word Count Reducer
public static class Reduce extends MapReduceBase implements
    Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Putting it all together
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
Apache Hadoop
Programming environment and use cases
* Based on slides by Dhruba Borthakur, Owen O’Malley (Yahoo!)
The history of an elephant
• Hadoop is an open-source implementation based on Google
File System (GFS) and MapReduce from Google
• Hadoop was created by Doug Cutting and Mike Cafarella in
2005
• Hadoop is an Apache open source project from 2006
• Why Hadoop?
– Need to process huge datasets on large clusters of computers
– Very expensive to build reliability into each application
– Nodes fail every day
• Failure is expected, rather than exceptional
• The number of nodes in a cluster is not constant
– Need a common infrastructure
• Efficient, reliable, easy to use
Apache Hadoop Basic Modules
[Diagram: Hadoop's basic modules, a computation layer and a storage layer.]
Distributed File System
• Single Namespace for entire cluster
• Data Coherency
– Write-once-read-many access model
– Client can only append to existing files
• Files are broken up into blocks
– Typically 64MB block size
– Each block replicated on multiple Data Nodes
• Intelligent Client
– Client can find location of blocks
– Client accesses data directly from Data Node
Hadoop HDFS
• Hadoop Distributed File System (based on the Google File System (GFS)
paper, 2003)
– Serves as the distributed file system for most tools in the Hadoop
ecosystem
– Scalability for large data sets
– Reliability to cope with hardware failures
• HDFS good for:
– Large files
– Streaming data
• Not good for:
– Lots of small files
– Random access to files
– Low latency access
Design of HDFS
• Master-Slave design
– Master Node
• Single NameNode for managing metadata
– Slave Nodes
• Multiple DataNodes for storing data
– Other
• Secondary NameNode as a backup
• HDFS files are divided into blocks
– The block is the basic unit of read/write
– Default size is 64MB, could be larger (128MB)
– Hence HDFS is good for storing large files
• HDFS blocks are replicated multiple times
– One block is stored at multiple locations, also at different racks (usually 3 times)
– This makes HDFS storage fault tolerant and faster to read
[Figure: a client talks to the NameNode (with a Secondary NameNode as backup) and to the DataNodes; the NameNode keeps the metadata (name, location, directory), the DataNodes provide storage for blocks of data, and heartbeats, commands and data flow between them.]
HDFS Architecture
HDFS Inside: Blocks
• Reasons:
– A file can be larger than a single disk;
– A block is of fixed size, easy to manage and manipulate;
– Easy to replicate and do more fine-grained load balancing.
• The HDFS block size is by default 64 MB (larger than a regular file
system block)
– Minimizes overhead: disk seek time is almost constant;
– Example: seek time 10 ms, file transfer rate 100 MB/s, overhead
(seek time / block transfer time) 1%: what is the block size?
– 100 MB (HDFS uses 128 MB in practice)
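The arithmetic behind the example: overhead = seek_time / (block_size / transfer_rate), so solving for the block size that keeps overhead at 1% gives transfer_rate × seek_time / overhead:

```python
seek_time = 0.010        # 10 ms per seek
transfer_rate = 100e6    # 100 MB/s, i.e. 100e6 bytes per second
overhead = 0.01          # target: seek time is 1% of transfer time

# overhead = seek_time / (block_size / transfer_rate)
block_size = seek_time * transfer_rate / overhead
# block_size == 100e6 bytes, i.e. 100 MB
```

A bigger block amortizes the fixed seek cost over more transferred bytes, which is why HDFS blocks are so much larger than local file system blocks.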
NameNode Metadata
• Metadata in Memory
– The entire metadata is in main memory
– No demand paging of metadata
• Types of metadata
– List of files
– List of Blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time, replication factor
• A Transaction Log
– Records file creations, file deletions etc.
DataNode
• A Block Server
– Stores data in the local file system (e.g. ext3)
– Stores metadata of a block (e.g. CRC)
– Serves data and metadata to Clients
• Block Report
– Periodically sends a report of all existing blocks to the NameNode
• Facilitates Pipelining of Data
– Forwards data to other specified DataNodes
Block Placement and Heartbeats
• Current Strategy
– One replica on local node
– Second replica on a remote rack
– Third replica on same remote rack
– Additional replicas are randomly placed
• Clients read from nearest replicas
• Would like to make this policy pluggable
• DataNodes send heartbeats to the NameNode
– Once every 3 seconds
• NameNode uses heartbeats to detect DataNode failure
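Heartbeat-based failure detection can be sketched in a few lines (toy timestamps; the 3-second heartbeat interval is from the slide, while the "dead after 10 missed heartbeats" threshold and the DataNode names are assumptions for the sketch):

```python
HEARTBEAT_INTERVAL = 3.0              # seconds, per the slide
DEAD_AFTER = 10 * HEARTBEAT_INTERVAL  # assumed threshold for the sketch

# Last heartbeat time (seconds) recorded per DataNode.
last_heartbeat = {"dn1": 105.0, "dn2": 128.0, "dn3": 95.0}

def dead_datanodes(now):
    """DataNodes whose last heartbeat is older than the threshold."""
    return [dn for dn, t in last_heartbeat.items() if now - t > DEAD_AFTER]
```

For example, dead_datanodes(130.0) flags only "dn3" (35 s of silence); once a DataNode is declared dead, the NameNode re-replicates the blocks it held.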
Fault tolerance support in HDFS
• Replication of blocks for fault tolerance
[Figure: a file's blocks B1–B4, each replicated three times across the cluster's nodes.]
HDFS Inside: Name Node
• The Name Node keeps a snapshot of the FS and an edit log that records changes to the FS:

Filename | Replication factor | Block IDs
File 1   | 3                  | [1, 2, 3]
File 2   | 2                  | [4, 5, 6]
File 3   | 1                  | [7, 8]

[Figure: the blocks are spread over the Data Nodes so that each block appears as many times as its replication factor.]
HDFS Inside: Read
• Prevent the NN from being the bottleneck of the cluster
• Allow HDFS to scale to a large number of concurrent clients
• Spread the data traffic across the cluster
1. Client connects to the NN to read data
2. NN tells the client where to find the data blocks
3. Client reads the blocks directly from the data nodes (without going through the NN)
4. In case of node failures, the client connects to another node that serves the missing block
HDFS Inside: Write
Trade-offs: reliability, write bandwidth and read bandwidth.
1. Client connects to the NN to write data
2. NN tells the client which data nodes to write to
3. Client writes the blocks directly to the data nodes with the desired replication factor
4. In case of node failures, the NN will detect it and replicate the missing blocks
Example of HDFS shell commands
Create a directory in HDFS
• hadoop fs -mkdir /user/bigdata/dir1
List the content of a directory
• hadoop fs -ls /user/bigdata
Upload and download a file in HDFS
• hadoop fs -put /home/bigdata/file.txt /user/bigdata/datadir/
• hadoop fs -get /user/bigdata/datadir/file.txt /home/
Look at the content of a file
• hadoop fs -cat /user/bigdata/datadir/book.txt
Many more commands, similar to Unix
Who uses Hadoop?
• Amazon/A9
• Facebook
• Google
• New York Times
• Veoh
• Yahoo!
• …. many more
Hadoop Cluster at Yahoo! (Credit: Yahoo)
A real world example of New York Times
• Goal: Make entire archive of articles available online: 11
million, from 1851
• Task: Translate 4 TB TIFF images to PDF files
• Solution: Used Amazon Elastic Compute Cloud (EC2) and
Simple Storage Service (S3)
• Time: < 24 hours
• Costs: ~ $240
Hadoop Ecosystem
• Pig: High-level language for data analysis (Yahoo! Research)
• Hive: SQL-like Query language and Metastore (Facebook)
• HBase: Table storage for semi-structured data (Modeled on
Google’s Bigtable)
• Zookeeper: Coordinating distributed applications
• Mahout: Machine learning
In-Memory Cluster Computing
Apache Spark
Overview of LAMBDA Architecture
• The Lambda Architecture has three major components:
– Batch layer provides the following functionality
• managing the master dataset, an immutable, append-only set of raw data
• pre-computing arbitrary query functions, called batch views.
– Serving layer
• indexes the batch views so that they can be queried ad hoc with low latency.
– Speed layer
• accommodates all requests that are subject to low-latency requirements.
• uses fast and incremental algorithms
• deals with recent data only.
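A query in the Lambda Architecture merges the precomputed batch view with the speed layer's real-time view. A minimal sketch (illustrative only; the view contents are assumed):

```python
# Toy Lambda Architecture query: combine a precomputed batch view
# (counts up to the last batch run) with a real-time view covering
# only the data that arrived since then.
batch_view    = {"page_a": 100, "page_b": 40}   # from the batch layer
realtime_view = {"page_a": 3,   "page_c": 7}    # from the speed layer

def query(key):
    # The serving layer answers by merging both views.
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

print(query("page_a"))  # 103: batch count plus recent increments
print(query("page_c"))  # 7: seen only by the speed layer so far
```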
Apache Spark
• Spark is a general-purpose computing framework for iterative
tasks
• API is provided for Java, Scala and Python
• The model is based on MapReduce
– enhanced with new operations and an engine that supports
execution graphs
• Tools include Spark SQL, MLlib for machine learning, GraphX
for graph processing and Spark Streaming
• Unifies batch, streaming,
interactive computing
• Making it easy to build
sophisticated applications
Key idea in Spark
• Resilient distributed datasets (RDDs)
– Immutable collections of objects across a cluster
– Built with parallel transformations (map, filter, …)
– Automatically rebuilt when failure is detected
– Allow persistence to be controlled (in-memory operation)
• Transformations on RDDs
– Lazy operations that build RDDs from other RDDs
– Each transformation creates a new RDD
• Actions on RDDs
– Count, collect, save
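The distinction between lazy transformations and actions can be imitated with plain Python generators (a conceptual sketch, not the Spark API):

```python
# Transformations (map/filter over a generator) only describe the
# computation; nothing runs until an action (sum) consumes the pipeline.
data = range(1, 6)                            # pretend this is an RDD of 1..5

squares = (x * x for x in data)               # "transformation": lazy
evens   = (x for x in squares if x % 2 == 0)  # another lazy transformation

result = sum(evens)                           # "action": triggers execution
print(result)                                 # 4 + 16 = 20
```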
Spark overview
YARN
MESOS Architecture
Apache Mesos is an open source cluster manager that handles workloads in a
distributed environment through dynamic resource sharing and isolation.
Task Scheduler
• Supports general task graphs
• Pipelines functions where possible
• Cache-aware data reuse & locality
• Partitioning-aware to avoid shuffles
• MESOS provides resource
allocation (offer resources to
framework, accept/reject by
framework scheduler)
Implementing Spark Algorithms
• Broadcast everything
– Master broadcasts data and initial models
– At each iteration updated models are broadcast by master (driver
program)
– Does not scale well due to communication overhead
• Data parallel
– Worker loads data
– Master broadcasts initial models
– At each iteration updated models are broadcast by master
– Works for large datasets, because data is available to workers
• Fully parallel
– Workers load data and they instantiate the models
– At each iteration, models are shared via join between workers
– Much better scalability
Machine Learning and Spark
• Spark RDDs support efficient data sharing
• In-memory caching increases performance
– Reported to have performance of up to 100 times faster than
Hadoop in memory or 10 times faster on disk
• High-level programming interface for complex algorithms
MLBase and MLlib
• MLBase has been designed for simplifying the development of
machine learning pipelines:
– MLlib is a machine learning library
– MLI (ML Developer API) is an API for machine learning
development that aims to abstract low-level details from the
developers
– MLOpt is a declarative layer that aims to automate the machine
learning pipeline
• The idea is that the system searches feature extractors and models
best fit for the ML task
Graph-Parallel Systems
• Graph-based computation depends only on the neighbors of a
particular vertex
– “Think like a Vertex.” – Pregel (SIGMOD 2010)
• Systems with specialized APIs to simplify graph processing
– Pregel from Google
• Push abstraction: Vertex programs interact by sending messages
• Receive msgs, process, send msgs
• GraphLab
– Pull abstraction: Vertex programs access adjacent vertices and
edges
– Foreach (j in neighbours) calculate pagerank total for j
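The pull-style vertex program above can be written out for PageRank on a tiny hard-coded graph (an illustration of the abstraction, not GraphLab code):

```python
# Pull-style PageRank: each vertex recomputes its rank from the
# ranks of its in-neighbours, as in the GraphLab abstraction.
DAMPING = 0.85
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # vertex -> out-links

# Invert edges so each vertex can "pull" from its in-neighbours.
in_links = {v: [] for v in graph}
for src, dsts in graph.items():
    for dst in dsts:
        in_links[dst].append(src)

rank = {v: 1.0 / len(graph) for v in graph}
for _ in range(20):                    # fixed number of iterations
    rank = {v: (1 - DAMPING) / len(graph)
               + DAMPING * sum(rank[u] / len(graph[u]) for u in in_links[v])
            for v in graph}

print({v: round(r, 3) for v, r in rank.items()})
```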
GraphX
• Separation of system support for each view (table, graph) involves
expensive data movement and duplication
• GraphX makes tables and graphs views of the same physical data
• The views have
their own optimized
semantics
– Table operators inherited from Spark
– Graph operators expressed in relational algebra
Performance Gains for PageRank
Revisited
• Spark is reported to be 4x faster than Hadoop
• Graphlab is 16x faster than Spark
• GraphX is roughly 3x slower than Graphlab
• GraphX is reported to compare favourably to Graphlab with
pipelines (raw -> hyperlink -> pagerank -> top 20)
• Graph structure can be exploited for significant performance
gains
Spark Streaming
• Spark extension for accepting and processing high-throughput
live data streams
• Data is accepted from various sources
– Kafka, Flume, TCP sockets, Twitter, …
• Machine learning algorithms and graph processing algorithms can be
applied for the streams
• Similar systems
– Twitter (Storm),
Google (MillWheel),
Yahoo! (S4)
Stream Processing: Discretized Streams
• Streaming computation: a series of very small deterministic batch jobs
– Live stream is divided into batches of x seconds
– Each batch of data is an RDD and RDD operations can be used
– Results are also returned in batches
– Batch sizes as low as 0.5 seconds result in approximately one-second latency
• Can combine streaming and batch processing
Click stream processing example
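The micro-batch idea can be sketched in plain Python (a conceptual illustration, not the DStream API; interval size and events are made up):

```python
# Discretized streams: chop a timestamped event stream into
# fixed-size micro-batches and run ordinary batch logic on each one.
BATCH_SECONDS = 1

events = [(0.2, "click"), (0.7, "view"), (1.1, "click"), (1.9, "click")]

batches = {}
for t, ev in events:
    batches.setdefault(int(t // BATCH_SECONDS), []).append(ev)

# Per-batch "RDD operation": count clicks in each interval.
counts = {b: sum(1 for e in evs if e == "click") for b, evs in batches.items()}
print(counts)   # {0: 1, 1: 2}
```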
Event vs. Processing Time
• There is a difference between event time (te) and processing time (tp)
• Events arrive out-of-order even during normal operation
• Events may arrive arbitrarily late
• Apply a grace period before processing events
• Allow arbitrary update windows of metrics
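The grace-period idea can be sketched as follows: a window is only finalized once processing time has passed the window's end plus an allowed lateness (the window and grace values are illustrative):

```python
# Grace period for late events: keep a window open for GRACE seconds
# of processing time after its event-time end before finalizing it.
WINDOW, GRACE = 10, 2

def window_of(event_time):
    return event_time // WINDOW          # window id for an event

def is_closed(window_id, processing_time):
    window_end = (window_id + 1) * WINDOW
    return processing_time >= window_end + GRACE

# An event with event time 9 belongs to window 0 (covering [0, 10)).
# At processing time 11 the window is still open (grace not expired)...
print(is_closed(window_of(9), 11))   # False
# ...but at processing time 12 it is finalized.
print(is_closed(window_of(9), 12))   # True
```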
K – Architecture
• A distributed data operating system is emerging
– Supported by YARN and MESOS
• Various data services on top of this (Hadoop and Spark)
• Some services are being integrated (for example in Spark) for better
coherence and performance
• Important points
– Data format (row/column, block size)
– Network topology and data/code placement
– Algorithm structure and coordination
– Scheduling and resource management
• Big Data Frameworks are evolving
– Spark represents unification of streaming, machine learning and graphs
• Big Data pipeline management is at an early stage
– How to achieve better mapping between cluster resources, scheduling, and
pipelines?
Frameworks for real-time data stream
analytics
Apache Storm
Apache Storm Architecture
• Nimbus:
– Distributes code around the cluster
– Assigns tasks to machines (supervisors)
– Failure monitoring
– Is fail-fast and stateless
• Supervisor:
– Listens for work assigned to its machine
– Starts and stops worker processes
• Topology: processes messages forever
– A running topology consists of many worker processes spread across
many machines (graph of computation)
• Zookeeper:
– Handles cluster coordination
– Monitors worker node status
– Helps the Supervisors interact with the Nimbus
Apache Storm
• Simple programming model
– Topology: a computation graph that runs forever
• a node in a topology contains processing logic
• links between nodes indicate how data should be passed around
between nodes
– Spouts: sources of streams
– Bolts: consume input streams
• Programming-language agnostic
– (Clojure, Java, Ruby, Python by default)
• Fault-tolerant
• Horizontally scalable
– Ex: 1,000,000 messages per second on a 10-node cluster
• Guaranteed message processing
• Fast: uses the ZeroMQ message queue
• Local mode: easy unit testing
Storm Streams and Spouts
• Stream
– Unbounded sequence of tuples (Storm's data model)
• <key, value(s)> pairs, e.g. <"UIUC", 5>
• Spouts
– Source of streams: e.g. the Twitter firehose API, a stream of
tweets, or some crawler
– Have interfaces that you implement to run application-specific
logic
Storm Bolts
• Bolts:
– Process one or more input streams and produce new streams
• Functions
– Filter, join, apply/transform, etc.
– Parallelize to make it fast: multiple tasks constitute a bolt
Storm Topology & Grouping
• Topology:
– Graph of computation; can have cycles
– Network of spouts and bolts
– Spouts and bolts execute as many tasks across the cluster
• Grouping
– How are tuples sent between the components/tasks?
• Shuffle grouping
– Distributes the stream "randomly" to the bolt's tasks
• Fields grouping
– Groups a stream by a subset of its fields
• All grouping
– All tasks of the bolt receive all input tuples; useful for joins
• Global grouping
– The entire stream goes to a single one of the bolt's tasks
– Picks the task with the lowest id
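Fields grouping can be illustrated with a simple hash: tuples with the same key are always routed to the same task (a sketch of the idea, not Storm's internal code):

```python
# Fields grouping: pick the destination task by hashing the grouping
# field, so tuples with equal keys always reach the same bolt task.
NUM_TASKS = 4

def fields_grouping(key):
    # A stable, deterministic hash keeps the mapping consistent.
    return sum(key.encode()) % NUM_TASKS

tuples = [("UIUC", 5), ("MIT", 3), ("UIUC", 2)]
routed = [(fields_grouping(k), (k, v)) for k, v in tuples]
print(routed)   # both "UIUC" tuples land on the same task id
```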
Zookeeper
• Open-source server for highly reliable distributed coordination
• Acts as a replicated synchronization service with eventual
consistency
• Features:
– Robust
– Persistent data replicated across multiple nodes
– A master node for writes
– Concurrent reads
– Comprises a tree of znodes: entities roughly representing
file-system nodes
– Use it only for saving small configuration data
Storm Message Processing
• Guaranteed message processing:
– A message is "fully processed" when the tuple tree has been
exhausted and every message in the tree has been processed
– A tuple is considered failed when its tree of messages fails to be
fully processed within a specified timeout.
Fault Tolerance APIs
• Emit(tuple, output)
– Emits an output tuple, perhaps anchored on an input tuple (the first
argument)
• Ack(tuple)
– Acknowledges that you (the bolt) finished processing a tuple
• Fail(tuple)
– Immediately fails the spout tuple at the root of the tuple tree if there
is an exception from the database, etc.
• You must remember to ack/fail each tuple
– Each tuple consumes memory; failing to do so results in memory
leaks
Storm Fault Tolerance
• Anchoring
– Specifies a link in the tuple tree
• (anchors an output to one or more input tuples)
– Done at the time of emitting a new tuple
– Enables replaying one or more tuples
• "Acker" tasks
– Track the DAG of tuples for every spout
– Every tuple (spout/bolt) is given a random 64-bit id
– Every tuple knows the ids of all spout tuples for which it exists in the tree
• How?
– Every individual tuple must be acked
– If not, the task will run out of memory!
– Filter bolts ack at the end of execution
– Join/aggregation bolts use multi-ack
Failure Handling
• A tuple isn't acked because the task died: Spout tuple ids at the root of the
trees for the failed tuple will time out and be replayed.
• Acker task dies: All the spout tuples the acker was tracking will time out and be
replayed.
• Spout task dies: The source that the spout talks to is responsible for replaying the
messages.
• For example, queues like Kestrel and RabbitMQ will place all pending messages back on
the queue when a client disconnects.
• Major breakthrough: the tracking algorithm
– Storm uses mod hashing to map a spout tuple id to an acker task
• The acker task stores a map from a spout tuple id to a pair of values:
• the task id that created the spout tuple
• a 64-bit number, the "ack val"
– the XOR of all tuple ids that have been created/acked in the tree
– The tuple tree is completed when ack val = 0
• Configuring reliability
– Set Config.TOPOLOGY_ACKERS to 0, or
– emit tuples as unanchored tuples
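The XOR trick can be demonstrated directly: every tuple id is XORed into the ack val once when created and once when acked, so the value returns to zero exactly when the whole tree is done (the ids below are hypothetical):

```python
# Storm-style ack tracking: XOR each tuple id into ack_val on creation
# and again on ack; ack_val == 0 <=> the tree is fully processed.
ack_val = 0
ids = [0x9f3a, 0x41c7, 0x77aa]   # hypothetical tuple ids (64-bit in Storm)

for i in ids:                    # tuples created: XOR their ids in
    ack_val ^= i
print(ack_val == 0)              # False: tree not fully processed yet

for i in ids:                    # tuples acked: XOR the same ids again
    ack_val ^= i
print(ack_val == 0)              # True: every created tuple was acked
```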
Failure Scenario
• Let's take a scenario:
– Count aggregation of your stream
– Store running count in database. Increment count after processing tuple.
– Failure!
• Design:
– Tuples are processed as small batches.
– Each batch of tuples is given a unique id called the "transaction id" (txid).
– If the batch is replayed, it is given the exact same txid.
– State updates are ordered among batches.
• Example: processing a batch (["man"], ["man"], ["dog"]) with txid = 3
– Database state before: man => [count=3, txid=1], dog => [count=4, txid=3], apple => [count=10, txid=2]
– If the stored txid is the same as the current one: SKIP the update (strong ordering guarantees it was already applied)
– If they're different: increment the count and store the new txid
– Database state after: man => [count=5, txid=3], dog => [count=4, txid=3], apple => [count=10, txid=2]
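The transactional-state design can be sketched as an idempotent update keyed by txid (an illustration resembling Storm Trident's transactional state, not its actual API):

```python
# Idempotent count updates: store (count, txid) per key and skip a
# batch whose txid was already applied, so replays cannot double-count.
state = {"man": (3, 1), "dog": (4, 3), "apple": (10, 2)}

def apply_batch(state, words, txid):
    for w in set(words):
        count, last_txid = state.get(w, (0, None))
        if last_txid == txid:
            continue                 # batch already applied: skip
        count += words.count(w)      # add occurrences from this batch
        state[w] = (count, txid)

batch = ["man", "man", "dog"]
apply_batch(state, batch, txid=3)    # "dog" already has txid=3 -> skipped
print(state["man"])   # (5, 3)
apply_batch(state, batch, txid=3)    # full replay: no change
print(state["man"])   # (5, 3)
```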
Storm Future Improvements
• Lax security policies
• Performance and scalability improvements
– Presently, SLAs that require processing more than a million
records per second are met with just 20 nodes
• High Availability (HA) Nimbus
– Though Nimbus is presently not a single point of failure, its loss
does degrade functionality
• Enhanced tooling and language support
Machine Learning for Big Data
The Flink model
Based on http://linc.ucy.ac.cy/file/Talks/talks/DeepAnalysiswithApacheFlink_2nd_cloud_workshop.pdf
Data Processing and Machine Learning Methods
• Data processing (third trend)
– Traditional ETL (extract, transform, load)
– Data stores (HBase, …)
– Tools for processing of streaming, multimedia & batch data
• Machine Learning (fourth trend)
– Classification
– Regression
– Clustering
– Collaborative filtering
Working at the intersection of these four trends (Big Datasets, Distributed
Computing, Data Processing/ETL, Machine Learning) is very exciting and
challenging and requires new ways to store and process Big Data.
Apache Flink
• Massively parallel data-flow engine with unified batch and stream
processing
– Batch and Stream APIs on top of a streaming engine
• Performance and ease of use
– Exploits in-memory processing and pipelining, language-embedded
logical APIs
• A runtime that "just works" without tuning
– custom memory management inside the JVM
• Predictable and dependable execution
– Bird’s-eye view of what runs and how, and what failed and why
Flink System Stack
• Rich set of operators
– Map, Reduce, Join, CoGroup, Union, Iterate, Delta Iterate, Filter,
FlatMap, GroupReduce, Project, Aggregate, Distinct, Vertex-Update,
Accumulators
Flink K-Means Clustering
• Cluster analysis in data mining
• Partitions n observations into k clusters
• Assigns each point to the cluster with the smallest (Euclidean) distance
• Input:
– Set of observations X = {x1, x2, …, xn}
– Number of clusters k
– Convergence criterion ξ
• Result:
– Set of clusters C = {c1, c2, …, ck}
Init: select k data points as initial cluster centroids C = {c1, c2, …, ck}
Compute:
do
  foreach x in X
    assign x to the cluster with the closest centroid
  recompute the centroid of each cluster
while ξ is not reached (or a fixed number of iterations)
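The pseudocode above can be run directly in plain Python (a minimal 1-D sketch with made-up points; real Flink code follows on the next slide):

```python
# Minimal k-means on 1-D points, following the pseudocode above.
def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = {c: [] for c in centroids}
        for x in points:
            nearest = min(centroids, key=lambda c: abs(x - c))
            clusters[nearest].append(x)
        # Update step: recompute each centroid as its cluster's mean.
        centroids = [sum(cl) / len(cl) for cl in clusters.values() if cl]
    return sorted(centroids)

pts = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]
print(kmeans(pts, centroids=[0.0, 5.0]))   # converges near [1.0, 9.0]
```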
K-Means in Flink
// initialize
// points: n observations, centroids: initial k centroids
val cntrds = centroids.iterate(10) { currCntrds =>
  val newCntrds = points
    .map(findNearestCntrd).withBroadcastSet(currCntrds, "cntrds")
    .map( (c, p) => (c, p, 1L) )
    .groupBy(0).reduce( (x, y) => (x._1, x._2 + y._2, x._3 + y._3) )
    .map( x => Centroid(x._1, x._2 / x._3) )
  newCntrds
}

{1, 3, 5, 7}.reduce { (x, y) => x + y }
Flink machine learning library
val featureExtractor = HashingFT()
val factorizer = ALS()
val pipeline = featureExtractor.chain(factorizer)
val clickstreamDS =
  env.readCsvFile[(String, String, Int)](clickStreamData)
val parameters = ParameterMap()
  .add(HashingFT.NumFeatures, 1000000)
  .add(ALS.Iterations, 10)
  .add(ALS.NumFactors, 50)
  .add(ALS.Lambda, 1.5)
val factorization = pipeline.fit(clickstreamDS, parameters)
Currently available algorithms: classification, logistic regression, clustering,
recommendation (ALS – alternating least squares)
Ethical Issues on Big Data
Rules and Regulations, Data Economy, Business
Models
Big Data Ethics
• “Big Data” Revolution comparable with Industrial Revolution
– all kinds of human activities and decisions are beginning to be
influenced by big data predictions
• dating,
• shopping,
• medicine,
• education,
• voting,
• law enforcement, terrorism prevention, and cybersecurity
• Privacy
– Privacy as Information Rules
– Shared Private Information Can Remain Confidential
– Transparency
• Identity
Introduction to Big Data 106
Introduction to University Politehnica of Bucharest – Computer Science Department
Big Data
Big Data Ethics Principles
• Beneficial
– it should deliver value to all concerned parties
• Progressive
– deliver better and more valuable results
– minimize data usage to promote more sustainable and less risky
analysis
• Sustainable
– provide value that is sustainable over a reasonable time frame
• Data sustainability
• Algorithmic sustainability
• Device- and/or manufacturer-based sustainability
• Respectful
• Fair
Key issues for Ethics in Big Data
• Privacy
• Transparency
• Confidentiality
• Identity protection
GDPR- General Data Protection Regulation
• A piece of legislation passed by the European Union in April of 2016
• Purpose was to give consumers better control of their personal data as it’s
collected by businesses
• It sets limitations on what companies can do with that data and how long they
can keep it.
• Companies outside of the EU must still comply with the GDPR if they want to do
business with customers in the EU
• Companies may collect only the minimum amount of customer data
needed to conduct business with them.
• Furthermore, businesses will be held accountable for the use and storage of
that data.
• Businesses must designate a Data Protection Officer (DPO) whose function is
to oversee details such as GDPR compliance and data security strategy.
• If there’s a breach, the DPO must report to affected individuals and the
regulators within 72 hours of the incident.
• Penalties for non-compliance are tough. A business that violates GDPR laws
will pay fines of up to four percent of annual global turnover, or 20 million Euros,
whichever is greater.
How GDPR Will Affect Data Collection
• One way GDPR legislation will affect data collection is that it will lead
to an increased reliance on real-time analytics
• Real-time analytics takes data that has just been collected and puts it
to immediate use and analysis.
• With collected data getting an immediate turnaround, there is no need
to keep said data around for any great length of time, which is one
of the issues that the GDPR seeks to address.
• Social media, an avenue that many businesses use for the purpose of
building customer loyalty and increasing engagement, will also be
affected.
• Furthermore, all of that Big Data being collected will not only have to
be stored securely, but will also need to be retrievable by customers
who want to remove it or switch it over to another vendor.
• Businesses of all sizes will need to come to terms with the idea that
customers will gain greater control over their own personal data.
Case Studies
• Trump election
• Target pregnancy ad
• Brexit campaign
• Ted Cruz campaign surge
• UK Tory election campaign
• Obama campaign
• Air France 447 crash
• Compas system
Case studies - Trump election
– Michal Kosinski in 2008 was accepted by Cambridge University to
do his PhD at the Psychometrics Centre
– Kosinski joined fellow student David Stillwell about a year after
Stillwell had launched a little Facebook application
“MyPersonality” app.
• enabled users to fill out different psychometric questionnaires,
including a handful of psychological questions from the Big Five
personality questionnaire
• Based on the evaluation, users received a "personality profile"—
individual Big Five values—and could opt-in to share their Facebook
profile data with the researchers
• Kosinski had expected a few dozen college friends to fill in the
questionnaire, but before long, hundreds, thousands, then millions of
people had revealed their innermost convictions.
• Suddenly, the two doctoral candidates owned the largest dataset
combining psychometric scores with Facebook profiles ever to be
collected
Case studies - Trump election(1)
– The approach of Kosinski and his colleagues:
• First, they provided test subjects with a questionnaire in the form of an online quiz.
• From their responses, the psychologists calculated the personal Big Five values of
respondents
– openness (how open you are to new experiences?), conscientiousness (how much
of a perfectionist are you?), extroversion (how sociable are you?), agreeableness
(how considerate and cooperative you are?) and neuroticism (are you easily upset?)
– we can make a relatively accurate assessment of the kind of person in front of us. This includes
their needs and fears, and how they are likely to behave
• Kosinski's team then compared the results with all sorts of other online data
from the subjects: what they "liked," shared or posted on Facebook, or what
gender, age, place of residence they specified, for example.
• Remarkably reliable deductions could be drawn from simple online actions:
– men who "liked" the cosmetics brand MAC were slightly more likely to be gay;
– one of the best indicators for heterosexuality was "liking" Wu-Tang Clan
– Followers of Lady Gaga were most probably extroverts, while those who "liked"
philosophy tended to be introverts
• While each piece of such information is too weak to produce a reliable
prediction, when tens, hundreds, or thousands of individual data points are
combined, the resulting predictions become really accurate
Case studies - Trump election(2)
– Kosinski and his team tirelessly refined their models.
– In 2012, Kosinski proved that on the basis of an average of 68 Facebook "likes" by
a user, it was possible to predict their skin color (with 95 percent accuracy), their
sexual orientation (88 percent accuracy), and their affiliation to the Democratic or
Republican party (85 percent).
– But it didn't stop there. Intelligence, religious affiliation, as well as alcohol, cigarette
and drug use, could all be determined.
– From the data it was even possible to deduce whether someone's parents were
divorced.
– The strength of their modeling was illustrated by how well it could predict a
subject's answers.
– Kosinski continued to work on the models incessantly: before long, he was able to
evaluate a person better than the average work colleague, merely on the basis of
ten Facebook "likes."
• Seventy "likes" were enough to outdo what a person's friends knew,
• 150 what their parents knew, and
• 300 "likes" what their partner knew.
• More "likes" could even surpass what a person thought they knew about themselves. On
the day that Kosinski published these findings, he received two phone calls. The
threat of a lawsuit and a job offer. Both from Facebook.
Case studies - Trump election (3)
• Only weeks later Facebook "likes" became private by default.
• Before that, the default setting was that anyone on the internet
could see your "likes."
• But this was no obstacle to data collectors: while Kosinski
always asked for the consent of Facebook users, many apps
and online quizzes today require access to private data as a
precondition for taking personality tests.
• Anybody who wants to evaluate themselves based on their
Facebook "likes" can do so on Kosinski's website, and then
compare their results to those of a classic Ocean
questionnaire, like that of the Cambridge Psychometrics
Center.
Case studies - Trump election (4)
• But it was not just about "likes" or even Facebook: Kosinski
and his team could now ascribe Big Five values based purely
on how many profile pictures a person has on Facebook, or
how many contacts they have (a good indicator of
extraversion).
• But we also reveal something about ourselves even when
we're not online.
• For example, the motion sensor on our phone reveals how
quickly we move and how far we travel (this correlates with
emotional instability).
• Our smartphone, Kosinski concluded, is a vast
psychological questionnaire that we are constantly filling
out, both consciously and unconsciously.
• Thank you!