
Introduction to Big Data
University Politehnica of Bucharest – Computer Science Department


2021-2022

Prof.dr.ing. Florin Pop


florin.pop@cs.pub.ro
Dr.ing. Cătălin Negru
catalin.negru@cs.pub.ro
Dr.ing. Bogdan Mocanu
mocanu.bogdan.costel@gmail.ro

SCPD • SSA • ABD • SEM



The Human Face of Big Data

• Big Data will impact every part of your life: https://www.youtube.com/watch?v=0Q3sRSUYmys


• What Exactly Is Big Data? https://www.forbes.com/video/4857597029001


Gartner Tempers The Expectations Of Big Data


The 2017 Big Data Landscape


Data & AI Landscape 2020


Understanding Big Data

2.5 quintillion bytes of data each day


A dataset whose volume, velocity, variety and complexity are beyond the ability of commonly used tools to capture, process, store, manage and analyze it can be termed BIG DATA.


Complex situations?
• Data distribution
– The large data set is split into chunks or smaller blocks and distributed over N nodes or machines.
– Distributed File System.
• Parallel processing
– The distributed data benefits from the combined power of the N servers or machines on which it resides, which work in parallel on its processing and analysis.
– MapReduce (Google), Hadoop, Spark, Flink, etc.
• Fault tolerance
– Each block (or chunk) of data is replicated more than once.
• Commodity hardware
– No specialized hardware with special RAID is needed as the data container.
• Flexibility and Scalability
– More rack space can be added to the cluster as the demand for space increases.


A brief history of Data Science


Big Data Timeline v2.0

Illustration by Héizel Vázquez


Big Data Research Challenges


• Heterogeneity and incompleteness
– machine analysis algorithms expect homogeneous data;
– even after data cleaning and error correction, some incompleteness and some errors in data
are likely to remain.
• Scale
– managing large and rapidly increasing volumes of data;
– clock speeds have largely stalled and processors are being built with increasing numbers of
cores;
– parallelism across nodes in a cluster;
– move towards cloud computing;
– large clusters require new ways of determining how to run and execute data processing jobs.
• Timeliness
– acquisition rate challenge and a timeliness challenge;
– many situations in which the result of the analysis is required immediately;
– ex: a full analysis of a user’s purchase history is not likely to be feasible in real-time.
• Privacy
– important to rethink security for information sharing in Big Data use cases;
– we do not understand what it means to share data.
• Human collaboration
– analytics for Big Data will be designed to have a human in the loop;
– crowd-sourcing;
– issues of uncertainty and error => participatory-sensing;
– the extra challenge here is the inherent uncertainty of the data collection devices.


Major Benefits of Big Data

• Answer more questions, more completely (a powerful big data business intelligence platform)

• Faster and better decision making

• Become confident in your accurate data (solving the problem of an inaccurate, incomplete view of the data)

• Empower a new generation of employees (data scientists)

• Cost reduction and revenue generation


Big Remark

We must support and encourage fundamental research towards addressing all scientific and technical challenges if we want to achieve the promised benefits of Big Data.


Introduction to Big Data.

Distributed platforms
Brief introduction

Datacenters, Cloud computing, HPC, Fog and Edge computing; Mobile computing; VANET (V2X).
Distributed Applications (practical use cases): Smart Cities; eHealth; eGovernment; Scientific Applications


Datacenters
• Processing Big Data, the Google Way

[Photo: Google] Twin banks of servers at the data center in Douglas County, Ga. Each server uses energy-thrifty blue LEDs as monitor lights.
[Photo: Google] Server racks at the data center in Mayes County, Okla. Each server has four switches connected by a different colored cable; Google keeps the colors the same throughout its data centers so it knows which one to replace in case of failure.


Cloud Computing
• Definition by IBM:
– Cloud computing, often referred to as simply “the cloud,” is the delivery
of on-demand computing resources—everything from applications to data
centers—over the internet on a pay-for-use basis.
• Cloud Computing Services:
– Software as a Service (SaaS):
• applications are supplied by the cloud provider;
• applications are accessible from various client devices (e.g. web browser, API);
• Examples: GMAIL, Dropbox, Office 365, Amazon Web Services
– Platform as a Service (PaaS)
• deploy applications onto the Cloud infrastructure
• programming languages, tools and libraries are supported by the provider;
• control over the deployed applications
– Infrastructure as a Service (IaaS)
• provision with fundamental computing resources: VM’s (CPU, storage, network);
• resources are distributed and support dynamic scaling;
• deploy and run arbitrary software, meaning middleware, and operating systems;
• control over operating systems, storage, and deployed applications
• User does not manage or control the underlying Cloud infrastructure.


Cloud Computing - Deployment models


• Private
– Single-tenant architecture
– On-premises hardware
– Direct control of infrastructure
– Ex: IBM, Red Hat, Microsoft, OpenStack, ownCloud
• Public
– Multi-tenant architecture
– Pay-as-you-go pricing model
– Ex: AWS, Microsoft Azure, Google Cloud Platform
• Hybrid
– Cloud bursting capabilities
– Benefit of both public and private environments
– Ex: combination of both public and private providers


HPC
• High Performance Computing most generally refers to the
practice of aggregating computing power in a way that
delivers much higher performance than one could get out
of a typical desktop computer or workstation in order to
solve large problems in science, engineering, or business.


Fog and Edge Computing


Mobile computing


VANET (V2X) - Vehicular Ad Hoc Networks


• Vehicle-to-Vehicle (V2V) among vehicles
– A wireless ad hoc network on the roads, suited for short range vehicular networks;
• Vehicle-to-Infrastructure (V2I) between vehicles and Road-side-Units
– Long range vehicular networks;
• Vehicle-to-X (V2X) mixed approach
– Vehicle-to-everything


Data Science and the Values of Big Data

The 8V’s: Volume, Velocity, Variety, Variability, Visualization, Value, Veracity, Vicissitude;
Use cases and current challenges


Starting V


Volume
• Big Data Volume refers to the large amounts of generated
data:
– 40 ZettaBytes by 2020;
• 300 times more than 2005;
– ~2.5 quintillion bytes (2.3 trillion GB) of data created daily;
– 100 Terabytes stored per company;
– 6 billion cell phones;
– 7 billion people.
• Volume challenges:
– Storage, analysis and processing;
– Even the smallest optimization represents a step ahead in getting
the meaningful data in time and at a low cost;


Velocity
• Velocity refers to the speed of generating data (the speed at
which the data is flowing):
– New York Stock Exchange captures 1TB of trade information
during each trading session;
– Modern cars have about 100 sensors;
– ~18.9 billion network connections
• about 2.5 connections per person;
– Social media messages are
spread in seconds;
• Velocity challenges:
– Response time of delivery services (for financial markets it is real-time);
– Speed of manipulating and analyzing complex data;
– The Big Data system must be able to adjust and deal with the high
loads of data at peak times;


Variety
• Variety refers to the diversity of data:
– 150 Exabytes in healthcare (2011);
– ~30 billion pieces of content shared on Facebook;
– By 2014, ~420 million wearable, wireless health monitors;
– 4 billion hours of videos watched on YouTube per month;
– 400 million tweets sent per day by 200 million active users;
• Variety challenges:
– Big Data systems should handle complex data: relational data, raw data,
semi-structured or unstructured data;
– Almost 80% of the world’s data is now unorganized;
• traditional DB can’t be used anymore for storage and management
– different types of data must be processed and mapped to dedicated
resources;


Variability
• Variability refers to data whose
meaning is constantly changing;
• For example, words don’t have static definitions and their meaning varies with context:
– “Delicious muesli from the
@imaginarycafe- what a great
way to start the day!”
– “Greatly disappointed that my
local Imaginary Cafe have
stopped stocking BLTs.”
– “Had to wait in line for 45
minutes at the Imaginary Cafe
today. Great, well there’s my
lunchbreak gone…”
– “great” on its own is not a sufficient
signifier of positive sentiment;
• Variability challenges:
– ‘Understand’ context and decode the
precise meaning of words


Visualization
• Visualization refers to presentation of data in a pictorial or
graphical format:
– Analytics presented visually:
• to understand difficult concepts;
• to identify new patterns;
– Interactive visualization:
• charts;
• graphs;
• ex: https://d3js.org/, https://developers.google.com/chart/ ,
http://www.fusioncharts.com/,
• Visualization challenges:
– Develop new techniques for data visualization;
– Data exploration;
– Inconsistencies in the data structure;


Value
• Value refers to the value that can be extracted from datasets:
– it does not matter how big or how complex the data volume is,
– it is useless unless we are able to extract the meaningful information;
– storing and processing meaningless data represents:
• a waste of money and time,
• and it makes obtaining the relevant information harder;
– Traditional approach to Big Data: systems for storing big data, but getting
insights can be difficult:
– time-intensive, manual query and analysis;
– modern dashboarding tools;
– Modern approach to Big Data:
• Machine brains: specialized algorithms that :
– can learn to identify patterns,
– correlations and
– indicators through training and repetition;
• Big Data Value challenges:
– developing tools for getting value form Big Data;


Veracity
• Veracity refers to the trustworthiness of the data (uncertainty of data):
– 1 in 3 business leaders don’t trust the information they use to make
decisions;
– 27% of respondents in one survey were unsure how much of their data
was inaccurate;
– Poor data quality costs the US economy about $1.3 trillion/year;
• Causes of veracity:
– Rumors, spammers, collection errors,
entry errors, system errors;
• Veracity challenges:
– the volume and complexity of data
are increasing;
– quality and accuracy are becoming
less controllable;


Vicissitude
• Vicissitude property refers to the challenge of scaling Big Data
complex workflows;

BTWorld: A Large-scale Experiment in Time-Based Analytics


+ more Vs of Big Data

+ Volatility
+ Veridicity
+ Vision
+…


10 Cool Big Data Projects


• #1. To find exactly what we look for in the internet
• #2. To ride through a city without traffic jams
• #3. To save rare animals, catching poachers
• #4. To make our cities green
• #5. To understand why Indian cuisine is unique
• #6. To fight malaria epidemics in Africa
• #7. To grow ideal Christmas trees
• #8. To understand that our languages are filled with happiness
• #9. To make sport shows even more interesting
• #10. To improve job conditions
• #11. To enhance relationship

https://www.kaspersky.com/blog/cool-big-data-projects/8186/

Distributed Model for Scalable Batch Computing

Google MapReduce. MapReduce Patterns, Algorithms and use cases


Philosophy to Scale for Big Data

Divide Work

Combine Results
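As a toy illustration of this philosophy (a sketch added here, not from the original slides): split the data into chunks, process each chunk independently, then combine the partial results. In plain Java:

import java.util.Arrays;
import java.util.List;

public class DivideAndCombine {
    public static void main(String[] args) {
        // Divide work: three independent chunks of a larger dataset
        List<int[]> chunks = Arrays.asList(
                new int[]{1, 2, 3}, new int[]{4, 5, 6}, new int[]{7, 8, 9});

        // Each chunk is summed on its own (possibly in parallel),
        // then the partial sums are combined into a single result.
        int total = chunks.parallelStream()
                          .mapToInt(chunk -> Arrays.stream(chunk).sum())
                          .sum();

        System.out.println(total); // 45
    }
}

MapReduce, Hadoop and Spark apply the same idea at cluster scale, with the framework taking care of splitting, scheduling and combining.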


Distributed processing is non-trivial


• How to assign tasks to different workers in an efficient way?
• What happens if tasks fail?
• How do workers exchange results?
• How to synchronize distributed tasks allocated to different
workers?
• Big Data storage is challenging
– Data Volumes are massive
– Reliability of Storing PBs of data is challenging
– All kinds of failures: Disk/Hardware/Network Failures
– Probability of failures simply increase with the number of
machines …


Google MapReduce
• MapReduce is:
– a programming model;
– implementation for processing and generating big data sets;
• Released in 2004 by Google (Jeffrey Dean and Sanjay
Ghemawat)
• Performs simple computations while hiding the details of:
– Parallelization;
– Data distribution;
– Load balancing;
– I/O scheduling;
– Fault tolerance;
• Distributes data over multiple nodes
– Each node receives a small portion of data

[Diagram: a large dataset is split into chunks, processed in parallel by MAP tasks, and combined by REDUCE into the results.]


Google MapReduce (2)


• MapReduce programming model:
– Input & Output: each a set of key/value pairs
• The programmer must specify two functions:
– map (in_key, in_value) -> list(out_key, intermediate_value)
• Processes an input key/value pair;
• Produces a set of intermediate pairs;
– reduce (out_key, list(intermediate_value)) -> list(out_value)
• Combines all intermediate values for a particular key;
• Produces a set of merged output values;

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // output_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));


MapReduce Architecture


MapReduce Data Flow


MapReduce Patterns
• Counting and Summing - log analysis, data querying:
– Calculate the total number of occurrences of each word in a large set of documents;
– Process a log file:
• each record contains a response time;
• it is required to calculate an average response time (a sketch of this pattern follows after this list).
• Collating – ETL, building of inverted indexes:
– Save all items that have the same value of a function into one file
• Filtering (“Grepping”), Parsing, and Validation - log analysis, data
querying, ETL, data validation:
– Collect all records that meet some condition;
– Transform each record (independently from other records) into another
representation
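A sketch of the counting-and-summing pattern for the average-response-time example above, written in the same old Hadoop API style as the Word Count code later in this deck (the log record format and class names are illustrative assumptions, not part of the original slides):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class AverageResponseTime {
  public static class LogMap extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {
    public void map(LongWritable key, Text value,
        OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
      // assumed record format: "<url> <responseTimeMs>"
      String[] fields = value.toString().split("\\s+");
      output.collect(new Text(fields[0]), new LongWritable(Long.parseLong(fields[1])));
    }
  }

  public static class AvgReduce extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, DoubleWritable> {
    public void reduce(Text key, Iterator<LongWritable> values,
        OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
      long sum = 0, count = 0;
      while (values.hasNext()) { sum += values.next().get(); count++; }
      output.collect(key, new DoubleWritable((double) sum / count)); // average per key
    }
  }
}

Note that, unlike plain counting, this averaging reducer cannot also be used as a combiner, because an average of averages is not the overall average.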


MapReduce Patterns (2)


• Distributed Task Execution – physics and engineering simulations,
numerical analysis, performance testing:
– A large computation is divided into multiple parts;
– Results from all parts are combined together to obtain a final result.
• Sorting - ETL, data analysis:
– Sort large set of records by some rule;
– Process records in a certain order.
• Iterative Message Passing (Graph Processing) - graph analysis,
web indexing:
– Calculate a state of each entity in a network on the basis of properties of
the other entities in its neighborhood;
– State can represent :
• distance to other nodes;
• indication that there is a neighbor with certain properties;
• characteristic of neighborhood density;
• etc.


MapReduce Patterns (3)


• Distinct Values (Unique Items Counting) – log analysis, unique users
counting
– Set of records that contain fields F and G;
– Count the total number of unique values of field F for each subset of records that
have the same G (grouped by G).
• Cross-Correlation -text analysis, market analysis:
– A set of tuples of items ;
– Calculate a number of tuples where some items co-occur for each possible pair.
• Relational MapReduce Patterns:
– Selection
– Projection
– Union
– Intersection
– Difference
– GroupBy and Aggregation
– Joining


Algorithms and use cases


• PageRank and Mapper-Side Data Aggregation
– Used by Google to calculate relevance of a web page as a function of
authoritativeness (PageRank) of pages that have links to this page.
– the algorithm is complex
– at its core it is just a propagation of weights between nodes
• each node calculates its weight as a mean of the incoming weights:
class N
    State is PageRank
    method getMessage(object N)
        return N.State / N.OutgoingRelations.size()
    method calculateState(state s, data [d1, d2,...])
        return ( sum([d1, d2,...]) )

class Mapper
    method Initialize
        H = new AssociativeArray
    method Map(id n, object N)
        p = N.PageRank / N.OutgoingRelations.size()
        Emit(id n, object N)
        for all id m in N.OutgoingRelations do
            H{m} = H{m} + p
    method Close
        for all id n in H do
            Emit(id n, value H{n})

class Reducer
    method Reduce(id m, [s1, s2,...])
        M = null
        p = 0
        for all s in [s1, s2,...] do
            if IsObject(s) then
                M = s
            else
                p = p + s
        M.PageRank = p
        Emit(id m, item M)

Word Count Example


• Mapper
– Input: value: lines of text of input
– Output: key: word, value: 1
• Reducer
– Input: key: word, value: set of counts
– Output: key: word, value: sum
• Launching program
– Defines this job
– Submits job to cluster

Owen O’Malley (Yahoo!)



Word Count Mapper


public static class Map extends MapReduceBase implements
    Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}


Word Count Reducer


public static class Reduce extends MapReduceBase implements
    Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}


Putting it all together


JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);

conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);

conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);

FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));

JobClient.runJob(conf);
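Assuming the classes above are packaged into a jar (the jar name and the input/output paths below are placeholders, not from the slides), the job would typically be launched with:

hadoop jar wordcount.jar WordCount /user/bigdata/input /user/bigdata/output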


Apache Hadoop

Programming environment and use cases


* Based on slides by Dhruba Borthakur, Owen O’Malley (Yahoo!)


The history of an elephant


• Hadoop is an open-source implementation based on Google
File System (GFS) and MapReduce from Google
• Hadoop was created by Doug Cutting and Mike Cafarella in
2005
• Hadoop is an Apache open source project from 2006
• Why Hadoop?
– Need to process huge datasets on large clusters of computers
– Very expensive to build reliability into each application
– Nodes fail every day
• Failure is expected, rather than exceptional
• The number of nodes in a cluster is not constant
– Need a common infrastructure
• Efficient, reliable, easy to use


Apache Hadoop Basic Modules

[Diagram: the basic Hadoop modules - a computation layer (MapReduce) and a storage layer (HDFS).]


Distributed File System


• Single Namespace for entire cluster
• Data Coherency
– Write-once-read-many access model
– Client can only append to existing files
• Files are broken up into blocks
– Typically 64MB block size
– Each block replicated on multiple Data Nodes
• Intelligent Client
– Client can find location of blocks
– Client accesses data directly from Data Node


Hadoop HDFS
• Hadoop distributed File System (based on Google File System (GFS)
paper, 2004)
– Serves as the distributed file system for most tools in the Hadoop
ecosystem
– Scalability for large data sets
– Reliability to cope with hardware failures
• HDFS good for:
– Large files
– Streaming data
• Not good for:
– Lots of small files
– Random access to files
– Low latency access


Design of HDFS
• Master-Slave design
– Master Node: a single NameNode for managing metadata
– Slave Nodes: multiple DataNodes for storing data
– Other: a Secondary NameNode as a backup
• HDFS files are divided into blocks
– The block is the basic unit of read/write
– Default size is 64MB, could be larger (128MB)
– Hence makes HDFS good for storing larger files
• HDFS blocks are replicated multiple times
– One block is stored at multiple locations, also at different racks (usually 3 times)
– This makes HDFS storage fault tolerant and faster to read

[Diagram: a Client talks to the NameNode (with a Secondary NameNode as backup) and to the DataNodes. The NameNode keeps the metadata - the name, location and directory; the DataNodes provide storage for blocks of data and exchange heartbeats, commands and data with the NameNode.]


HDFS Architecture


HDFS Inside: Blocks


• Reasons:
– File can be larger than a single disk
– Block is of fixed size, easy to manage and manipulate
– Easy to replicate and do more fine grained load balancing
• HDFS Block size is by default 64 MB (larger than regular file
system block)
– Minimize overhead: disk seek time is almost constant
– Example: seek time: 10 ms, file transfer rate: 100 MB/s, target overhead (seek time / block transfer time) of 1%; what is the block size?
– The block transfer time must be 10 ms / 0.01 = 1 s, so the block size is 1 s x 100 MB/s = 100 MB (HDFS rounds up to 128 MB)


NameNode Metadata
• Metadata in Memory
– The entire metadata is in main memory
– No demand paging of metadata
• Types of metadata
– List of files
– List of Blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time, replication factor
• A Transaction Log
– Records file creations, file deletions etc.


DataNode
• A Block Server
– Stores data in the local file system (e.g. ext3)
– Stores metadata of a block (e.g. CRC)
– Serves data and metadata to Clients
• Block Report
– Periodically sends a report of all existing blocks to the NameNode
• Facilitates Pipelining of Data
– Forwards data to other specified DataNodes


Block Placement and Heartbeats


• Current Strategy
– One replica on local node
– Second replica on a remote rack
– Third replica on same remote rack
– Additional replicas are randomly placed
• Clients read from nearest replicas
• Would like to make this policy pluggable
• DataNodes send heartbeats to the NameNode
– Once every 3 seconds
• NameNode uses heartbeats to detect DataNode failure


Fault tolerance support in HDFS


• Replication of blocks for fault tolerance

[Diagram: a file divided into blocks B1-B4, with each block replicated on several different nodes.]


HDFS Inside: Name Node


The Name Node keeps a snapshot of the FS and an edit log that records changes to the FS:

Filename   Replication factor   Block IDs
File 1     3                    [1, 2, 3]
File 2     2                    [4, 5, 6]
File 3     1                    [7, 8]

[Diagram: the Data Nodes each store a subset of the blocks listed above.]


HDFS Inside: Read


• Prevent the NN from being the bottleneck of the cluster
• Allow HDFS to scale to a large number of concurrent clients
• Spread the data traffic across the cluster

[Diagram: the Client, the Name Node and the data nodes DN1 ... DNn, with the numbered steps below.]

1. Client connects to NN to read data
2. NN tells client where to find the data blocks
3. Client reads blocks directly from data nodes (without going through NN)
4. In case of node failures, client connects to another node that serves the missing block


HDFS Inside: Write

Tradeoffs: Reliability, Write Bandwidth and Read Bandwidth

[Diagram: the Client, the Name Node and the data nodes DN1 ... DNn, with the numbered steps below.]

1. Client connects to NN to write data
2. NN tells the client which data nodes to write to
3. Client writes blocks directly to data nodes with the desired replication factor
4. In case of node failures, NN will figure it out and replicate the missing blocks


Example of HDFS shell commands


Create a directory in HDFS
• hadoop fs -mkdir /user/bigdata/dir1

List the content of a directory


• hadoop fs -ls /user/bigdata

Upload and download a file in HDFS


• hadoop fs -put /home/bigdata/file.txt /user/bigdata/datadir/
• hadoop fs -get /user/bigdata/datadir/file.txt /home/

Look at the content of a file


• hadoop fs -cat /user/bigdata/datadir/book.txt

Many more commands, similar to Unix
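The same operations are also available programmatically through the HDFS Java API; a minimal sketch (the paths are the same placeholders used above, and a configured Hadoop client is assumed):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    fs.mkdirs(new Path("/user/bigdata/dir1"));                          // like: hadoop fs -mkdir
    fs.copyFromLocalFile(new Path("/home/bigdata/file.txt"),            // like: hadoop fs -put
                         new Path("/user/bigdata/datadir/"));
    for (FileStatus status : fs.listStatus(new Path("/user/bigdata")))  // like: hadoop fs -ls
      System.out.println(status.getPath());

    fs.close();
  }
}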


Who uses Hadoop?


• Amazon/A9
• Facebook
• Google
• New York Times
• Veoh
• Yahoo!
• …. many more

Hadoop Cluster at Yahoo! (Credit: Yahoo)


A real world example of New York Times


• Goal: Make entire archive of articles available online: 11
million, from 1851
• Task: Translate 4 TB TIFF images to PDF files
• Solution: Used Amazon Elastic Compute Cloud (EC2) and
Simple Storage System (S3)
• Time: < 24 hours
• Costs: ~ $240


Hadoop Ecosystem
• Pig: High-level language for data analysis (Yahoo! Research)
• Hive: SQL-like Query language and Metastore (Facebook)
• HBase: Table storage for semi-structured data (Modeled on
Google’s Bigtable)
• Zookeeper: Coordinating distributed applications
• Mahout: Machine learning


In-Memory Cluster Computing

Apache Spark


Overview of LAMBDA Architecture


• The Lambda Architecture has three major components:
– Batch layer provides the following functionality
• managing the master dataset, an immutable, append-only set of raw data
• pre-computing arbitrary query functions, called batch views.
– Serving layer
• indexes the batch views so that they can be queried ad hoc with low latency.
– Speed layer
• accommodates all requests that are
subject to low latency requirements.
• use fast and incremental algorithms
• deals with recent data only.



Apache Spark
• Spark is a general-purpose computing framework for iterative
tasks
• API is provided for Java, Scala and Python
• The model is based on MapReduce
– enhanced with new operations and an engine that supports
execution graphs
• Tools include Spark SQL, MLlib for machine learning, GraphX for graph processing and Spark Streaming
• Unifies batch, streaming, and interactive computing
• Makes it easy to build sophisticated applications


Key idea in Spark


• Resilient distributed datasets (RDDs)
– Immutable collections of objects across a cluster
– Built with parallel transformations (map, filter, …)
– Automatically rebuilt when failure is detected
– Allow persistence to be controlled (in-memory operation)
• Transformations on RDDs
– Lazy operations to build RDDs from other RDDs
– Always creates a new RDD
• Actions on RDDs
– Count, collect, save
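A minimal Java sketch of these ideas (the file path and the filter condition are illustrative placeholders): transformations are lazy, and only the final action triggers execution.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("RddSketch");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> lines = sc.textFile("hdfs:///user/bigdata/datadir/book.txt"); // placeholder path
    JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR"));        // transformation: lazy
    errors.cache();                                   // persistence is controlled explicitly (in-memory)

    long numErrors = errors.count();                  // action: triggers the actual computation
    System.out.println(numErrors);
    sc.close();
  }
}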


Spark overview


YARN


MESOS Architecture

Apache Mesos is an open source cluster manager that handles workloads in a


distributed environment through dynamic resource sharing and isolation.


Task Scheduler
• Supports general task graphs
• Pipelines functions where possible
• Cache-aware data reuse & locality
• Partitioning-aware to avoid shuffles
• MESOS provides resource
allocation (offer resources to
framework, accept/reject by
framework scheduler)


Implementing Spark Algorithms


• Broadcast everything
– Master broadcasts data and initial models
– At each iteration updated models are broadcast by master (driver
program)
– Does not scale well due to communication overhead
• Data parallel
– Worker loads data
– Master broadcasts initial models
– At each iteration updated models are broadcast by master
– Works for large datasets, because data is available to workers
• Fully parallel
– Workers load data and they instantiate the models
– At each iteration, models are shared via join between workers
– Much better scalability
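A sketch of the "broadcast the model" idea in Java Spark (the data and the model are tiny toy values, not from the slides): the driver broadcasts the current model and each worker reads its local copy.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setMaster("local[*]").setAppName("BroadcastSketch"));

    JavaRDD<double[]> points = sc.parallelize(Arrays.asList(
        new double[]{1, 2, 3}, new double[]{4, 5, 6}));     // toy "dataset"
    double[] model = {0.5, 0.5, 0.5};                       // toy "model"

    Broadcast<double[]> bcModel = sc.broadcast(model);      // the driver ships the model to every worker
    JavaRDD<Double> scores = points.map(p -> {
      double[] m = bcModel.value();                         // each worker reads its local broadcast copy
      double s = 0;
      for (int i = 0; i < m.length; i++) s += m[i] * p[i];
      return s;
    });

    scores.collect().forEach(System.out::println);
    sc.close();
  }
}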

Machine Learning and Spark


• Spark RDDs support efficient data sharing
• In-memory caching increases performance
– Reported to have performance of up to 100 times faster than
Hadoop in memory or 10 times faster on disk
• High-level programming interface for complex algorithms


MLBase and MLlib


• MLBase has been designed for simplifying the development of
machine learning pipelines:
– MLlib is a machine learning library
– MLI (ML Developer API) is an API for machine learning
development that aims to abstract low-level details from the
developers
– MLOpt is a declarative layer that aims to automate the machine
learning pipeline
• The idea is that the system searches feature extractors and models
best fit for the ML task


Graph-Parallel Systems
• Graph-based computation depends only on the neighbors of a
particular vertex
– “Think like a Vertex.” – Pregel (SIGMOD 2010)
• Systems with specialized APIs to simplify graph processing
– Pregel from Google
• Push abstraction: Vertex programs interact by sending messages
• Receive msgs, process, send msgs
• GraphLab
– Pull abstraction: Vertex programs access adjacent vertices and
edges
– Foreach (j in neighbours) calculate pagerank total for j


GraphX
• Separation of system support for each view (table, graph) involves
expensive data movement and duplication
• GraphX makes tables and graphs views of the same physical data
• The views have
their own optimized
semantics
– Table operators
inherited from Spark
– Graph operators
form relational algebra


Performance Gains for PageRank


Revisited
• Spark is reported to be 4x faster than Hadoop
• Graphlab is 16x faster than Spark
• GraphX is roughly 3x slower than Graphlab
• GraphX is reported to compare favourably to Graphlab with
pipelines (raw -> hyperlink -> pagerank -> top 20)
• Graph structure can be exploited for significant performance
gains


Spark Streaming
• Spark extension for accepting and processing high-throughput live data streams
• Data is accepted from various sources
– Kafka, Flume, TCP sockets, Twitter, …
• Machine learning algorithms and graph processing algorithms can be
applied for the streams
• Similar systems
– Twitter (Storm),
Google (MillWheel),
Yahoo! (S4)
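A minimal Java sketch of a Spark Streaming word count over a TCP socket (the Spark 2.x style API is assumed; the host, port and batch interval are placeholders):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamingSketch {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("StreamingSketch");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1)); // 1-second micro-batches

    JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
    JavaPairDStream<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum);

    counts.print();          // per-batch word counts
    jssc.start();
    jssc.awaitTermination();
  }
}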


Stream Processing: Discretized


• Streaming computation: a series of very small deterministic batch jobs
– Live stream is divided into batches of x seconds
– Each batch of data is an RDD and RDD operations can be used
– Results are also returned in batches
– Batch size as low as 0.5 seconds results in approximately one second latency
• Can combine streaming and batch processing (e.g. click stream processing)

Event vs. Processing Time
• There's a difference between event time (te) and processing time (tp)
• Events arrive out-of-order even during normal operation
• Events may arrive arbitrarily late
• Apply a grace period before processing events
• Allow arbitrary update windows of metrics

K – Architecture
• A distributed data operating system is emerging
– Supported by YARN and MESOS
• Various data services on top of this (Hadoop and Spark)
• Some services are being integrated (for example in Spark) for better
coherence and performance
• Important points
– Data format (row/column, block size)
– Network topology and data/code placement
– Algorithm structure and coordination
– Scheduling and resource management
• Big Data Frameworks are evolving
– Spark represents unification of streaming, machine learning and graphs
• Big Data pipeline management is at an early stage
– How to achieve better mapping between cluster resources, scheduling, and
pipelines?


Frameworks for real-time data stream analytics

Apache Storm


Apache Storm Architecture


• Nimbus:
– Distributes code around the cluster
– Assigns tasks to machines/supervisors
– Failure monitoring
– Is fail-fast and stateless
• Supervisor:
– Listens for work assigned to its machine
– Starts and stops worker processes
– Shuts down worker processes
• Topology: processes messages forever
– A running topology consists of many worker processes spread across many machines (graph of computation)
• ZooKeeper:
– Handles cluster coordination
– Monitors worker node status
– Helps the Supervisors interact with the Nimbus

[Diagram: Nimbus coordinates several Supervisors through a ZooKeeper cluster.]

Apache Storm
• Simple programming model
– Topology - computation graph that run forever
• node in a topology contains processing logic
• links between nodes indicate how data should be passed around
between nodes
– Spouts - source of streams
– Bolts - consumes input streams
• Programming language agnostic
– ( Clojure, Java, Ruby, Python default )
• Fault-tolerant
• Horizontally scalable
– Ex: 1,000,000 messages per second on a 10 node cluster
• Guaranteed message processing
• Fast : Uses zeromq message queue
• Local Mode : Easy unit testing

Storm Stream and Spouts


• Stream
– Unbounded sequence of tuples (the Storm data model)
• <key, value(s)> pair, e.g. <"UIUC", 5>

• Spouts
– Source of streams, e.g. the Twitter firehose API
– Stream of tweets or some crawler
– have interfaces that you implement to run application-specific
logic


Storm Bolts
• Bolts:
– Process (one or more) input stream and produce new streams
• Functions
– Filter, Join, Apply/Transform etc
– Parallelize to make it fast! – multiple processes constitute a
bolt


Storm Topology & Grouping


• Topology:
– Graph of computation – can have cycles
– Network of Spouts and Bolts
– Spouts and bolts execute as many tasks across the
cluster
• Grouping
– How to send tuples between the components / tasks?
• Shuffle Grouping
– Distribute streams “randomly” to bolt’s tasks
• Fields Grouping
– Group a stream by a subset of its fields
• All Grouping
– All tasks of bolt receive all input tuples useful for joins
• Global Grouping
– The entire stream goes to a single one of the bolt's tasks
– Picks the task with the lowest id
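A sketch of how spouts, bolts and the groupings above are wired together in Java (the spout and bolt classes are hypothetical placeholders; the org.apache.storm packages of recent Storm releases are assumed):

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class TopologySketch {
  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();

    builder.setSpout("tweets", new TweetSpout(), 2);         // hypothetical spout
    builder.setBolt("split", new SplitSentenceBolt(), 4)     // hypothetical bolt
           .shuffleGrouping("tweets");                       // tuples distributed "randomly" to the bolt's tasks
    builder.setBolt("count", new WordCountBolt(), 4)         // hypothetical bolt
           .fieldsGrouping("split", new Fields("word"));     // stream grouped by the "word" field

    LocalCluster cluster = new LocalCluster();               // Local Mode: easy testing
    cluster.submitTopology("word-count", new Config(), builder.createTopology());
  }
}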

Zookeeper
• Open source server for highly reliable distributed coordination.
• As a replicated synchronization service with eventual
consistency;
• Features:
– Robust
– Persistent data replicated across multiple nodes
– Master node for writes
– Concurrent reads
– Comprises a tree of znodes, - entities roughly representing file
system nodes.
– Use only for saving small configuration data.


Storm Message Processing


• Guaranteed Message Processing:
– Message is "fully processed" when the tuple tree has been
exhausted and every message in the tree has been processed
– A tuple is considered failed when its tree of messages fails to be
fully processed within a specified timeout.


Fault Tolerance APIS


• Emit(tuple, output)
– Emits an output tuple, perhaps anchored on an input tuple (first
argument)
• Ack(tuple)
– Acknowledge that you (bolt) finished processing a tuple
• Fail(tuple)
– Immediately fail the spout tuple at the root of the tuple topology if there is an exception from the database, etc.
• Must remember to ack/fail each tuple
– Each tuple consumes memory. Failure to do so results in memory
leaks.
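A sketch of a bolt that uses these APIs, anchoring each emitted tuple on its input and then acking it (Storm 2.x style signatures are assumed; the class is illustrative, not from the slides):

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class SplitSentenceBolt extends BaseRichBolt {
  private OutputCollector collector;

  public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
    this.collector = collector;
  }

  public void execute(Tuple input) {
    for (String word : input.getString(0).split(" ")) {
      collector.emit(input, new Values(word));  // anchored emit: the new tuple joins the input's tuple tree
    }
    collector.ack(input);                       // every tuple must be acked (or failed), otherwise memory leaks
  }

  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("word"));
  }
}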


Storm Fault Tolerance


• Anchoring
– Specify link in the tuple tree.
• ( anchor an output to one or more input tuples.)
– At the time of emitting new tuple
– Replay one or more tuples.
• “acker” tasks
– Track the DAG of tuples for every spout
– Every tuple (spout/bolt) is given a random 64-bit id
– Every tuple knows the ids of all spout tuples for which it exists.
• How?
– Every individual tuple must be acked.
– If not, the task will run out of memory!
– Filter bolts ack at the end of execution
– Join/Aggregation bolts use multi-ack.

Failure Handling
• A tuple isn't acked because the task died: Spout tuple ids at the root of the
trees for the failed tuple will time out and be replayed.
• Acker task dies: All the spout tuples the acker was tracking will time out and be
replayed.
• Spout task dies: The source that the spout talks to is responsible for replaying the
messages.
• For example, queues like Kestrel and RabbitMQ will place all pending messages back on
the queue when a client disconnects.
• Major breakthrough : Tracking algorithm
– Storm uses mod hashing to map a spout tuple id to an acker task.
• Acker task: Stores a map from a spout tuple id to a pair of values.
• Task id that created the spout tuple
• Second value is 64bit number : Ack Val
• XOR all tuple ids that have been created/acked in the tree.
– Tuple tree completed when Ack Val = 0
• Configuring Reliability
– Set Config.TOPOLOGY_ACKERS to 0.
– You can emit tuples as unanchored tuples.


Failure Scenario
• Let's take a scenario:
– Count aggregation of your stream
– Store the running count in a database. Increment the count after processing a tuple.
– Failure!
• Design:
– Tuples are processed as small batches.
– Each batch of tuples is given a unique id called the "transaction id" (txid).
– If the batch is replayed, it is given the exact same txid.
– State updates are ordered among batches.
• Example: processing the batch ["man"], ["man"], ["dog"] with txid = 3.
– Database state before: man => [count=3, txid=1], dog => [count=4, txid=3], apple => [count=10, txid=2]
– If the stored txid is the same as the current one: SKIP (strong ordering).
– If they're different, you increment the count and update the txid.
– Database state after: man => [count=5, txid=3], dog => [count=4, txid=3], apple => [count=10, txid=2]


Storm Future Improvements


• Lax security policies
• Performance and scalability improvements
– Presently, SLAs that require processing more than a million records per second are achieved with just 20 nodes.
• High Availability (HA) Nimbus
– Though Nimbus is presently not a single point of failure, its failure does degrade functionality.
• Enhanced tooling and language support


Machine Learning for Big Data

The Flink model


Based on http://linc.ucy.ac.cy/file/Talks/talks/DeepAnalysiswithApacheFlink_2nd_cloud_workshop.pdf


Data Processing and Machine learning Methods

• Data processing (third trend)
– Traditional ETL (extract, transform, load)
– Data Stores (HBase, ...)
– Tools for processing of streaming, multimedia & batch data
• Machine Learning (fourth trend)
– Classification
– Regression
– Clustering
– Collaborative filtering

Working at the intersection of these four trends (Big Datasets, Distributed Computing, Data Processing/ETL, Machine Learning) is very exciting and challenging and requires new ways to store and process Big Data.


Apache Flink
• Massive parallel data flow engine with unified batch- and stream-
processing
– Batch and Stream APIs on top of a streaming engine

• Performance and ease of use


– Exploits in-memory processing and pipelining, language-embedded
logical APIs
• A runtime that "just works" without tuning
– custom memory management inside the JVM
• Predictable and dependable execution
– Bird’s-eye view of what runs and how, and what failed and why


Flink System Stack

• Rich set of operators
– Map, Reduce, Join, CoGroup, Union, Iterate, Delta Iterate, Filter, FlatMap, GroupReduce, Project, Aggregate, Distinct, Vertex-Update, Accumulators


Flink K-Means Clustering


• Cluster analysis in data mining
• Partitions n observations into k clusters
• Assign points to the cluster with the smallest (Euclidean) distance
• Input:
– Set of observations X = {x1, x2, …, xn}
– Number of clusters K
– Convergence criterion ξ
• Result:
– Set of clusters C= {c1, c2, …, ck}

Init: select k data points as cluster centroids C = {c1, c2, …, ck}

Compute:
do
  foreach x in X
    assign x to the cluster with the closest centroid
  recompute the centroid of each cluster
while ξ is not reached (or fixed number of iterations)


K-Means in Flink
// initialize
// points: n observations, centroids: initial k centroids
val cntrds = centroids.iterate(10) { currCntrds =>
  val newCntrds = points
    .map(findNearestCntrd).withBroadcastSet(currCntrds, "cntrds")
    .map( (c, p) => (c, p, 1L) )
    .groupBy(0).reduce( (x, y) => (x._1, x._2 + y._2, x._3 + y._3) )
    .map( x => Centroid(x._1, x._2 / x._3) )
  newCntrds
}

{1, 3, 5, 7}.reduce { (x, y) => x + y }


Flink machine learning library


val featureExtractor = HashingFT()
val factorizer = ALS()

val pipeline = featureExtractor.chain(factorizer)

val clickstreamDS =
  env.readCsvFile[(String, String, Int)](clickStreamData)

val parameters = ParameterMap()
  .add(HashingFT.NumFeatures, 1000000)
  .add(ALS.Iterations, 10)
  .add(ALS.NumFactors, 50)
  .add(ALS.Lambda, 1.5)

val factorization = pipeline.fit(clickstreamDS, parameters)

Currently available algorithms: Classification, Logistic Regression, Clustering, Recommendation (ALS - alternating least squares)


Ethical Issues on Big Data

Rules and Regulations, Data Economy, Business Models


Big Data Ethics


• “Big Data” Revolution comparable with Industrial Revolution
– all kinds of human activities and decisions are beginning to be
influenced by big data predictions
• dating,
• shopping,
• medicine,
• education,
• voting,
• law enforcement, terrorism prevention, and cybersecurity
• Privacy
– Privacy as Information Rules
– Shared Private Information Can Remain Confidential
– Transparency
• Identity


Big Data Ethics Principles


• Beneficial
– it should deliver value to all concerned parties
• Progressive
– deliver better and more valuable results
– minimize data usage to promote more sustainable and less risky
analysis
• Sustainable
– provide value that is sustainable over a reasonable time frame
• Data sustainability
• Algorithmic sustainability
• Device- and/or manufacturer-based sustainability
• Respectful
• Fair


Key issues for Ethics in Big Data


• Privacy
• Transparency
• Confidentiality
• Identity protection



GDPR- General Data Protection Regulation


• A piece of legislation passed by the European Union in April of 2016
• Purpose was to give consumers better control of their personal data as it’s
collected by businesses
• It sets limitations on what companies can do with that data and how long they
can keep it.
• Companies outside of the EU must still comply with the GDPR if they want to do
business with customers in the EU
• Companies can collect just the minimum amount of data from customers
needed to conduct business with the latter.
• Furthermore, businesses will be held accountable for the use and storage of
that data.
• Businesses must designate a Data Protection Officer (DPO) whose function is
to oversee details such as GDPR compliance and data security strategy.
• If there’s a breach, the DPO must report to affected individuals and the
regulators within 72 hours of the incident.
• Penalties for non-compliance are tough. A business that violates GDPR laws
will pay fines of up to four percent of annual global turnover, or 20 million Euros,
whichever is greater.

How GDPR Will Affect Data Collection


• One way GDPR legislation will affect data collection is that it will lead to an increased reliance on real-time analytics
• Real-time analytics takes data that has just been collected and puts it
to immediate use and analysis.
• With collected data getting an immediate turnaround, there is no need
for keeping said data around for any great length of time, which is one
of the issues that the GDPR seeks to address.
• Social media, an avenue that many businesses use for the purpose of
building customer loyalty and increasing engagement, will also be
affected.
• Furthermore, all of that Big Data being collected will not only have to be stored securely, but
• Will also need to be retrievable by customers who want to remove it and switch it over to another vendor.
• Businesses of all sizes will need to come to terms with the idea that
customers will gain greater control over their own personal data.

Case Studies
• Trump election
• Target pregnancy ad
• Brexit campaign
• Ted Cruz campaign surge
• UK Tory election campaign
• Obama campaign
• Air France 447 crash
• Compas system


Case studies - Trump election


– Michal Kosinski in 2008 was accepted by Cambridge University to
do his PhD at the Psychometrics Centre
– Kosinski joined fellow student David Stillwell about a year after
Stillwell had launched a little Facebook application
“MyPersonality” app.
• enabled users to fill out different psychometric questionnaires,
including a handful of psychological questions from the Big Five
personality questionnaire
• Based on the evaluation, users received a "personality profile"—
individual Big Five values—and could opt-in to share their Facebook
profile data with the researchers
• Kosinski had expected a few dozen college friends to fill in the
questionnaire, but before long, hundreds, thousands, then millions of
people had revealed their innermost convictions.
• Suddenly, the two doctoral candidates owned the largest dataset
combining psychometric scores with Facebook profiles ever to be
collected


Case studies - Trump election(1)


– The approach of Kosinski and his colleagues:
• First, they provided test subjects with a questionnaire in the form of an online quiz.
• From their responses, the psychologists calculated the personal Big Five values of
respondents
– openness (how open you are to new experiences?), conscientiousness (how much
of a perfectionist are you?), extroversion (how sociable are you?), agreeableness
(how considerate and cooperative you are?) and neuroticism (are you easily upset?)
– we can make a relatively accurate assessment of the kind of person in front of us. This includes
their needs and fears, and how they are likely to behave
• Kosinski's team then compared the results with all sorts of other online data
from the subjects: what they "liked," shared or posted on Facebook, or what
gender, age, place of residence they specified, for example.
• Remarkably reliable deductions could be drawn from simple online actions:
– men who "liked" the cosmetics brand MAC were slightly more likely to be gay;
– one of the best indicators for heterosexuality was "liking" Wu-Tang Clan
– Followers of Lady Gaga were most probably extroverts, while those who "liked"
philosophy tended to be introverts
• While each piece of such information is too weak to produce a reliable
prediction, when tens, hundreds, or thousands of individual data points are
combined, the resulting predictions become really accurate


Case studies - Trump election(2)


– Kosinski and his team tirelessly refined their models.
– In 2012, Kosinski proved that on the basis of an average of 68 Facebook "likes" by
a user, it was possible to predict their skin color (with 95 percent accuracy), their
sexual orientation (88 percent accuracy), and their affiliation to the Democratic or
Republican party (85 percent).
– But it didn't stop there. Intelligence, religious affiliation, as well as alcohol, cigarette
and drug use, could all be determined.
– From the data it was even possible to deduce whether someone's parents were
divorced.
– The strength of their modeling was illustrated by how well it could predict a
subject's answers.
– Kosinski continued to work on the models incessantly: before long, he was able to
evaluate a person better than the average work colleague, merely on the basis of
ten Facebook "likes."
• Seventy "likes" were enough to outdo what a person's friends knew,
• 150 what their parents knew, and
• 300 "likes" what their partner knew.
• More "likes" could even surpass what a person thought they knew about themselves. On
the day that Kosinski published these findings, he received two phone calls. The
threat of a lawsuit and a job offer. Both from Facebook.


Case studies - Trump election (3)


• Only weeks later Facebook "likes" became private by default.
• Before that, the default setting was that anyone on the internet
could see your "likes."
• But this was no obstacle to data collectors: while Kosinski
always asked for the consent of Facebook users, many apps
and online quizzes today require access to private data as a
precondition for taking personality tests.
• Anybody who wants to evaluate themselves based on their
Facebook "likes" can do so on Kosinski's website, and then
compare their results to those of a classic Ocean
questionnaire, like that of the Cambridge Psychometrics
Center.


Case studies - Trump election (4)


• But it was not just about "likes" or even Facebook: Kosinski
and his team could now ascribe Big Five values based purely
on how many profile pictures a person has on Facebook, or
how many contacts they have (a good indicator of
extraversion).
• But we also reveal something about ourselves even when
we're not online.
• For example, the motion sensor on our phone reveals how
quickly we move and how far we travel (this correlates with
emotional instability).
• Our smartphone, Kosinski concluded, is a vast
psychological questionnaire that we are constantly filling
out, both consciously and unconsciously.


• Thank you!
