Data Science & Ethics
Lecture 2
Data Gathering: Privacy
Prof. David Martens
david.martens@uantwerp.be
Data Science & Ethics
▪ Understand the ethical aspects of data science
▪ Crucial for business, large and small
▪ Data science has impact
• Costs and benefits for businesses
• Decisions on humans
• More than making calls to predefined Python libraries
AI ethics in the news
Data Science & Ethics
▪ AI ethics in the news
▪ Privacy and GDPR
▪ Encryption and hashing
▪ CIA
▪ Backdoors
Privacy
Silicon Valley episode “Facial Recognition”: COO Jared makes an analogy to the Manure Crisis of
1894 during an interview with Bloomberg TV.
https://www.hbo.com/silicon-valley/season-5/5-facial-recognition
Privacy
▪ Personal data
• Much out there
• Much can be predicted
• Once personal data is shared online, hard to make private
again
▪ Solutions
• Awareness
• Regulations
• Technology: Silicon Valley episode “Facial Recognition”
Privacy
▪ Cambridge Analytica (2013-2018)
• Harvested 50 million Facebook profiles without consent, used
for political targeting
• Public profile, page likes, birthday, current city
• App: thisisyourdigitallife
➢ Also data on user’s Facebook friends were sent
• Facebook's policies did not allow this
• Later certified to Facebook that “the data was destroyed”
Privacy
▪ Modern-day Panoptes: so what?
• Security cameras
• Facebook, Internet as a whole
• Behavioral data
▪ Privacy is a human right
• “No one shall be subjected to arbitrary interference with his privacy,
family, home or correspondence, nor to attacks upon his honour and
reputation.”
The United Nations' 1948 Universal Declaration of Human Rights (Article 12)
▪ “You have zero privacy anyway. Get over it.”
Scott McNealy, chairman of Sun Microsystems, quoted in Sprenger (1999-01-26)
▪ “Who has nothing to hide, has nothing to fear”
• What: e.g. sexual activity, financial situation, physical
appearance, your entire internet history
• To whom: e.g. Facebook, fellow students, professor, (ex-)partners
• Reverses the argument of “the burden of proof is on the one
who declares, not on one who denies”
• Edward Snowden: “Arguing that you don't care about the right
to privacy because you have nothing to hide is no different than
saying you don't care about free speech because you have
nothing to say.”
• Often a discussion in the wake of terrible events: a continuum
GDPR
▪ General Data Protection Regulation
• May 25th, 2018
• Privacy and data protection of European citizens, also
applicable to non-European companies.
• Goal: harmonize and bring European laws up to speed
• “world’s most robust data protection rules”
• Fines up to €20 million or 4% of worldwide annual turnover, whichever is higher
GDPR key concepts
▪ Personal data
• “any information relating to an individual, whether it relates to his or
her private, professional or public life. It can be anything from a
name, a home address, a photo, an email address, bank details,
posts on social networking websites, medical information, or a
computer's IP address.”
▪ Anonymisation
• Not mentioned in GDPR
• Data that can not be brought back to an individual (not re-identifiable)
▪ Pseudonymisation
• “means the processing of personal data in such a manner that the
personal data can no longer be attributed to a specific data subject
without the use of additional information, provided that such
additional information is kept separately and is subject to technical
and organisational measures to ensure that the personal data are
not attributed to an identified or identifiable natural person.”
**Anonymization** and **pseudonymization** are two distinct approaches used to protect privacy in data handling, particularly in compliance with regulations like the General Data Protection Regulation (GDPR) in the EU. Here’s a breakdown of each and their differences:

### Anonymization
- **Definition**: Anonymization is the process of transforming data in such a way that individuals cannot be identified, either directly or indirectly, by any party. Once data is anonymized, it is no longer considered personal data and is generally exempt from privacy regulations like the GDPR.
- **Characteristics**:
  - **Irreversibility**: The process removes or alters identifiable elements permanently, making it nearly impossible to re-identify individuals.
  - **Examples**: Removing or generalizing details such as names, specific dates, or exact geographic locations.
- **Use Cases**: Anonymized data is often used in research, statistical analysis, and open data projects, where insights are valuable but individual identities are not needed.
- **Privacy Level**: Very high, as the data cannot be traced back to any individual.
- **Limitations**: True anonymization is challenging, and small details can sometimes still lead to re-identification, especially with advanced algorithms and supplementary data.

### Pseudonymization
- **Definition**: Pseudonymization replaces private identifiers with artificial identifiers or pseudonyms. However, the link between the pseudonym and the individual's identity is retained and can be restored with additional information.
- **Characteristics**:
  - **Reversibility**: The original data can be retrieved if necessary, with the right "key" or linking information.
  - **Examples**: Replacing names with unique IDs, keeping a separate database that maps these pseudonyms to actual identities.
- **Use Cases**: Common in scenarios where data still needs to be associated with individuals, like in medical research or employee performance tracking, but privacy must be protected.
- **Privacy Level**: Medium, as data can potentially be re-identified if the key or additional data is accessed.
- **Limitations**: It is less secure than anonymization because it retains identifiable connections.
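The two approaches above can be contrasted in a small sketch. The record fields (`name`, `age`, `postcode`), the `P-` pseudonym prefix and the function names are assumptions made for illustration only. Generalising a few fields is just one anonymisation technique and does not by itself guarantee that re-identification is impossible.

```python
import secrets

def pseudonymise(records, id_field="name"):
    """Replace the identifier by a random pseudonym.

    Returns the pseudonymised records and a key table (pseudonym -> identity)
    that would have to be stored separately and protected.
    """
    key_table = {}   # pseudonym -> real identity (the "additional information")
    assigned = {}    # real identity -> pseudonym, so one person keeps one pseudonym
    result = []
    for rec in records:
        real = rec[id_field]
        if real not in assigned:
            pseudo = "P-" + secrets.token_hex(4)
            assigned[real] = pseudo
            key_table[pseudo] = real
        result.append({**rec, id_field: assigned[real]})
    return result, key_table

def generalise(records):
    """Drop the name and coarsen age and postcode (hypothetical field names)."""
    result = []
    for rec in records:
        decade = (rec["age"] // 10) * 10
        result.append({
            "age_band": f"{decade}-{decade + 9}",
            "region": rec["postcode"][:2] + "**",
        })
    return result

records = [
    {"name": "Alice", "age": 34, "postcode": "2000"},
    {"name": "Bob", "age": 47, "postcode": "2018"},
    {"name": "Alice", "age": 34, "postcode": "2000"},
]
pseudo_records, key_table = pseudonymise(records)
anon_records = generalise(records)
```

Whoever holds `key_table` can restore the identities, which is why the pseudonymised records still count as personal data. The generalised records carry no such table.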
GDPR key concepts
▪ Personal, anonymous or pseudonymous?
• “David Martens”
• Encrypting name
• Hashing name
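A sketch that may help when thinking about the "hashing name" case: a plain hash is deterministic, so anyone holding a list of candidate names can recompute the hashes and match them. The candidate list and the secret key below are hypothetical; this is an illustration, not a statement of how any regulator classifies hashed data.

```python
import hashlib
import hmac

def sha256_hex(name):
    """Plain, unkeyed SHA-256 hash of a name."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()

def keyed_hash_hex(name, secret_key):
    """HMAC-SHA256 of a name; cannot be recomputed without the secret key."""
    return hmac.new(secret_key, name.encode("utf-8"), hashlib.sha256).hexdigest()

def dictionary_attack(target_hash, candidate_names):
    """Re-identify a plain hash by hashing every candidate name."""
    for candidate in candidate_names:
        if sha256_hex(candidate) == target_hash:
            return candidate
    return None

hashed = sha256_hex("David Martens")
guess = dictionary_attack(hashed, ["Alice Example", "David Martens", "Bob Example"])
keyed = keyed_hash_hex("David Martens", b"hypothetical-secret-key")
```

Here `guess` recovers the name from the plain hash. With the keyed variant, the same attack only works for someone who also holds the key, which is the "additional information kept separately" in the pseudonymisation definition.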
GDPR key concepts
▪ Do we need personal identifiers?
▪ Privacy (Barocas and Nissenbaum, 2014) as reachable: “the possibility of
knocking on your door, hauling you out of bed, calling your phone
number, threatening you with sanction, holding you accountable - with
or without access to identifying information."
▪ Data science allows predictions without the need for identity
• Predict personal characteristics, pregnancy, product interest, etc.
• Privacy remains a subject to actively think about
• Privacy of data subject vs. model applicant
• “anonymity is not an escape from the ethical debates that researchers should
be having about their obligations not only to their data subjects, but also to
others who might be affected by their studies for precisely the reasons they
have chosen to anonymize their data subjects” (Barocas and Nissenbaum, 2014)
GDPR
▪ When does GDPR allow processing of personal data?
(Article 6)
1. unambiguous consent of the data subject,
2. to fulfill a contract to which the data subject is party,
3. compliance with a legal obligation,
4. protection of vital interests of the data subjects,
5. performance of a task carried out in the public interest,
6. legitimate interest (subject to a balancing act between the
data subject's rights and the interests of the controller)
▪ Complex, e.g. unambiguous consent:
• Understand how consent plays out.
• Complex ad tech system for example.
• What to report?
GDPR
▪ Complex, e.g. legitimate interest for a pizza place
• Coupons through the postal services: opt-out
• Targeted advertising, using external data from supermarket,
with impact on price: informed consent
• Sell data to insurance company to adapt health premiums: no
• Balancing between what a reasonable person would find
acceptable and what the potential impact is: continuum!
GDPR
▪ Article 5:
• 5.1: (Good) principles relating to processing of personal data, each shown with an example case
➢ (a) Lawfulness, fairness and transparency: La Liga
➢ (b) Purpose limitation: Belgian mayor
➢ (c) Data minimisation: Payment service provider
➢ (d) Accuracy: Hungarian bank
➢ (e) Storage limitation: Danish taxi company
➢ (f) Integrity and confidentiality: Portuguese hospital
• 5.2: The controller shall be responsible for, and be able to
demonstrate compliance with, paragraph 1
Discussion Case
CIA
▪ Secure storage and communication of data & information
• CIA/IAC Triad
➢ Confidentiality: data only available to authorized entities
➢ Integrity: maintaining and assuring the accuracy of the data
➢ Availability: data available when needed
▪ Encryption and hashing key technologies
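As a small illustration of how hashing can support the integrity leg of the triad, the sketch below stores a SHA-256 fingerprint of some data and later checks whether the data still matches it. The message contents and function names are made up for the example. A plain fingerprint only helps if the fingerprint itself is kept out of an attacker's reach.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """SHA-256 fingerprint of the data, as a hex string."""
    return hashlib.sha256(data).hexdigest()

def is_unchanged(data: bytes, expected_fingerprint: str) -> bool:
    """True if the data still matches the fingerprint stored earlier."""
    return fingerprint(data) == expected_fingerprint

original = b"amount=100;to=Bob"
stored = fingerprint(original)       # kept somewhere safe when the data is written
tampered = b"amount=900;to=Bob"      # a single changed character
```

`is_unchanged(original, stored)` holds, while `is_unchanged(tampered, stored)` does not.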
Encryption
▪ Encode a message or information in such a way that only
authorized persons can access it.
▪ Historically
• Caesar shift cipher:
➢ Replacing letter by a letter some number down the alphabet
➢ Julius Caesar used 3-right shift cipher: ETHICS → HWKLFV
• Spartans used Scytale:
➢ Need a rod of given diameter
• Ancient Greeks:
➢ Shave the head of the messenger, write the message on the head, let the
hair grow back, send the messenger
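The Caesar shift cipher from this slide can be sketched in a few lines. The function name is an assumption, and only the 26 unaccented Latin letters are shifted; everything else is passed through unchanged.

```python
def caesar_shift(text, shift):
    """Shift every ASCII letter `shift` places down the alphabet (wrapping around)."""
    result = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            result.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            result.append(ch)   # spaces, digits and punctuation are left as-is
    return "".join(result)

encrypted = caesar_shift("ETHICS", 3)      # the 3-right shift from the slide
decrypted = caesar_shift(encrypted, -3)    # shifting back recovers the message
```

Decryption is the same operation with the opposite shift, which is what makes the shift value the shared secret.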
Encryption
▪ Historically
• Enigma
➢ Electro-mechanical machine used by Germans in WW II
➢ State defined by setting of rotors and plugs
➢ Each typed letter changes the state of the machine and outputs
some other letter
➢ Only if two machines start in the same state will the same letters
be output
➢ Initial state written on a secret sheet (on the U-boat and at HQ)
➢ 10^16 possible states, a different one chosen every day
➢ + random initial word to start with
▪ Not random ➔ uniform distribution of letters lost and machine could
be reverse engineered
https://www.youtube.com/watch?v=Fg85ggZSHMw
Symmetric encryption
▪ One key used for encryption and decryption
[Figure: Alice encrypts the plain text “Hello Bob” with the secret key (1), the cipher text “Sà!3Lksd(” is sent to Bob (2), and Bob decrypts it with the same secret key (3). Eve intercepts only the cipher text.]
Symmetric encryption
▪ One key used for encryption and decryption
• Caesar cipher: 3-shift right
• Weakness: frequency of letters and starting/ending words
(“Dear”, “Yours sincerely”, etc.) or brute force attack
[Figure: the same scheme with the 3-shift Caesar cipher: “Hello Bob” becomes the cipher text “Khoor Ere”, which Eve intercepts.]
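The weakness named on this slide can be made concrete: with only 26 possible shifts, Eve can try them all and keep the candidates that contain an expected word such as “Dear” or “Yours sincerely”. The message text and function names below are assumptions for illustration.

```python
def shift_text(text, shift):
    """Caesar shift of the ASCII letters in `text`."""
    result = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            result.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            result.append(ch)
    return "".join(result)

def brute_force(ciphertext, cribs=("dear", "yours sincerely")):
    """Try all 26 shifts; return (shift, candidate) pairs that contain a crib."""
    hits = []
    for shift in range(26):
        candidate = shift_text(ciphertext, -shift)
        if any(crib in candidate.lower() for crib in cribs):
            hits.append((shift, candidate))
    return hits

message = "Dear Bob, the meeting is at noon. Yours sincerely, Alice"
intercepted = shift_text(message, 3)   # what Eve sees on the wire
hits = brute_force(intercepted)
```

The list `hits` contains the pair with shift 3 and the original message, found without any knowledge of the key.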
Symmetric encryption
▪ DES: Data Encryption Standard
• One of the first major standards in symmetric-key encryption
• 56-bit key
• 2^56 ≈ 7 x 10^16 possible keys
• Flaw: key space too small, a brute force attack can find the key
▪ AES: Advanced Encryption Standard
• By Belgians Vincent Rijmen and Joan Daemen (1998)
• 128, 192 or 256 bit keys
• 2^128 ≈ 3 x 10^38 possible keys, considered safe in current age
• Adopted as the new standard by NIST in 2001
▪ Challenges
• How to share keys: either insecure or with extra overhead
• How to manage keys: if u users need to communicate with one
another → need for (u-1) + (u-2) + … + 1 = u x (u-1) / 2 keys to be
shared before communicating
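The numbers on this slide can be checked with a few lines of arithmetic. The attacker speed used in the usage note (10^12 keys per second) is a purely hypothetical figure chosen to show the order of magnitude, not a measured benchmark.

```python
def symmetric_keys_needed(users):
    """Number of pairwise secret keys for `users` parties: u x (u-1) / 2."""
    return users * (users - 1) // 2

def years_to_try_all(key_bits, keys_per_second):
    """Years needed to try every key of the given length at the given speed."""
    seconds = 2 ** key_bits / keys_per_second
    return seconds / (365 * 24 * 3600)

des_keys = 2 ** 56        # about 7 x 10^16
aes128_keys = 2 ** 128    # about 3 x 10^38
```

At the assumed 10^12 keys per second, `years_to_try_all(56, 1e12)` is well under a year, while `years_to_try_all(128, 1e12)` is on the order of 10^19 years. For key management, `symmetric_keys_needed(100)` already gives 4950 keys.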
Asymmetric encryption
▪ Two keys: public and private key
• Public key: revealed to the world
• Private key: kept secret at one party
[Figure: Alice encrypts the plain text “Hello Bob” with Bob’s public key (1), the cipher text “Sà!3Lksd(” is sent to Bob (2), and Bob decrypts it with his private key (3). Eve intercepts only the cipher text.]
Asymmetric Encryption
▪ RSA: Rivest, Shamir, Adleman (1977; patented 1983)
• Popular algorithm for asymmetric encryption
• Principle:
➢ Multiplying two large numbers is easy and fast
➢ Decomposing a large number into prime numbers: very difficult
➢ For example: 19 x 13 = ? Decompose 391 into 2 prime numbers?
➢ If the numbers are large enough: no efficient (non-quantum) integer
factorisation algorithm is known.
• Given p and q large prime numbers
➢ n = p x q, make this the public key
➢ Encrypt message m, using some number e, also made public
▪ e chosen relatively prime to (p-1) x (q-1), e.g. 35
▪ c = m^e mod n
➢ Decrypt c: only possible for whoever knows the private key (p and q)
▪ d chosen such that d x e mod ( (p-1) x (q-1) ) = 1
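The asymmetry behind RSA (multiplying is easy, factoring is hard) can be felt even with the small numbers from this slide. The sketch below uses naive trial division, which is an assumption made for simplicity; real factoring algorithms are far more sophisticated, but still not efficient for numbers of cryptographic size.

```python
def factor_semiprime(n):
    """Trial division: return (p, q) with p <= q and p * q == n, or None if n is prime."""
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            return divisor, n // divisor
        divisor += 1
    return None

product = 19 * 13                  # one multiplication
factors = factor_semiprime(391)    # many trial divisions
```

The multiplication is a single step, while the loop has to test divisor after divisor. The work grows roughly with the square root of n, so it becomes hopeless once n has hundreds of digits.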
Asymmetric Encryption
▪ RSA: decryption
• m = c^d mod n, with d chosen such that d x e = 1 mod [ (p-1) x (q-1) ]
▪ Advantages
• Sharing keys: only public ones need to be shared (no need for secrecy)
• Manage keys: share only u keys among u users
▪ Disadvantage
• Takes more time than symmetric encryption
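The RSA steps from these slides can be put together in a toy sketch. The primes 61 and 53, the exponent 17 and the message 65 are a common textbook choice and an assumption here; real keys use primes of over a thousand bits plus padding, so this is not a usable implementation. `pow(e, -1, phi)` needs Python 3.8 or later.

```python
from math import gcd

def make_keys(p, q, e):
    """Return the public key (n, e) and the private exponent d."""
    n = p * q
    phi = (p - 1) * (q - 1)
    if gcd(e, phi) != 1:
        raise ValueError("e must be relatively prime to (p-1) x (q-1)")
    d = pow(e, -1, phi)   # modular inverse: d x e = 1 mod phi
    return (n, e), d

def encrypt(m, public_key):
    """c = m^e mod n, using only public information."""
    n, e = public_key
    return pow(m, e, n)

def decrypt(c, d, n):
    """m = c^d mod n, which requires the private exponent d."""
    return pow(c, d, n)

public_key, d = make_keys(61, 53, 17)
c = encrypt(65, public_key)
m = decrypt(c, d, public_key[0])
```

Anyone can call `encrypt` with the public key, but computing `d` requires (p-1) x (q-1), and therefore the factors of n.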
Asymmetric Encryption
▪ RSA: choosing d
• d x e = 1 mod [ (p-1) x (q-1) ], so d x e = [ k x (p-1) x (q-1) ] + 1 for some integer k
• Can only calculate this if you know the decomposition of n into the prime factors p and q
▪ Example, with e = 5
• p = 7, q = 3 ➔ n = ?
• Message is the letter “l” => 12th letter in the alphabet: so m = 12
• c = ? [3]
• For a k = 2, d = ?
• m=?
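One way to check your answers to this exercise is to let Python do the modular arithmetic. The sketch assumes e = 5, the value suggested on the slide, and k = 2 as in the question.

```python
p, q = 7, 3
n = p * q                       # public modulus
phi = (p - 1) * (q - 1)         # (p-1) x (q-1)
e = 5                           # public exponent (assumed, relatively prime to phi)
m = 12                          # the letter "l", 12th in the alphabet

c = pow(m, e, n)                # encryption: c = m^e mod n

k = 2
assert (k * phi + 1) % e == 0   # k = 2 gives a whole number for d
d = (k * phi + 1) // e          # from d x e = k x (p-1) x (q-1) + 1

recovered = pow(c, d, n)        # decryption: m = c^d mod n
```

Printing `n`, `c`, `d` and `recovered` gives the four values asked for on the slide.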
Encryption for data protection
▪ Online communication between server and client
• Initially asymmetric encryption to agree on secret key
• Subsequently symmetric encryption with agreed secret key
➢ 1. Client: generate random number r, encrypt r with public key of server to c
➢ 2. Client sends c to the server
➢ 3. Server: decrypt c with own private key to r
➢ 4. All subsequent communication using fast symmetric encryption, using secret key r
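The four steps can be traced in a toy sketch. Everything here is an assumption for illustration: the server key uses tiny textbook primes, and the "symmetric cipher" is a simple XOR with a SHA-256-derived keystream. Real protocols such as TLS use vetted algorithms and much larger keys.

```python
import hashlib
import secrets

# Server's toy RSA key pair (tiny primes, NOT secure).
P, Q, E = 61, 53, 17
N = P * Q                              # public, together with E
D = pow(E, -1, (P - 1) * (Q - 1))      # private, known only to the server

def keystream_xor(data: bytes, r: int) -> bytes:
    """Toy symmetric cipher: XOR data with a keystream derived from the secret r."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(f"{r}:{counter}".encode()).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

# Step 1: the client generates a random r and encrypts it with the server's public key.
r = secrets.randbelow(N - 2) + 2
c = pow(r, E, N)
# Step 2: c travels over the network (an eavesdropper may see it).
# Step 3: the server decrypts c with its private key and obtains r.
r_server = pow(c, D, N)
# Step 4: both sides now share r and switch to the fast symmetric cipher.
ciphertext = keystream_xor(b"Hello Bob", r)
plaintext = keystream_xor(ciphertext, r_server)
```

The slow asymmetric operation is used once, to transport r. All later messages use the symmetric operation, which is the design choice described on the slide.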
▪ Used in SSL and TLS protocols, widely used online
• Public key infrastructure: 3rd party Certificate Authority (CA),
such as Comodo, Let’s Encrypt
• https:// indicates this type of encryption working in the
background
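In Python, the standard `ssl` module exposes this machinery. The sketch below only builds a client-side context and inspects its settings, so it runs without a network connection. Requiring TLS 1.2 as a minimum is an assumption made for the example, not a recommendation from the slides.

```python
import ssl

def make_client_context():
    """Client-side TLS context that verifies the server certificate against trusted CAs."""
    context = ssl.create_default_context()
    # Assumed policy for this sketch: refuse protocol versions older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context

context = make_client_context()
certificate_checked = context.verify_mode == ssl.CERT_REQUIRED
hostname_checked = context.check_hostname
```

To use it, a connected socket would be wrapped with `context.wrap_socket(sock, server_hostname=...)`, where the hostname is that of the server being contacted. The certificate check is where the Certificate Authority comes in.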
Encryption for data protection
▪ Whenever you need to communicate online:
• Use TLS protocol
• TLS use is even a factor in search engine ranking
▪ Whenever you store personal data:
• Encrypt the data
• Beyond USBs, laptops, PCs and smartphones: bikes, cars, etc.
• Cars:
➢ Personal data: address, routes, contacts, etc.
➢ FTC (US): advises clearing out personal data before selling a car
➢ Tesla:
▪ Personal data is hidden when the keys are given to a valet (valet mode)
▪ Revealed that some data is stored unencrypted
▪ 2018: som
https://www.cnbc.com/2019/03/29/tesla-model-3-keeps-data-like-crash-videos-location-phone-contacts.html
Presentation and Paper Ideas
▪ Who cares about privacy, and what aspects?
▪ Cambridge Analytica: Who (countries, parties, persons)
made use of what data and services?
▪ Recent fines on GDPR, and lessons learnt
▪ Examples of use cases where legitimate interest is used as
reason to process personal data
▪ Recent advances in encryption
▪ The history of encryption
▪ A comparison of privacy laws in Europe, US and China
▪ Challenges of the AI Act