Explain Privacy and Data Sensitivity Concepts
A detailed understanding of privacy and data sensitivity concepts will help you to operate within
an overall data governance team. Data security and privacy are areas where policy and
procedure are as important as technical controls in ensuring compliance. These policies and
procedures may also need to be expressed in agreements with external partners, suppliers, and
customers. As a security professional, you will need to select and apply these policies,
procedures, and agreements wisely.
PRIVACY AND SENSITIVE DATA CONCEPTS
The value of information assets can be thought of in terms of how a compromise of the data's
security attributes of the confidentiality, integrity, and availability (CIA) triad would impact the
organization. When surveying information within an organization, it is important not to solely judge
how secretly it might need to be kept, but how the data is used within workflows. For example, the
risk to confidentiality of public information is nonexistent. The risk to availability, however, could
have significant impacts on workflows.
Data must be kept securely within a processing and storage system that enforces CIA attributes.
In practice, this will mean a file or database management system that provides read or read/write
access to authorized and authenticated accounts or denies access otherwise (by being
encrypted, for instance). As distinct from this security requirement, you also need to consider the
impact of privacy in shaping data governance.
Privacy versus Security
While data security is important, privacy is an equally vital factor. Privacy is a data governance
requirement that arises when collecting and processing personal data. Personal data is any
information about an identifiable individual person, referred to as the data subject. Where data
security controls focus on the CIA attributes of the processing system, privacy requires policies to
identify private data, ensure that storage, processing, and retention are compliant with relevant
regulations, limit access to the private data to authorized persons only, and ensure the rights of
data subjects to review and remove any information held about them are met.
Information Life Cycle Management
An information life cycle model identifies discrete steps to assist security and privacy policy
design. Most models identify the following general stages:
       Creation/collection—data may be generated by an employee or automated system, or it
        may be submitted by a customer or supplier. At this stage, the data needs to be classified
        and tagged.
       Distribution/use—data is made available on a need to know basis for authorized uses by
        authenticated account holders and third parties.
       Retention—data might have to be kept in an archive for regulatory reasons, past the date
        when it is still in active use.
       Disposal—when it no longer needs to be used or retained, media storing data assets
        must be sanitized to remove any remnants.
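One way to picture these stages is as a small state machine. The following Python sketch is illustrative only: the Stage names mirror the list above, but the permitted transitions in ALLOWED are an assumption, since real models differ on whether data can skip the retention stage.

```python
from enum import Enum


class Stage(Enum):
    CREATION = "creation/collection"
    DISTRIBUTION = "distribution/use"
    RETENTION = "retention"
    DISPOSAL = "disposal"


# Assumed transitions: data in use may be archived or disposed of directly.
ALLOWED = {
    Stage.CREATION: {Stage.DISTRIBUTION},
    Stage.DISTRIBUTION: {Stage.RETENTION, Stage.DISPOSAL},
    Stage.RETENTION: {Stage.DISPOSAL},
    Stage.DISPOSAL: set(),
}


def advance(current: Stage, target: Stage) -> Stage:
    """Return the new stage, or raise if the move is not permitted."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

Rejecting any move out of the disposal stage reflects the idea that sanitized data cannot re-enter the life cycle.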
Information management is a massive task in any organization. Most schemes focus on
structured data (that is, information that is stored in a directory hierarchy and subject to
administrative access controls). Managing and classifying unstructured data (emails, chat
sessions, telephone calls, and so on) is an even more daunting task, though software
solutions designed to tackle this problem are available.
DATA ROLES AND RESPONSIBILITIES
A data governance policy describes the security controls that will be applied to protect data at
each stage of its life cycle. There are important institutional governance roles for oversight and
management of information assets within the life cycle:
       Data owner—a senior (executive) role with ultimate responsibility for maintaining the
        confidentiality, integrity, and availability of the information asset. The owner is responsible
        for labeling the asset (such as determining who should have access and determining the
        asset's criticality and sensitivity) and ensuring that it is protected with appropriate controls
        (access control, backup, retention, and so forth). The owner also typically selects a
        steward and custodian and directs their actions and sets the budget and resource
        allocation for sufficient controls.
       Data steward—this role is primarily responsible for data quality. This involves tasks such
        as ensuring data is labeled and identified with appropriate metadata and that data is
        collected and stored in a format and with values that comply with applicable laws and
        regulations.
       Data custodian—this role handles managing the system on which the data assets are
        stored. This includes responsibility for enforcing access control, encryption, and
        backup/recovery measures.
       Data Privacy Officer (DPO)—this role is responsible for oversight of any personally
        identifiable information (PII) assets managed by the company. The privacy officer ensures
        that the processing, disclosure, and retention of PII complies with legal and regulatory
        frameworks.
In the context of legislation and regulations protecting personal privacy, the following two
institutional roles are important:
       Data controller—the entity responsible for determining why and how data is stored,
        collected, and used and for ensuring that these purposes and means are lawful. The data
        controller has ultimate responsibility for privacy breaches, and is not permitted to transfer
        that responsibility.
       Data processor—an entity engaged by the data controller to assist with technical
        collection, storage, or analysis tasks. A data processor follows the instructions of a data
        controller with regard to collection or processing.
Data controller and processor tend to be organizational roles rather than individual ones. For
example, if Widget.foo collects personal data to operate a webstore on its own cloud, it is a data
controller and data processor. If Widget.foo passes aggregate data to Grommet.foo asking them to
run profitability analytics for different customer segments on its AI-backed cloud, Grommet.foo is
a data processor acting under the instruction of Widget.foo. Within the Grommet.foo and
Widget.foo companies, the data owner might take personal responsibility for the lawful
performance of data controller and processor functions.
DATA CLASSIFICATIONS
Data classification and typing schemas tag data assets so that they can be managed through
the information life cycle. A data classification schema is a decision tree for applying one or more
tags or labels to each data asset. Many data classification schemas are based on the degree of
confidentiality required:
       Public (unclassified)—there are no restrictions on viewing the data. Public information
        presents no risk to an organization if it is disclosed but does present a risk if it is modified
        or not available.
       Confidential (secret)—the information is highly sensitive, for viewing only by approved
        persons within the owner organization, and possibly by trusted third parties under NDA.
       Critical (top secret)—the information is too valuable to allow any risk of its capture.
        Viewing is severely restricted.
Another type of classification schema identifies the kind of information asset:
       Proprietary—proprietary information or intellectual property (IP) is information created
        and owned by the company, typically about the products or services that they make or
        perform. IP is an obvious target for a company's competitors, and IP in some industries
        (such as defense or energy) is of interest to foreign governments. IP may also represent a
        counterfeiting opportunity (movies, music, and books, for instance).
       Private/personal data—information that relates to an individual identity.
       Sensitive—this label is usually used in the context of personal data. It refers to
        privacy-sensitive information about a subject that could harm them if made public and
        could prejudice decisions made about them if referred to by internal procedures. As
        defined by the EU's General Data Protection Regulation (GDPR), sensitive personal data
        includes religious beliefs, political opinions, trade union membership, sexual
        orientation, racial or ethnic origin, genetic data, biometric data, and health information.
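The decision-tree idea can be sketched in a few lines of Python. This is a minimal illustration, not a standard schema: the Asset fields, the label names, and the rule that sensitive personal data is always treated as critical are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    is_company_ip: bool = False
    contains_personal_data: bool = False
    contains_sensitive_data: bool = False   # e.g. health or religious data
    approved_for_publication: bool = False


def classify(asset: Asset) -> set:
    """Walk a small decision tree and return the labels for the asset."""
    tags = set()
    # First branch: what kind of information is it?
    if asset.is_company_ip:
        tags.add("proprietary")
    if asset.contains_personal_data:
        tags.add("private")
    if asset.contains_sensitive_data:
        tags.add("sensitive")
    # Second branch: what degree of confidentiality applies?
    if asset.approved_for_publication and not tags:
        tags.add("public")
    elif "sensitive" in tags:
        tags.add("critical")        # assumed rule for this sketch
    else:
        tags.add("confidential")    # default when nothing says otherwise
    return tags
```

Defaulting to confidential rather than public means an unlabeled asset is protected until someone decides otherwise.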
DATA TYPES
A type schema applies a more detailed label to data than simple classification.
Personally Identifiable Information (PII)
Personally identifiable information (PII) is data that can be used to identify, contact, or locate
an individual. A Social Security Number (SSN) is a good example of PII. Others include name,
date of birth, email address, telephone number, street address, biometric data, and so on. Some
bits of information, such as an SSN, may be unique; others uniquely identify an individual in
combination (for example, full name with birth date and street address).
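To see how fields that are not unique on their own can identify someone in combination, consider the following Python sketch. The function name and the sample field names used with it are hypothetical; the technique is simply counting how many records share each combination of values.

```python
from collections import Counter


def uniquely_identified(records, fields):
    """Return the records whose values for `fields` match no other record."""
    keys = [tuple(r.get(f) for f in fields) for r in records]
    counts = Counter(keys)
    return [r for r, k in zip(records, keys) if counts[k] == 1]
```

Running it against a data set with a single field and then with two or three fields shows how quickly the number of uniquely identified subjects grows as fields are combined.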
Some types of information may be PII depending on the context. For example, when someone
browses the web using a static IP address, the IP address is PII. An address that is dynamically
assigned by the ISP may not be considered PII. PII is often used for password reset mechanisms
and to confirm identity over the telephone. For example, PII may be defined as responses to
challenge questions, such as "What is your favorite color/pet/movie?" These are the sort of
complexities that must be considered when laws are introduced to control the collection and
storage of personal data.
Customer Data
Customer data can be institutional information, but also personal information about the
customer's employees, such as sales and technical support contacts. This personal customer
data should be treated as PII. Institutional information might be shared under a nondisclosure
agreement (NDA), placing contractual obligations on storing and processing it securely.
Health Information
Personal health information (PHI)—or protected health information—refers to medical and
insurance records, plus associated hospital and laboratory test results. PHI may be associated
with a specific person or used as an anonymized or deidentified data set for analysis and
research. An anonymized data set is one where the identifying data is removed completely. A
deidentified set contains codes that allow the subject information to be reconstructed by the data
provider.
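The difference between the two kinds of data set can be sketched as follows. This is a simplified illustration under stated assumptions: the function names and the subject_code field are invented for the example, and a real scheme would also have to consider indirect identifiers left in the remaining fields.

```python
import secrets


def anonymize(records, identifying_fields):
    """Return copies of the records with identifying fields removed completely."""
    return [{k: v for k, v in r.items() if k not in identifying_fields}
            for r in records]


def deidentify(records, identifying_fields):
    """Replace identifying fields with a random code.

    Returns the coded data set plus a lookup table. The lookup table stays
    with the data provider and is never shared with the analyst.
    """
    lookup, coded = {}, []
    for r in records:
        code = secrets.token_hex(8)
        lookup[code] = {k: r[k] for k in identifying_fields if k in r}
        stripped = {k: v for k, v in r.items() if k not in identifying_fields}
        stripped["subject_code"] = code
        coded.append(stripped)
    return coded, lookup


def reidentify(record, lookup):
    """Reconstruct the original record; only the data provider can do this."""
    restored = {k: v for k, v in record.items() if k != "subject_code"}
    restored.update(lookup[record["subject_code"]])
    return restored
```

With the anonymized set there is no lookup table, so nothing can be reconstructed; with the deidentified set, reconstruction depends entirely on who holds the table.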
PHI trades at high values on the black market, making it an attractive target. Criminals seek to
exploit the data for insurance fraud or possibly to blackmail victims. PHI data is extremely
sensitive and its loss has a permanent effect. Unlike a credit card number or bank account
number, it cannot be changed. Consequently, the reputational damage that would be caused by a
PHI data breach is huge.
Financial Information
Financial information refers to data held about bank and investment accounts, plus information
such as payroll and tax returns. Payment card information comprises the card number, expiry
date, and the three-digit card verification value (CVV). Cards are also associated with a PIN, but
this should never be transmitted to or handled by the merchant. Abuse of the card may also
require the holder's name and the address the card is registered to. The Payment Card Industry
Data Security Standard (PCI DSS) defines the safe handling and storage of this information.
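As a rough illustration of safe handling, the sketch below truncates the card number to its last four digits and discards the CVV and PIN before a record is kept. The field names and helper functions are hypothetical, and the sketch is not a statement of what PCI DSS requires in any given case; the standard itself should be consulted for that.

```python
# Fields assumed for this sketch to be values that must not be retained.
DISCARD_FIELDS = {"cvv", "pin"}


def mask_pan(pan: str) -> str:
    """Keep only the last four digits of a card number, masking the rest."""
    digits = "".join(ch for ch in pan if ch.isdigit())
    if len(digits) < 13:
        raise ValueError("card number is too short")
    return "*" * (len(digits) - 4) + digits[-4:]


def storable_record(card: dict) -> dict:
    """Return a copy of the card record that omits CVV/PIN and masks the number."""
    record = {k: v for k, v in card.items() if k not in DISCARD_FIELDS}
    record["pan"] = mask_pan(card["pan"])
    return record
```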
Government Data
Internally, government agencies have complex data collection and processing requirements. In
the US, federal laws place certain requirements on institutions that collect and process data about
citizens and taxpayers. This data may be shared with companies for analysis under strict
agreements to preserve security and privacy.
PRIVACY NOTICES AND DATA RETENTION
Data owners should be aware of any legal or regulatory issues that impact collection and
processing of personal data. The right to privacy, as enacted by regulations such as the EU's
General Data Protection Regulation (GDPR), means that personal data cannot be collected,
processed, or retained without the individual's informed consent. GDPR gives data subjects rights
to withdraw consent, and to inspect, amend, or erase data held about them.
Privacy Notices
Informed consent means that the data must be collected and processed only for the stated
purpose, and that purpose must be clearly described to the user in plain language, not legalese.
This consent statement is referred to as a privacy notice. Data collected under that consent
statement cannot then be used for any other purpose. For example, if you collect an email
address for use as an account ID, you may not send marketing messages to that email address
without obtaining separate consent for that discrete purpose. Purpose limitation will also restrict
your ability to transfer data to third parties.
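Purpose limitation can be enforced in software by recording which purposes each data subject has consented to and checking that record before any processing takes place. The following Python is a minimal sketch; the ConsentRecord class and the purpose names are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ConsentRecord:
    subject_id: str
    purposes: set = field(default_factory=set)

    def grant(self, purpose: str) -> None:
        self.purposes.add(purpose)

    def withdraw(self, purpose: str) -> None:
        # Data subjects can withdraw consent at any time.
        self.purposes.discard(purpose)


def may_process(consent: ConsentRecord, purpose: str) -> bool:
    """Allow processing only for a purpose the subject has consented to."""
    return purpose in consent.purposes
```

In the email address example above, a record holding only the account ID purpose would cause a marketing job to be refused until separate consent is granted.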
Impact Assessments
Tracking consent statements and keeping data usage in compliance with the consent granted is a
significant management task. In organizations that process large amounts of personal data,
technical tools that perform tagging and cross-referencing of personal data records will be
required. A data protection impact assessment is a process designed to identify the risks of
collecting and processing personal data in the context of a business workflow or project and to
identify mechanisms that mitigate those risks.
Data Retention
Data retention refers to backing up and archiving information assets in order to comply with
business policies and/or applicable laws and regulations. To meet compliance and e-discovery
requirements, organizations may be legally bound to retain certain types of data for a specified
period. This type of requirement will particularly affect financial data and security log data.
Conversely, storage limitation principles in privacy legislation may prevent you from retaining
personal data for longer than is necessary. This can complicate the inclusion of PII in backups
and archives.
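The tension between a minimum retention period and a storage limitation can be expressed as a pair of bounds per data category. In the sketch below, the category names and the periods in RULES are hypothetical placeholders, not legal requirements.

```python
from datetime import date

# (minimum days the data must be kept, maximum days it may be kept or None)
# Placeholder values for illustration only.
RULES = {
    "financial_record": (2555, None),
    "security_log": (365, 730),
    "marketing_contact": (0, 365),
}


def retention_action(category: str, collected_on: date, today: date) -> str:
    """Decide whether a record must be kept, may be kept, or should be disposed of."""
    min_days, max_days = RULES[category]
    age = (today - collected_on).days
    if age < min_days:
        return "must_retain"
    if max_days is not None and age > max_days:
        return "dispose"
    return "may_retain"
```

A backup job could call a check like this for each record category to decide whether PII can remain in an archive.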
DATA SOVEREIGNTY AND GEOGRAPHICAL CONSIDERATIONS
Some states and nations may respect data privacy more or less than others; and likewise, some
nations may disapprove of the nature and content of certain data. They may even be suspicious
of security measures such as encryption. When your data is stored or transmitted in other
jurisdictions, or when you collect data from citizens in other states or other countries, you may not
"own" the data in the same way as you'd expect or like to.
Data Sovereignty
Data sovereignty refers to a jurisdiction preventing or restricting processing and storage from
taking place on systems that do not physically reside within that jurisdiction. Data sovereignty may
demand certain concessions on your part, such as using location-specific storage facilities in a
cloud service.
For example, GDPR protections are extended to any EU citizen while they are within EU or EEA
(European Economic Area) borders. Data subjects can consent to allow a transfer but there must
be a meaningful option for them to refuse consent. If the transfer destination jurisdiction does not
provide adequate privacy regulations (to a level comparable to GDPR), then contractual
safeguards must be given to extend GDPR rights to the data subject. In the US, companies could
self-certify that the protections they offer are adequate under the Privacy Shield scheme, although
the Court of Justice of the European Union invalidated that scheme in July 2020.
Geographical Considerations
Geographic access requirements fall into two different scenarios:
       Storage locations might have to be carefully selected to mitigate data sovereignty issues.
        Most cloud providers allow choice of data centers for processing and storage, ensuring
        that information is not illegally transferred from a particular privacy jurisdiction
        without consent.
       Employees might need access from multiple geographic locations. Cloud-based file and
        database services can apply constraint-based access controls to validate the user's
        geographic location before authorizing access.
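A constraint-based access check of this kind might look like the following Python sketch. The resource names and country lists are hypothetical, and how the user's location is established (for example, by IP geolocation) is outside the scope of the example.

```python
# Hypothetical policy: ISO country codes from which each resource may be accessed.
ALLOWED_COUNTRIES = {
    "eu_customer_db": {"DE", "FR", "IE"},
    "public_catalog": None,  # None means no geographic restriction
}


def authorize(resource: str, authenticated: bool, country_code: str) -> bool:
    """Grant access only to authenticated users in a permitted location."""
    if not authenticated:
        return False
    if resource not in ALLOWED_COUNTRIES:
        return False  # default deny for resources with no policy
    allowed = ALLOWED_COUNTRIES[resource]
    return allowed is None or country_code in allowed
```

The location test is applied in addition to authentication, not instead of it.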
PRIVACY BREACHES AND DATA BREACHES
A data breach occurs when information is read, modified, or deleted without authorization.
"Read" in this sense can mean either seen by a person or transferred to a network or storage
media. A data breach is the loss of any type of data (but notably corporate information and
intellectual property), while a privacy breach refers specifically to loss or disclosure of personal
and sensitive data.
Organizational Consequences
A data or privacy breach can have severe organizational consequences:
       Reputation damage—data breaches cause widespread negative publicity, and customers
        are less likely to trust a company that cannot secure its information assets.
       Identity theft—if the breached data is exploited to perform identity theft, the data subject
        may be able to sue for damages.
       Fines—legislation might empower a regulator to levy fines. These can be fixed sum or in
        the most serious cases a percentage of turnover.
       IP theft—loss of company data can lead to loss of revenue. This typically occurs when
        copyright material—unreleased movies and music tracks—is breached. The loss of
        patents, designs, trade secrets, and so on to competitors or state actors can also cause
        commercial losses, especially in overseas markets where IP theft may be difficult to
        remedy through legal action.
Notifications of Breaches
The requirements for different types of breach are set out in law and/or in regulations. The
requirements indicate who must be notified. A data breach can mean the loss or theft of
information, the accidental disclosure of information, or the loss or damage of information. Note
that there are substantial risks from accidental breaches if effective procedures are not in place. If
a database administrator can run a query that shows unredacted credit card numbers, that is a
data breach, regardless of whether the query output ever leaves the database server.
Escalation
A breach may be detected by technical staff and if the event is considered minor, there may be a
temptation to remediate the system and take no further notification action. This could place the
company in legal jeopardy. Any breach of personal data and most breaches of IP should be
escalated to senior decision-makers and any impacts from legislation and regulation properly
considered.
Public Notification and Disclosure
Other than the regulator, notification might need to be made to law enforcement, individuals and
third-party companies affected by the breach, and publicly through press or social media
channels. For example, the Health Insurance Portability and Accountability Act (HIPAA) sets
out reporting requirements in legislation, requiring breach notification to the affected individuals,
the Secretary of the US Department of Health and Human Services, and, if more than 500
individuals are affected, to the media. The requirements also set out timescales for when these
parties should be notified. For example, under GDPR, notification must be made within 72 hours
of becoming aware of a breach of personal data. Regulations will also set out disclosure
requirements, or the information that must be provided to each of the affected parties. Disclosure
is likely to include a description of what information was breached, details for the main point-of-
contact, likely consequences arising from the breach, and measures taken to mitigate the breach.
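The timescales and recipient rules described above lend themselves to simple automation in an incident response tool. The sketch below encodes only the two figures given in the text (the 72-hour GDPR window and the HIPAA threshold of more than 500 individuals); the function and constant names are hypothetical, and the actual regulations contain further conditions not modeled here.

```python
from datetime import datetime, timedelta, timezone

GDPR_NOTIFICATION_WINDOW = timedelta(hours=72)
HIPAA_MEDIA_THRESHOLD = 500


def gdpr_deadline(aware_at: datetime) -> datetime:
    """Latest time for notification, counted from becoming aware of the breach."""
    return aware_at + GDPR_NOTIFICATION_WINDOW


def hours_remaining(aware_at: datetime, now: datetime) -> float:
    """Hours left before the deadline (negative once it has passed)."""
    return (gdpr_deadline(aware_at) - now).total_seconds() / 3600


def hipaa_recipients(affected_individuals: int) -> list:
    """Parties to notify, following the rule as summarized in the text."""
    recipients = ["affected individuals", "HHS Secretary"]
    if affected_individuals > HIPAA_MEDIA_THRESHOLD:
        recipients.append("media")
    return recipients
```

Using timezone-aware datetime values avoids ambiguity when the breach is detected in one region and reported in another.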
GDPR offers stronger protections than most federal and state laws in the US, which tend to focus
on industry-specific regulations, narrower definitions of personal data, and fewer rights and
protections for data subjects. The passage of the California Consumer Privacy Act (CCPA) has
changed the picture for domestic US legislation, however.
DATA SHARING AND PRIVACY TERMS OF AGREEMENT
It is important to remember that although one can outsource virtually any service or activity to a
third party, one cannot outsource legal accountability for these services or actions. You are
ultimately responsible for the services and actions that these third parties take. If they have any
access to your data or systems, any security breach in their organization (for example,
unauthorized data sharing) is effectively a breach in yours. Issues of security risk awareness,
shared duties, and contractual responsibilities can be set out in a formal legal agreement. The
following types of agreements are common:
       Service level agreement (SLA)—a contractual agreement setting out the detailed terms
        under which a service is provided. This can include terms for security access controls and
        risk assessments plus processing requirements for confidential and private data.
       Interconnection security agreement (ISA)—ISAs are defined by NIST's SP800-47
        "Security Guide for Interconnecting Information Technology Systems". Any federal
        agency interconnecting its IT system to a third party must create an ISA to govern the
        relationship. An ISA sets out a security risk awareness process and commits the agency
        and supplier to implementing security controls.
       Nondisclosure agreement (NDA)—legal basis for protecting information assets. NDAs are
        used between companies and employees, between companies and contractors, and
        between two companies. If the employee or contractor breaks this agreement and does
        share such information, they may face legal consequences. NDAs are useful because
        they deter employees and contractors from violating the trust that an employer places in
        them.
       Data sharing and use agreement—under privacy regulations such as GDPR or HIPAA,
        personal data can only be collected for a specific purpose. Data sets can be subject to
        pseudo-anonymization or deidentification to remove personal data, but there are risks of
        reidentification if combined with other data sources. A data sharing and use agreement is
        a legal means of preventing this risk. It can specify terms for the way a data set can be
        analyzed and proscribe the use of reidentification techniques.
Explain Privacy and Data Protection Controls
Policies and procedures are essential for effective data governance, but they can be supported
by technical controls too. As a security professional, you need to be aware of the capabilities of
data loss prevention (DLP) systems and privacy enhancing database controls, and how they can
be used to protect data anywhere it resides, on hosts, in email systems, or in the cloud.
DATA PROTECTION
Data stored within a trusted OS can be subject to authorization mechanisms where the OS
mediates access using some type of ACL. The presence of a trusted OS cannot always be
assumed, however. Other data protection mechanisms, notably encryption, can be used to
mitigate the risk that an authorization mechanism can be countermanded. When deploying a
cryptographic system to protect data assets, consideration must be given to all the ways that
information could potentially be intercepted. This means thinking beyond the simple concept of a
data file stored on a disk. Data can be described as being in one of three states:
      Data at rest—this state means that the data is in some sort of persistent storage media.
       Examples of types of data that may be at rest include financial information stored in
       databases, archived audiovisual media, operational policies and other management
       documents, system configuration data, and more. In this state, it is usually possible to
       encrypt the data, using techniques such as whole disk encryption, database encryption,
       and file- or folder-level encryption. It is also possible to apply permissions—access control
       lists (ACLs)—to ensure only authorized users can read or modify the data. ACLs can be
       applied only if access to the data is fully mediated through a trusted OS.
      Data in transit (or data in motion)—this is the state when data is transmitted over a
       network. Examples of types of data that may be in transit include website traffic, remote
       access traffic, data being synchronized between cloud repositories, and more. In this
       state, data can be protected by a transport encryption protocol, such as TLS or IPSec.
With data at rest, there is a greater encryption challenge than with data in transit as the
encryption keys must be kept secure for longer. Transport encryption can use ephemeral
(session) keys.
       Data in use (or data in processing)—this is the state when data is present in volatile
        memory, such as system RAM or CPU registers and cache. Examples of types of data
        that may be in use include documents open in a word processing application, database
        data that is currently being modified, event logs being generated while an operating
        system is running, and more. When a user works with data, that data usually needs to be
        decrypted as it goes from at rest to in use. The data may stay decrypted for an entire work
        session, which puts it at risk. However, trusted execution environment (TEE)
        mechanisms, such as Intel Software Guard Extensions (SGX), are able to encrypt data as it
        exists in memory, so that an untrusted process cannot decode the information.
DATA EXFILTRATION
In a workplace where mobile devices with huge storage capacity proliferate and high bandwidth
network links are readily available, attempting to prevent the loss of data by controlling the types
of storage devices allowed to connect to PCs and networks can be impractical. Unauthorized
copying or retrieval of data from a system is referred to as data exfiltration. Data exfiltration
attacks are one of the primary means for attackers to retrieve valuable data, such as personally
identifiable information (PII) or payment information, often destined for later sale on the black
market. Data exfiltration can take place via a wide variety of mechanisms, including:
       Copying the data to removable media or other device with storage, such as USB drive,
        the memory card in a digital camera, or a smartphone.
       Using a network protocol, such as HTTP, FTP, SSH, email, or Instant Messaging
        (IM)/chat. A sophisticated adversary might use a Remote Access Trojan (RAT) to perform
        transfer of data over a nonstandard network port or a packet crafter to transfer data over
        a standard port in a nonstandard way. The adversary may also use encryption to disguise
        the data being exfiltrated.
       By communicating it orally over a telephone, cell phone, or Voice over IP (VoIP) network.
        Cell phone text messaging is another possibility.
       Using a picture or video of the data—if text information is converted to an image format it
        is very difficult for a computer-based detection system to identify the original information
        from the image data.
While some of these mechanisms are simple to mitigate through the use of security tools, others
may be much less easily defeated. You can protect data using mechanisms and security controls
that you have examined previously:
       Ensure that all sensitive data is encrypted at rest. If the data is transferred outside the
        network, it will be mostly useless to the attacker without the decryption key.
       Create and maintain offsite backups of data that may be targeted for destruction or
        ransom.
       Ensure that systems storing or transmitting sensitive data are implementing access
        controls. Check to see if access control mechanisms are granting excessive privileges to
        certain accounts.
       Restrict the types of network channels that attackers can use to transfer data from the
        network to the outside. Disconnect systems storing archived data from the network.
       Train users about document confidentiality and the use of encryption to store and transmit
        data securely. This should also be backed up by HR and auditing policies that ensure
        staff are trustworthy.
Even if you apply these policies and controls diligently, there are still risks to data from insider
threats and advanced persistent threat (APT) malware. Consequently, a class of security control
software has been developed to apply access policies directly to data, rather than just the host or
network on which data is located.
DATA LOSS PREVENTION
To apply data guardianship policies and procedures, smaller organizations might classify and
type data manually. An organization that creates and collects large amounts of personal data will
usually need to use automated tools to assist with this task, however. There may also be a
requirement to protect valuable intellectual property (IP) data. Data loss prevention
(DLP) products automate the discovery and classification of data types and enforce rules so that
data is not viewed or transferred without a proper authorization. Such solutions will usually
consist of the following components:
       Policy server—to configure classification, confidentiality, and privacy rules and policies,
        log incidents, and compile reports.
       Endpoint agents—to enforce policy on client computers, even when they are not
        connected to the network.
      Network agents—to scan communications at network borders and interface with web and
       messaging servers to enforce policy.
DLP agents scan content in structured formats, such as a database with a formal access control
model, or unstructured formats, such as email or word processing documents. A file cracking
process is applied to unstructured data to render it in a consistent scannable format. The transfer
of content to removable media, such as USB devices, or by email, instant messaging, or even
social media, can then be blocked if it does not conform to a predefined policy. Most DLP
solutions can extend the protection mechanisms to cloud storage services, using either a proxy to
mediate access or the cloud service provider's API to perform scanning and policy enforcement.
Remediation is the action the DLP software takes when it detects a policy violation. The following
remediation mechanisms are typical:
      Alert only—the copying is allowed, but the management system records an incident and
       may alert an administrator.
      Block—the user is prevented from copying the original file but retains access to it. The
       user may or may not be alerted to the policy violation, but it will be logged as an incident
       by the management engine.
      Quarantine—access to the original file is denied to the user (or possibly any user). This
       might be accomplished by encrypting the file in place or by moving it to a quarantine area
       in the file system.
      Tombstone—the original file is quarantined and replaced with one describing the policy
       violation and how the user can release it again.
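To make the scan-then-remediate flow concrete, the following Python sketch looks for payment card numbers in text and decides whether a transfer is allowed. It is a toy model, not how any particular DLP product works: the pattern, function names, and result format are assumptions, and only the alert-only and block outcomes are modeled (quarantine and tombstone act on the stored file, which is out of scope here). The Luhn checksum is used to reduce false positives from arbitrary digit strings.

```python
import re
from enum import Enum


class Remediation(Enum):
    ALERT_ONLY = "alert_only"
    BLOCK = "block"
    QUARANTINE = "quarantine"
    TOMBSTONE = "tombstone"


# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,19}\b")


def luhn_valid(number: str) -> bool:
    """Check the Luhn checksum used by payment card numbers."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list:
    """Return candidate card numbers in the text that pass the Luhn check."""
    return [m.group().strip() for m in CARD_PATTERN.finditer(text)
            if luhn_valid(m.group())]


def evaluate_transfer(text: str, policy: Remediation = Remediation.BLOCK) -> dict:
    """Apply the policy to an attempted transfer of `text`."""
    if not find_card_numbers(text):
        return {"allowed": True, "incident": False, "action": None}
    # Only alert-only lets the copy proceed; every violation is logged.
    allowed = policy is Remediation.ALERT_ONLY
    return {"allowed": allowed, "incident": True, "action": policy}
```

Note that in every case where a match is found, an incident is recorded, which matches the logging behavior described for each remediation mechanism.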
When it is configured to protect a communications channel such as email, DLP remediation might
take place using client-side or server-side mechanisms. For example, some DLP solutions
prevent the actual attaching of files to the email before it is sent. Others might scan the email
attachments and message contents, and then strip out certain data or stop the email from
reaching its destination.
Some of the leading vendors include McAfee, Symantec/Broadcom, and Digital Guardian
(digitalguardian.com). A DLP and compliance solution is also available with Microsoft's Office 365
suite.
RIGHTS MANAGEMENT SERVICES
As another example of data protection and information management solutions, Microsoft provides
an Information Rights Management (IRM) feature in their Office productivity suite, SharePoint
document collaboration services, and Exchange messaging server. IRM works with the Active
Directory Rights Management Services (RMS) or the cloud-based Azure Information Protection.
These technologies provide administrators with the following functionality:
      Assign file permissions for different document roles, such as author, editor, or reviewer.
      Restrict printing and forwarding of documents, even when sent as file attachments.
      Restrict printing and forwarding of email messages.
Rights management is built into other secure document solutions, such as Adobe Acrobat.
PRIVACY ENHANCING
TECHNOLOGIES
Data minimization is the principle that data should only be processed and stored if that is
necessary to perform the purpose for which it is collected. In order to prove compliance with the
principle of data minimization, each process that uses personal data should be documented. The
workflow can supply evidence of why processing and storage of a particular field or data point is
required. Data minimization affects the data retention policy. It is necessary to track how long a
data point has been stored since it was collected and whether continued retention supports a
legitimate processing function. Another impact is on test environments, where the minimization
principle forbids the use of real data records.
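Tracking how long a data point has been stored can be reduced to comparing its collection date against a retention period. The sketch below assumes each record carries a collection date and that the policy defines a fixed period; the function name retention_expired and its parameters are hypothetical.

```python
from datetime import date, timedelta


def retention_expired(collected_on: date, retention_period: timedelta, today: date) -> bool:
    """Return True when a data point has been held longer than the policy allows."""
    return today - collected_on > retention_period
```

A record flagged this way would still need a review of whether continued retention supports a legitimate processing function before it is deleted or deidentified.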
Counterintuitively, the principle of minimization also includes the principle of sufficiency or
adequacy. This means that you should collect the data required for the stated purpose in a single
transaction to which the data subject can give clear consent. Collecting additional data later
would not be compliant with this principle.
Large data sets are often shared or sold between organizations and companies, especially within
the healthcare industry. Where these data sets contain PII or PHI, steps can be taken to remove
the personal or identifying information. These deidentification processes can also be used
internally, so that one group within a company can receive data for analysis without unnecessary
risks to privacy. Deidentification methods may also be used where personal data is collected to
perform a transaction but does not need to be retained thereafter. This reduces compliance risk
when storing data by applying minimization principles. For example, a company uses a
customer's credit card number to take payment for an order. When storing the order details, it
only keeps the final 4 digits of the card as part of the transaction log, rather than the full card
number.
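The credit card example can be sketched as a function that discards everything except the final four digits before the order is logged. The function name last_four is hypothetical, and the sketch assumes the card number arrives as a string that may contain spaces or dashes.

```python
def last_four(card_number: str) -> str:
    """Keep only the final four digits of a card number for the transaction log."""
    digits = [c for c in card_number if c.isdigit()]
    if len(digits) < 4:
        raise ValueError("card number must contain at least four digits")
    return "".join(digits[-4:])


print(last_four("4111 1111 1111 1234"))  # prints 1234
```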
A fully anonymized data set is one where individual subjects can no longer be identified, even if
the data set is combined with other data sources. Identifying information is permanently removed.
Ensuring full anonymization and preserving the utility of data for analysis is usually very difficult,
however. Consequently, pseudo-anonymization methods are typically used instead. Pseudo-
anonymization modifies or replaces identifying information so that reidentification depends on an
alternate data source, which must be kept separate. With access to the alternate data, pseudo-
anonymization methods are reversible.
It is important to note that given sufficient contextual information, a data subject can be
reidentified, so great care must be taken when applying deidentification methods for distribution
to different recipients. A reidentification attack is one that combines a deidentified data set with
other data sources, such as public voter records, to test how well the deidentification method
resists reidentification.
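A reidentification attack of this kind amounts to joining the two data sets on the fields they share. The sketch below is illustrative only: the field names (zip, birth_year, sex), the sample records, and the function link_records are all invented for the example.

```python
def link_records(deidentified, public, quasi_identifiers):
    """Return {index of deidentified record: name} for records matching exactly one person."""
    index = {}
    for person in public:
        key = tuple(person[q] for q in quasi_identifiers)
        index.setdefault(key, []).append(person["name"])

    matches = {}
    for i, record in enumerate(deidentified):
        key = tuple(record[q] for q in quasi_identifiers)
        candidates = index.get(key, [])
        if len(candidates) == 1:  # a unique match reidentifies the subject
            matches[i] = candidates[0]
    return matches


deidentified = [
    {"zip": "02138", "birth_year": 1965, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1980, "sex": "M", "diagnosis": "flu"},
]
voters = [
    {"name": "A. Example", "zip": "02138", "birth_year": 1965, "sex": "F"},
    {"name": "B. Sample", "zip": "02139", "birth_year": 1980, "sex": "M"},
    {"name": "C. Sample", "zip": "02139", "birth_year": 1980, "sex": "M"},
]
print(link_records(deidentified, voters, ["zip", "birth_year", "sex"]))
# prints {0: 'A. Example'}
```

The second record is not reidentified because two voters share the same combination of values.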
K-anonymous information is data that can be linked to two or more individuals. This
means that the data does not unambiguously reidentify a specific individual, but there may
still be a significant risk of reidentification, depending on the value of k. For example, if k=5,
any group that can be identified within the data set contains at least five individuals. NIST has
produced an overview of deidentification issues, in draft form at the time of writing.
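One way to measure k for a data set is to group records by their potentially identifying fields and take the size of the smallest group. The following is a sketch under that assumption; the function name k_anonymity and the field names used to call it are hypothetical.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0
```

A result of 1 means at least one record is unique on those fields and could be linked to a single individual.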
DATABASE DEIDENTIFICATION
METHODS
Deidentification methods are usually implemented as part of the database management system
(DBMS) hosting the data. Sensitive fields will be tagged for deidentification whenever a query or
report is run.
Data Masking
Data masking can mean that all or part of the contents of a field are redacted, by substituting
every character with "x", for example. A field might be partially redacted to preserve metadata for
analysis purposes. For example, in a telephone number, the dialing prefix might be retained, but
the subscriber number redacted. Data masking can also use techniques to preserve the original
format of the field. Data masking is an irreversible deidentification technique.
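The telephone number example can be sketched as a function that keeps a prefix and replaces the remaining letters and digits with "x", leaving separators alone so that the field's format is preserved. The function name mask and its parameters are hypothetical, and the phone number is invented.

```python
def mask(value: str, keep_prefix: int = 0, mask_char: str = "x") -> str:
    """Redact letters and digits after the first keep_prefix characters."""
    visible = value[:keep_prefix]
    hidden = "".join(mask_char if c.isalnum() else c for c in value[keep_prefix:])
    return visible + hidden


print(mask("020-7946-0123", keep_prefix=3))  # prints 020-xxxx-xxxx
```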
Tokenization
Tokenization means that all or part of data in a field is replaced with a randomly generated token.
The token is stored with the original value on a token server or token vault, separate to the
production database. An authorized query or app can retrieve the original value from the vault, if
necessary, so tokenization is a reversible technique. Tokenization is used as a substitute for
encryption, because from a regulatory perspective an encrypted field is the same value as the
original data.
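A token vault can be sketched as a two-way mapping held apart from the production data. This is an in-memory illustration only: the class name TokenVault, its methods, and the authorized flag are hypothetical, and a real vault would be a separately secured service with its own access control.

```python
import secrets


class TokenVault:
    """Maps random tokens to original values; kept separate from production data."""

    def __init__(self):
        self._by_token = {}
        self._by_value = {}

    def tokenize(self, value: str) -> str:
        """Return the existing token for a value, or issue a new random one."""
        if value in self._by_value:
            return self._by_value[value]
        token = secrets.token_hex(8)
        while token in self._by_token:  # regenerate on the unlikely collision
            token = secrets.token_hex(8)
        self._by_token[token] = value
        self._by_value[value] = token
        return token

    def detokenize(self, token: str, authorized: bool) -> str:
        """Return the original value, but only to an authorized caller."""
        if not authorized:
            raise PermissionError("caller is not authorized to detokenize")
        return self._by_token[token]
```

Because the token is random rather than derived from the value, it reveals nothing about the original without access to the vault.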
Aggregation/Banding
Another deidentification technique is to generalize the data, such as substituting a specific age
with a broader age band.
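Banding an age can be sketched as rounding down to the start of a fixed-width range. The function name age_band and the default width of 10 years are assumptions made for the example.

```python
def age_band(age: int, width: int = 10) -> str:
    """Replace a specific age with the band that contains it."""
    if age < 0:
        raise ValueError("age must not be negative")
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


print(age_band(37))  # prints 30-39
```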
Hashing and Salting
A cryptographic hash produces a fixed-length string from arbitrary-length plaintext data using an
algorithm such as SHA. If the function is secure, it should not be possible to match the hash back
to a plaintext. Hashing is mostly used to prove integrity. If two sources have access to the same
plaintext, they should derive the same hash value. Hashing is used for two main purposes within
a database:
       As an indexing method to speed up searches and provide deidentified references to
        records.
       As a storage method for data such as passwords where the original plaintext does not
        need to be retained.
A salt is an additional value that is combined with the plaintext before hashing and stored with
the hashed data field. The purpose of a salt is to frustrate attempts to crack the hashes. It means
that the attacker cannot use pre-computed tables of hashes built from dictionaries of plaintexts.
These tables would have to be recompiled to include the salt value.
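Salting can be sketched with Python's standard hashlib module. The example below prepends a random salt to the plaintext and hashes the result with SHA-256; the function names hash_with_salt and verify are hypothetical. It is meant only to show the effect of the salt. Production password storage typically uses a deliberately slow key derivation function rather than a single hash round.

```python
import hashlib
import hmac
import secrets
from typing import Optional, Tuple


def hash_with_salt(plaintext: str, salt: Optional[bytes] = None) -> Tuple[bytes, str]:
    """Return (salt, hex digest); a fresh random salt is generated if none is given."""
    if salt is None:
        salt = secrets.token_bytes(16)
    digest = hashlib.sha256(salt + plaintext.encode("utf-8")).hexdigest()
    return salt, digest


def verify(plaintext: str, salt: bytes, expected_digest: str) -> bool:
    """Recompute the hash with the stored salt and compare in constant time."""
    _, digest = hash_with_salt(plaintext, salt)
    return hmac.compare_digest(digest, expected_digest)
```

Two records holding the same plaintext receive different salts and therefore different digests, which is why a table pre-computed without the salt no longer matches.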