GUIDE TO BASIC DATA ANONYMISATION TECHNIQUES
Published 25 January 2018

TABLE OF CONTENTS

PART 1: OVERVIEW
1 Introduction
2 Purpose and Scope of This Guide
3 Terminology
PART 2: BACKGROUND
4 Data Anonymisation Concepts
5 Disclosure Risks
PART 3: BASIC DATA ANONYMISATION TECHNIQUES
6 Attribute Suppression
7 Record Suppression
8 Character Masking
9 Pseudonymisation
10 Generalisation
11 Swapping
12 Data Perturbation
13 Synthetic Data
14 Data Aggregation
PART 4: PUTTING IT TOGETHER
15 Anonymisation Methodology
16 K-anonymity – a Measure of Risk
17 Assessing the Risk of Re-identification
18 Technical Controls
19 Governance
20 Acknowledgements
Annex A: Summary of Anonymisation Techniques
Annex B: Main References

PART 1: OVERVIEW

1 Introduction
1.1. The collection, use and disclosure of individuals' personal data by organisations in Singapore is governed by the Personal Data Protection Act 2012 (the "PDPA"). The Personal Data Protection Commission ("PDPC") was established to enforce the PDPA and promote awareness of the protection of personal data in Singapore.

2 Purpose and Scope of This Guide

2.1. This Guide seeks to provide a general introduction to the technical aspects of anonymisation [Footnote 1]. It should be read together with Chapter 3 (Anonymisation) of the PDPC's Advisory Guidelines on the PDPA for Selected Topics ("Advisory Guidelines"), which sets out the PDPC's interpretation of, and considerations for determining, what constitutes "anonymisation" under the PDPA.

Footnote 1: To avoid misunderstanding, anonymisation in this Guide refers to the transformation of existing data already available to an organisation. It does not refer to the "anonymity" of individuals, where individuals attempt to hide their identity from being known.

2.2. The basic concepts and techniques discussed in this Guide make reference to the terms "data anonymisation" and "anonymised data". "Data anonymisation" refers to the conversion of personal data into "anonymised data" by applying a range of "anonymisation techniques". "Anonymised data", for the purposes of this Guide, refers to data that has undergone transformation by anonymisation techniques in combination with an assessment of the risk of re-identification. Typically, the process of data anonymisation is "irreversible": the recipient of the anonymised dataset is not able to recreate the original data. However, there may be cases where the organisation applying the anonymisation retains the ability to recreate the original data from the anonymised data; in such cases, the anonymisation process is "reversible".

2.3. In this Guide, the terms "data anonymisation" and "anonymised data" are intended to be understood generically and aligned to the technical literature on this topic. They are not intended to be understood in the same way as the terms used in the Advisory Guidelines, nor to give determinative legal effect to data that has undergone transformation by anonymisation techniques. The following diagram provides a pictorial summary of the data anonymisation concept in the Advisory Guidelines:

[Flowchart (where PD = Personal Data): Start → apply anonymisation techniques and assess the risk of re-identification → "Nil or very low risk?" If yes: the data is Anonymised Data (End). If no: "Continue efforts?" If yes: anonymise the data further using anonymisation techniques and/or apply administrative/technical/legal controls to lower the risk, then reassess. If no: the data remains PD (End).]

For more information on the PDPC's interpretation of "anonymisation" and "anonymised data", please refer to the Advisory Guidelines.

2.4. The intent of this Guide is to provide information on techniques that could be applied in anonymising data. This Guide primarily addresses organisations which do not intend to release the anonymised data into the public domain, but which share data with other organisations or entities, where additional administrative and technical controls may be imposed to reduce the risk of unauthorised disclosure of personal data. Application of these techniques does not necessarily ensure that the data poses no serious risk of re-identification, and therefore does not necessarily render it "anonymised data" to which the PDPA does not apply.

2.5. This Guide is not a substitute for professional training, literature and services.
Unless organisations are familiar with the risks and countermeasures, it is recommended that organisations seek professional advice or services for data anonymisation when disclosing anonymised data, especially if the disclosure is intended for release into the public domain, or the release involves multiple datasets or updates of anonymised data over time.

2.6. This Guide describes anonymisation techniques for static, structured, well-defined, textual and single-level datasets, whereby:

• "Static" refers to the fact that the data is fully available at the time of anonymisation. This is in contrast to streaming data, where relationships between data may not be fully established because the stream constantly provides new data; streaming data may therefore need anonymisation techniques other than those discussed in this Guide.

• "Structured" refers to the fact that the anonymisation technique is applied to data within a known format and a known location within the data pool. "Structured" is therefore not limited to data in a tabular format, as in a spreadsheet or a relational database; the data may be held or released in other defined formats, for example XML, CSV, JSON, etc. This Guide describes the techniques and provides examples in the more common tabular format, but this does not imply that the techniques apply only to tabular formats.

• "Well-defined" refers to the fact that the original dataset conforms to predefined rules; e.g. data from relational databases tends to be more well-defined. Anonymising datasets which are not well-defined may create additional challenges and is outside the scope of this Guide.

• "Textual" refers to text, numbers, dates, etc., that is, alphanumeric data already in digital form. Anonymisation of data such as audio, video, images, big data (in its raw form), geolocation, biometrics, etc. creates additional challenges and requires entirely different anonymisation techniques, which are outside the scope of this Guide.

• "Single-level" refers to data pertaining to different individuals. Datasets which contain multiple entries for the same individual (e.g. different transactions made by an individual) may still use some of the techniques explained in this Guide, but additional criteria may need to be applied; such criteria are outside the scope of this Guide.

2.7. This Guide is for persons who are responsible for data protection within an organisation, without prior knowledge or experience in data anonymisation. A basic mathematical background is required to understand some of the terminology and concepts used, and a basic understanding of risk management is needed in the application of the techniques.

2.8. While this Guide seeks to assist organisations in anonymising personal data, the Commission recognises that there is no "one size fits all" solution for organisations. Each organisation should therefore utilise anonymisation approaches that are appropriate for its circumstances.
Some factors that organisations can take into account when deciding on the anonymisation technique(s) to use include:

• the nature and type of personal data that the organisation intends to anonymise, as different anonymisation techniques are suitable for different types of data and circumstances;
• the organisation's risk management, i.e. the controls imposed to protect the anonymised data, in addition to the anonymisation techniques;
• the utility required from the anonymised data (refer to section 4 on anonymisation concepts).

3 Terminology

3.1. Due to the variance of terms and meanings used in the literature on data anonymisation, this section explains the meaning of some key terms as they are used in this Guide.

Term | Meaning in this Guide
Adversary | A party which attempts to re-identify individual(s) from a dataset that is supposed to be anonymised.
Anonymisation | The conversion of personal data into "anonymised data" by applying a range of anonymisation techniques. (This Guide focusses only on the technical aspects of this conversion.)
Anonymised dataset | The resultant dataset after anonymisation technique(s) has/have been applied in combination with adequate risk assessment.
Attribute | Also referred to as a data field, data column or variable. A piece of information that can be found across the data records in a dataset. Name, gender and address are examples of attributes.
Dataset | A set of data records. Conceptually similar to a table in a typical relational database or spreadsheet, having records (rows) and attributes (columns).
Direct identifier | A data attribute that on its own identifies an individual (e.g. fingerprint) or has been assigned to an individual (e.g. NRIC number).
Equivalence class | The records in a dataset that share the same values within certain attributes, typically indirect identifiers.
Identifiability vs re-identifiability | The degree to which an individual can be identified from one or more datasets containing direct and indirect identifiers, versus the degree to which an individual can be identified from anonymised dataset(s).
Indirect identifier | Also referred to as a quasi-identifier. A data attribute that, on its own, does not identify an individual, but may identify an individual when combined with other information.
Non-identifier | A data attribute which is neither categorised as a direct nor an indirect identifier. Such attributes need not undergo anonymisation. (Note that the examples provided in this Guide do not include such attributes, but this does not mean they cannot be part of the anonymised data.)
Original dataset | The dataset before any anonymisation technique is applied.
Pseudonymisation [Footnote 2] | The technique of replacing an identifier with an unrelated yet typically still unique value, e.g. replacing "Joshua Quek" with "274927473".
Record | Also referred to as a row. A group of information typically relating to a subject (e.g. an individual) or transaction.
Re-identification | Identifying a person from an anonymised dataset. Spontaneous re-identification refers to unintended re-identification due to having special knowledge of individuals.

Footnote 2: Some literature (e.g. "Opinion 05/2014 on Anonymisation Techniques" by the Article 29 Data Protection Working Party) emphasises the risk of using pseudonyms as an anonymisation technique. In this Guide, pseudonymisation is not excluded from the anonymisation techniques, because it may still serve its purpose when applied diligently.

Additional notes on terminology

3.2. Chapter 5 of the "Advisory Guidelines on Key Concepts in the PDPA" clarifies what "identifiers" are. Those Guidelines use the term "unique identifier", which is equivalent to the term "direct identifier" used in this Guide.
The term "direct identifier" is used instead of "unique identifier" in this Guide, as the former is more commonly used in the area of data anonymisation.

3.3. The Advisory Guidelines do not provide a specific term equivalent to "indirect identifier", but explain, based on an example, that "although each of these data points, on its own, would not be able to identify an individual", the organisation "should be mindful that the dataset" (i.e. the data points in combination) "may be able to identify the respondent". They also clarify that "so long as any combination of data contains a unique identifier of an individual, that combination of data will constitute personal data".

3.4. Note also that there is no common term in typical anonymisation literature to describe the third type of data, referred to in this Guide as "non-identifiers". Such non-identifiers would not be considered personal data if they were isolated from any direct and indirect identifiers (i.e. not all data is necessarily personal data). But once they are linked to direct or indirect identifiers, they need to be protected and treated just like personal data. As long as the use or appearance of such data within an anonymised dataset does not violate any of the other PDPA obligations, they need not be further anonymised, as they would not be able to identify an individual. [Footnote 3]

Footnote 3: For example: a car dealer, for the purpose of utilising Artificial Intelligence and Machine Learning, has very detailed customer records, e.g. down to the colour of the car purchased and the year of the tyre production. The car producer wants to determine which default car colour should be produced in greater quantity. After anonymising (e.g. suppressing) the direct and indirect identifiers, the car dealer can share the resulting data (e.g. containing the purchaser's gender, car colour, tyre production date, etc.) with the car producer without the need to apply further anonymisation techniques (e.g. no need to generalise the tyre production date or k-anonymise the dataset). However, the car dealer can only proceed in this manner when, among others, it is established that: (a) the remaining data in general does not constitute direct or indirect identifiers; (b) the use and sharing of this raw data does not contravene any of the other obligations (e.g. consent); and (c) individual, specific records (e.g. a custom-produced unique car colour) are removed.

3.5. It is not the intent of this Guide to define which of the three types are personal data under the PDPA and which are not; but for the purpose of discussing anonymisation techniques, this additional distinction is important. This Guide therefore follows the common terminology in the anonymisation literature, using "direct" and "indirect" identifiers, and referring to "data points" as data fields or attributes.

3.6. Similarly, it should be noted that this Guide does not differentiate between "data" and "metadata"; the techniques can (and, where needed, should) be applied to metadata and any other type of data as well. However, the anonymisation of a specific kind of metadata within the dataset itself, namely column header names in spreadsheets or tags in XML files, is not discussed, as only a few techniques would apply to this type of data.

PART 2: BACKGROUND

4 Data Anonymisation Concepts

4.1. Data anonymisation requires a good understanding of the following elements, which should be taken into consideration when determining suitable anonymisation techniques and an appropriate anonymisation level:

a. Purpose of anonymisation and utility: The purpose of the anonymisation should be clear, because anonymisation should be done specifically for the purpose at hand. The process of anonymisation, regardless of the techniques used, reduces the information in the original dataset to some extent. Hence, generally, as the extent of anonymisation increases, the utility (e.g. clarity and/or precision) of the dataset decreases.
The organisation therefore needs to decide on the trade-off between acceptable (or expected) utility and reducing the risk of re-identification, i.e. the risk that the data subject is identified from data that is supposed to be anonymised. It should be noted that utility should not be measured at the level of the entire dataset; it is typically different for different attributes. At one extreme, a specific attribute is the main item of interest and no generalisation/anonymisation technique should be applied to it (e.g. because data accuracy is crucial); at the other extreme, a certain attribute is of no use for the intended purpose and may be dropped entirely without affecting the utility of the data to the recipient. Another important consideration in terms of utility is whether it poses an additional risk if the recipient knows which anonymisation technique, and what degree of granularity, have been applied. On the one hand, this might help the analyst understand or interpret the results better; on the other hand, it might contain hints which could lead to a higher risk of re-identification (some outcomes, however, simply cannot hide their granularity, e.g. k-anonymity).

b. Characteristics of anonymisation techniques: The different characteristics of the various anonymisation techniques mean that certain techniques may be more suitable for a situation than others. For instance, certain techniques (e.g. character masking) are usually used on direct identifiers and others (e.g. aggregation) on indirect identifiers. Another example is to consider whether the attribute value is a continuous value or a discrete (e.g. "yes" or "no") value, because techniques like data perturbation work much better for continuous values. The various anonymisation techniques also modify data in significantly different ways. Some modify only parts of an attribute (e.g. character masking); some replace the value of an attribute across multiple records (e.g. aggregation); some replace the entire attribute with unrelated, but consistent, information (e.g. pseudonymisation); and some remove the attribute entirely (e.g. attribute suppression). Some anonymisation techniques can be used in combination, e.g. suppressing or removing (outlier) records after generalisation is done.

c. Inferred information: It may be possible for certain information to be inferred from anonymised data. E.g.
masking may hide personal data, but it does not hide the length of the original data in terms of the number of characters. The problem of inference is not limited to a single attribute; it may also apply across attributes, even if all of them have had anonymisation techniques applied. The anonymisation process must therefore take note of every such possibility, both before deciding on the actual techniques and after applying them. It may also be worth considering the order in which the anonymised data is presented: if the recipient knows that the data records were collected in serial order (e.g. visitors as they arrive), it might be prudent (as long as it does not affect the utility) to reshuffle the entire dataset to avoid inference based on the order of the data records.

d. Expertise with the subject matter: Anonymisation techniques basically reduce the "identifiability" of one or more individuals from the original dataset to a level acceptable under the organisation's risk portfolio. An "identifiability" assessment should be performed before and after anonymisation techniques are applied, and this requires a good understanding of the subject matter to which the data pertains. The assessment before the anonymisation process ensures that the structure and information within an attribute are clearly identified and understood, and that the risk of explicit and implicit inference from such data is assessed; e.g. an attribute containing the year of birth implicitly provides age, somewhat similar to an NRIC number. The assessment after the anonymisation process determines the residual risk of re-identification. Hence, if the dataset is healthcare data, it likely requires someone with sufficient healthcare knowledge to assess how unique (i.e. how identifiable) a record is. As another example, where a synthetic dataset is created or data attributes are swapped between records, it takes a subject matter expert to recognise whether the anonymised records even make sense. The right choice of anonymisation techniques therefore depends on awareness of the explicit and implicit information contained in the dataset, and the amount or type of information intended to be anonymised.

e. Competency in anonymisation process and techniques: Anonymisation is complex. Besides having subject matter expertise (as explained above), organisations wishing to share anonymised datasets should also ensure that the anonymisation process is undertaken by persons well-versed in anonymisation techniques and principles. If the necessary expertise is not found within the organisation, external help should be engaged.

f. The recipient: Factors such as the recipient's expertise with the subject matter, and the controls implemented to limit the recipients and to prevent the data from being shared with unauthorised parties, play an important role in the choice of the anonymisation techniques. In particular, the expected use of the anonymised data by the recipient may impose limitations on the applied techniques, because the utility of the data may otherwise be lost beyond acceptable limits. Extreme caution needs to be taken when making public releases of data, which require a much stronger form of anonymisation compared to data shared under a contractual arrangement.
g. Tools: Due to the complexity and computation required, software tools can be very useful in executing anonymisation techniques. There are some dedicated tools available [Footnote 4], but this Guide does not provide any assessment or recommendation of anonymisation or re-identification assessment tools. Note that even the best tools need adequate inputs (e.g. appropriate parameters) and may contain limitations; hence human oversight, and familiarity with the tools and the data, are still required.

Footnote 4: Anonymisation tools include ARGUS, sdcMicro, ARX, Privacy Analytics Eclipse and Arcad DOT-Anonymizer.

5 Disclosure Risks

5.1. There are various types of disclosure risks. This section explains some fundamental ones to facilitate further discussion of data anonymisation.

• Identity disclosure (re-identification): determining, with a high level of confidence, the identity of an individual described by a specific record. This could arise from scenarios such as insufficient anonymisation, re-identification by linking, or pseudonym reversal, e.g. an anonymisation process which creates pseudonyms based on an easily guessable and reversible algorithm, such as replacing '1' with 'a', '2' with 'b', and so on.

• Attribute disclosure: determining, with a high level of confidence, that an attribute described in the dataset belongs to a specific individual, even if the individual's record cannot be distinguished. E.g. a dataset containing anonymised client records of a particular aesthetic surgeon reveals that all his clients below the age of 30 have undergone a particular procedure. If it is known that a particular individual is 28 years old and is a client of this surgeon, we then know that this individual has undergone the particular procedure, even though the individual's record cannot be distinguished from others in the anonymised dataset.

• Inference disclosure: making an inference, with a high level of confidence, about an individual, even if he/she is not in the dataset, from the statistical properties of the dataset. E.g. if a dataset released by a medical researcher reveals that 70% of individuals aged above 75 have a certain medical condition, this could be inferred about an individual who is not even in the dataset.

5.2. In general, most traditional anonymisation techniques aim to protect against identity disclosure, and not necessarily the other types of disclosure risks.

PART 3: BASIC DATA ANONYMISATION TECHNIQUES

6 Attribute Suppression

6.1. Description: Attribute suppression refers to the removal of an entire attribute (also referred to as a "column" in databases and spreadsheets) from a dataset.

6.2. When to use it: When an attribute is not required in the anonymised dataset, or when the attribute cannot otherwise be suitably anonymised with another technique. This technique should be applied at the start of the anonymisation process, as it is an easy way to decrease identifiability at that point.

6.3. How to use it: Delete (i.e. remove) the attribute(s), or, if the structure of the dataset needs to be maintained, clear the data (and possibly the header). Note that the suppression should be actual, permanent removal, and not just "hiding the column" [Footnote 5]. Similarly, "redacting" may not be sufficient if the underlying data remains somehow accessible.

Footnote 5: A function found in spreadsheet software.

Other tips:

6.4. This is the strongest type of anonymisation technique, because there is no way of recovering any information from such an attribute.

6.5. In certain scenarios, it may be possible to create a "derived attribute" that provides utility and yet is less sensitive than the original attribute(s), which can then be suppressed. E.g. a "duration in premises" attribute can be created from the "date & time of entry" and "date & time of exit" attributes.

6.6. Example: In this example, the dataset consists of test scores. As the recipient only needs to analyse the test scores obtained by students with respect to their various trainers, without analysing the students themselves, the "Student" attribute was removed.

Before anonymisation:

Student | Trainer | Test Score
John | Tina | 87
Yong | Tina | 56
Ming | Tina | 92
Poh | Huang | 83
Linnie | Huang | 45
Jake | Huang | 67

After suppressing the "Student" attribute:

Trainer | Test Score
Tina | 87
Tina | 56
Tina | 92
Huang | 83
Huang | 45
Huang | 67
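To illustrate, the following is a minimal sketch in Python (assuming the pandas library and the example data above; neither is a prescribed implementation) that suppresses the "Student" attribute by permanently dropping the column, rather than merely hiding it:

```python
import pandas as pd

# Example dataset from section 6.6.
df = pd.DataFrame({
    "Student": ["John", "Yong", "Ming", "Poh", "Linnie", "Jake"],
    "Trainer": ["Tina", "Tina", "Tina", "Huang", "Huang", "Huang"],
    "Test Score": [87, 56, 92, 83, 45, 67],
})

# Attribute suppression: remove the identifying column entirely.
# (In contrast, "hiding" a column in a spreadsheet leaves the data present.)
anonymised = df.drop(columns=["Student"])

print(anonymised)
```

Note that the original dataframe, and any file it was loaded from, must still be protected or disposed of; dropping the column in a copy does not remove the source data.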
7 Record Suppression

7.1. Description: Record suppression refers to the removal of an entire record from a dataset. In contrast to most other techniques, this technique affects multiple attributes at the same time.

7.2. When to use it: To remove outlier records which are unique, or which do not meet other criteria such as k-anonymity, and which are not to be kept in the anonymised dataset. Outliers can lead to easy re-identification. It can be applied before or after other techniques (e.g. generalisation) have been applied.

7.3. How to use it: Delete the entire record. Note that the suppression should be permanent, and not just a "hide row" function [Footnote 6]; similarly, "redacting" may not be sufficient if the underlying data remains accessible.

Footnote 6: A function found in spreadsheet software.

Other tips:

7.4. Refer to the example in the section on generalisation for an illustration of how record suppression is used.

7.5. Note that the removal of a record can impact the dataset, e.g. in terms of statistics such as the average and median.

8 Character Masking

8.1. Description: Character masking is the replacement of characters of a data value, e.g. with a constant symbol (such as "*" or "x"). Masking is typically partial, i.e. applied only to some characters of the attribute.

8.2. When to use it: When the data value is a string of characters and hiding part of it is sufficient to provide the extent of anonymity required.

8.3. How to use it: Depending on the nature of the attribute, replace the appropriate characters with a chosen symbol. Depending on the attribute type, you may decide to replace a fixed number of characters (e.g. for credit card numbers) or a variable number of characters (e.g. for email addresses).

Other tips:

8.4. Note that masking may need to take into account whether the length of the original data provides information about the original data. Subject matter knowledge is critical, especially for partial masking, to ensure the right characters are masked. Special consideration may also apply to checksums within the data; sometimes the checksum could be used to recover (other parts of) the masked data. As for complete masking, the attribute could alternatively be suppressed, unless the length of the data is of some relevance.

8.5. The scenario of masking data in such a way that data subjects are meant to recognise their own data is a special one, and does not belong to the usual objectives of data anonymisation. An example of this is the publishing of lucky draw results, whereby typically the names and partially masked NRIC numbers of lucky draw winners are published for the individuals to recognise themselves as winners. Note that, generally, anonymised data should not be recognisable even to the data subjects themselves.

8.6. Example: This example shows an online grocery store conducting a study of its delivery demand from historical data, in order to improve operational efficiency. The company masked out the last four digits of the postal codes, leaving the first two digits, which correspond to the "sector code" within Singapore.

Before anonymisation:

Postal Code | Favourite Delivery Time Slot | Average No. of Orders Per Month
100111 | 8 pm to 9 pm | 2
200222 | 11 am to 12 noon | 8
300333 | 2 pm to 3 pm | 1

After partial masking of postal code:

Postal Code | Favourite Delivery Time Slot | Average No. of Orders Per Month
10xxxx | 8 pm to 9 pm | 2
20xxxx | 11 am to 12 noon | 8
30xxxx | 2 pm to 3 pm | 1
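A minimal sketch of the postal code masking above, in Python (standard library only; the keep-two-digits rule is specific to this example, not a general recommendation):

```python
def mask_postal_code(code: str, keep: int = 2, symbol: str = "x") -> str:
    """Keep the first `keep` characters (the sector code) and mask the rest."""
    return code[:keep] + symbol * (len(code) - keep)

records = [
    ("100111", "8 pm to 9 pm", 2),
    ("200222", "11 am to 12 noon", 8),
    ("300333", "2 pm to 3 pm", 1),
]

masked = [(mask_postal_code(pc), slot, orders) for pc, slot, orders in records]
for row in masked:
    print(row)  # e.g. ('10xxxx', '8 pm to 9 pm', 2)
```

As section 8.4 notes, this style of masking preserves the length of the original value, which can itself leak information for variable-length data such as email addresses.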
9 Pseudonymisation

9.1. Description: The replacement of identifying data with made-up values. Pseudonymisation is also referred to as coding. Pseudonyms can be irreversible, where the original values are properly disposed of and the pseudonymisation was done in a non-repeatable fashion; or reversible (by the owner of the original data), where the original values are securely kept but can be retrieved and linked back to the pseudonyms, should the need arise [Footnote 7].

Footnote 7: For example, in the event that a research study yields results that would provide a useful warning to a data subject.

9.2. Persistent pseudonyms allow linkage, by using the same pseudonym values to represent the same individual across different datasets. On the other hand, different pseudonyms may be used to represent the same individual in different datasets, to prevent the linking of those datasets.

9.3. Pseudonyms can be generated randomly or deterministically.

9.4. When to use it: When data values need to be uniquely distinguished and no characters or other implied information of the original attribute are to be kept.

9.5. How to use it: Replace the respective attribute values with made-up values. One way to do this is to pre-generate a list of made-up values and randomly select from this list to replace each of the original values. The made-up values should be unique, and should have no relationship to the original values, such that one cannot derive the original values from the pseudonyms.

Other tips:

9.6. When allocating pseudonyms, ensure not to re-use pseudonyms that have already been utilised (especially when they are randomly generated). Also avoid using the exact same pseudonym generator over several attributes without a change (e.g. at least use a different random seed).

9.7. Persistent pseudonyms usually provide better utility by maintaining referential integrity across datasets.

9.8. For reversible pseudonyms, the identity database cannot be shared with the recipient; it should be securely kept and should only be used by the organisation to resolve specific queries (the number of such queries must, however, be controlled, otherwise they could be used to "decode" the entire pseudonymisation).

9.9. Similarly, if encryption is used, the encryption key cannot be shared, and in fact must be securely protected from unauthorised access, because a leak of such a key could result in a data breach by enabling the reversal of the encryption. The same applies to pseudo-random number generators, which require a seed. The security of any key used must be ensured, as with any other type of encryption or reversible process [Footnote 8].

Footnote 8: Note that relying on a proprietary or "secret" reversal process (with or without a key) is likely more prone to decoding and the risk of being broken than relying on standard key-based encryption.

9.10. If encryption is used, review the method of encryption (e.g. algorithm and key length) periodically to ensure that it is recognised by the industry as relevant and secure.

9.11. In some cases, pseudonyms may need to follow the structure or data type of the original value (e.g. for the pseudonyms to be usable in software applications, or simply to look more similar to the original attribute). In such cases, special pseudonym generators, such as those used to create synthetic data, may be needed; alternatively, so-called "format-preserving encryption" can be considered, which creates pseudonyms that have the same format as the original data.
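Before the worked example in 9.12 below, the following is a minimal sketch (Python, standard library only) of random, persistent pseudonym generation; the six-digit format simply mirrors that example. The mapping it builds is what section 9.12 calls the identity database, and must be protected accordingly:

```python
import secrets

def make_pseudonym_map(values, digits=6):
    """Assign each distinct value a unique, random, unrelated pseudonym."""
    mapping = {}
    used = set()
    for value in values:
        if value in mapping:
            continue  # persistent: same individual -> same pseudonym
        while True:
            candidate = "".join(secrets.choice("0123456789") for _ in range(digits))
            if candidate not in used:  # never re-use a pseudonym (see 9.6)
                break
        used.add(candidate)
        mapping[value] = candidate
    return mapping

names = ["Joe Phang", "Zack Lim", "Eu Cheng San"]
identity_map = make_pseudonym_map(names)   # keep securely if reversibility is needed
pseudonymised = [identity_map[n] for n in names]
```

For irreversible pseudonymisation, the identity map (and the original values) would be securely disposed of after the replacement.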
9.12. Example: This example shows pseudonymisation being applied to the names of persons who obtained their driving licences, together with some information about them. In this example, the names were replaced with pseudonyms instead of the attribute being suppressed, because the organisation wanted to be able to reverse the pseudonymisation if necessary.

Before anonymisation:

Person | Pre-Assessment Result | Hours of Lessons Taken Before Passing
Joe Phang | A | 20
Zack Lim | B | 26
Eu Cheng San | C | 30
Linnie Mok | D | 29
Jeslyn Tan | B | 32
Chan Siew Lee | A | 25

After pseudonymising the "Person" attribute:

Person | Pre-Assessment Result | Hours of Lessons Taken Before Passing
416765 | A | 20
562396 | B | 26
964825 | C | 30
873892 | D | 29
239976 | B | 32
943145 | A | 25

For reversible pseudonymisation, the identity database is securely kept in case there is a future legitimate need to identify individuals. Security controls (including administrative and technical ones) should also be used to protect the identity database.

Identity database (single coding):

Pseudonym | Person
416765 | Joe Phang
562396 | Zack Lim
964825 | Eu Cheng San
873892 | Linnie Mok
239976 | Jeslyn Tan
943145 | Chan Siew Lee

9.13. Example: For added security regarding the identity database, double coding can be used. Continuing from the previous example, this example shows the additional linking database, which is placed with a trusted third party. With double coding, the identities of the individuals can only be known when both the trusted third party (having the linking database) and the organisation (having the identity database) put their databases together.

After anonymisation:

Person | Pre-Assessment Result | Hours of Lessons Taken Before Passing
373666 | A | 20
594824 | B | 26
839933 | C | 30
280074 | D | 29
746791 | B | 32
785282 | A | 25

Linking database (securely kept by a trusted third party only; even the organisation will remove it eventually. The third party is not given any other information):

Pseudonym | Interim Pseudonym
373666 | OQCPBL
594824 | ALGKTY
839933 | CGFFNF
280074 | BZMHCP
746791 | RTJYGR
785282 | RCNVJD

Identity database (securely kept by the organisation):

Interim Pseudonym | Person
OQCPBL | Joe Phang
ALGKTY | Zack Lim
CGFFNF | Eu Cheng San
BZMHCP | Linnie Mok
RTJYGR | Jeslyn Tan
RCNVJD | Chan Siew Lee

Note: In both the linking database and the identity database, it is good practice to scramble the order of the records rather than leave them in the same order as the dataset. In this example, the two are left in the original order for easier visualisation.
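A minimal sketch of the double-coding flow above (Python, standard library; the pseudonym formats mirror the example, and collision checks are omitted for brevity). It shows how re-identification requires joining the two separately held databases:

```python
import secrets
import string

def random_token(alphabet, length):
    return "".join(secrets.choice(alphabet) for _ in range(length))

people = ["Joe Phang", "Zack Lim", "Eu Cheng San"]

# Organisation keeps: interim pseudonym -> person (identity database).
identity_db = {random_token(string.ascii_uppercase, 6): p for p in people}

# Trusted third party keeps: released pseudonym -> interim pseudonym
# (linking database); the third party never sees the identities.
linking_db = {random_token(string.digits, 6): interim for interim in identity_db}

# Re-identification of a released pseudonym needs BOTH databases:
released = next(iter(linking_db))
person = identity_db[linking_db[released]]
```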
10 Generalisation

10.1. Description: A deliberate reduction in the precision of data, e.g. converting a person's age into an age range, or a precise location into a less precise location. This technique is also referred to as recoding.

10.2. When to use it: For values that can be generalised and still be useful for the intended purpose.

10.3. How to use it: Design appropriate data categories and rules for translating the data. Consider suppressing any records that still stand out after the translation (i.e. the generalisation).

Other tips:

10.4. Design the data ranges with appropriate sizes. Data ranges that are too large may mean that the data is modified too much, while data ranges that are too small may mean that the data is hardly modified and therefore still easy to re-identify. If k-anonymity is used, the k value chosen will affect the data ranges too. Note that the first and last ranges may be larger, to accommodate the typically lower number of records at these ends; this is often referred to as top/bottom coding.

10.5. Example: In this example, the dataset contains the person's name (which has already been pseudonymised), age in years, and residential address.

Before anonymisation:

S/n | Person | Age | Address
1 | 357703 | 24 | 700 Toa Payoh Lorong 5
2 | 233121 | 31 | 800 Ang Mo Kio Avenue 12
3 | 938637 | 44 | 900 Jurong East Street 70
4 | 591493 | 29 | 750 Toa Payoh Lorong 5
5 | 202626 | 23 | 5 Tampines Street 90
6 | 888948 | 75 | 1 Stonehenge Road
7 | 175878 | 28 | 10 Tampines Street 90
8 | 312304 | 50 | 50 Jurong East Street 70
9 | 214025 | 30 | 720 Toa Payoh Lorong 5
10 | 271714 | 37 | 830 Ang Mo Kio Avenue 12
11 | 341338 | 22 | 15 Tampines Street 90
12 | 529057 | 25 | 18 Tampines Street 90
13 | 390438 | 39 | 840 Ang Mo Kio Avenue 12

For the age, the approach taken is to generalise into the following age ranges: < 20, 21-30, 31-40, 41-50, 51-60, > 60.

For the address, one possible approach is to remove the block/house number and retain only the road name.

After generalisation of Age and Address:

S/n | Person | Age | Address
1 | 357703 | 21-30 | Toa Payoh Lorong 5
2 | 233121 | 31-40 | Ang Mo Kio Avenue 12
3 | 938637 | 41-50 | Jurong East Street 70
4 | 591493 | 21-30 | Toa Payoh Lorong 5
5 | 202626 | 21-30 | Tampines Street 90
6 | 888948 | >60 | Stonehenge Road
7 | 175878 | 21-30 | Tampines Street 90
8 | 312304 | 41-50 | Jurong East Street 70
9 | 214025 | 21-30 | Toa Payoh Lorong 5
10 | 271714 | 31-40 | Ang Mo Kio Avenue 12
11 | 341338 | 21-30 | Tampines Street 90
12 | 529057 | 21-30 | Tampines Street 90
13 | 390438 | 31-40 | Ang Mo Kio Avenue 12

Suppose there is, in fact, only one residential unit on Stonehenge Road. The exact address can then be derived even though the data has gone through generalisation; this could be considered still "too unique". Hence, as a next step, record 6 could be removed (i.e. using the record suppression technique), as the address is still too unique after removing the unit number. Alternatively, all the addresses could be generalised to a greater extent (e.g. to town or district level) such that suppression is not needed, but this might affect the utility of the data much more than suppressing a few records from the dataset.
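A minimal sketch of the recoding rules above (Python, standard library; the bin boundaries are those of the example). Note that the example's first band, "< 20", would leave age 20 unassigned, so this sketch folds it into the bottom band:

```python
def generalise_age(age: int) -> str:
    """Recode an exact age into the example's bands, with top/bottom coding."""
    if age <= 20:
        return "<= 20"
    if age > 60:
        return "> 60"
    lower = ((age - 21) // 10) * 10 + 21   # 21, 31, 41 or 51
    return f"{lower}-{lower + 9}"

def generalise_address(address: str) -> str:
    """Drop the leading block/house number, keeping only the road name."""
    return address.split(" ", 1)[1]

print(generalise_age(24), "|", generalise_address("700 Toa Payoh Lorong 5"))
# -> 21-30 | Toa Payoh Lorong 5
```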
11 Swapping

11.1. Description: The purpose of swapping is to rearrange data in the dataset such that the individual attribute values are still represented in the dataset, but generally do not correspond to the original records. This technique is also referred to as shuffling or permutation.

11.2. When to use it: When subsequent analysis only needs to look at aggregated data, or the analysis is at the intra-attribute level; in other words, there is no need to analyse relationships between attributes at the record level.

11.3. How to use it: First, identify which attributes to swap. Then, for each of them, swap or reassign the attribute values among the records in the dataset.

11.4. Other tips: Assess and decide which attributes (columns) need to be swapped. Depending on the situation, organisations may decide that, for instance, only attributes containing values that are relatively identifying need to be swapped.

11.5. Example: In this example, the dataset contains customer records of a business organisation.

Before anonymisation:

Person | Job Title | Date of Birth | Membership Type | Average Visits per Month
A | University dean | 3 Jan 1970 | Silver | 0
B | Salesman | 5 Feb 1972 | Platinum | 5
C | Lawyer | 7 Mar 1985 | Gold | 2
D | IT professional | 10 Apr 1990 | Silver | 1
E | Nurse | 13 May 1995 | Silver | 2

After anonymisation (in this example, the values of all attributes have been swapped):

Person | Job Title | Date of Birth | Membership Type | Average Visits per Month
A | Lawyer | 10 Apr 1990 | Silver | 1
B | Nurse | 7 Mar 1985 | Silver | 2
C | Salesman | 13 May 1995 | Platinum | 5
D | IT professional | 3 Jan 1970 | Silver | 2
E | University dean | 5 Feb 1972 | Gold | 0

Note: On the other hand, if the purpose of the anonymised dataset is to study the relationships between job profile and consumption patterns, other methods of anonymisation may be more suitable, e.g. generalisation of job titles, which could result in "university dean" being modified to "educator".
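A minimal sketch of per-attribute swapping (Python, standard library). Each chosen column is shuffled independently, so column-level statistics are preserved while record-level relationships are broken:

```python
import random

records = [
    {"Person": "A", "Job Title": "University dean", "Membership Type": "Silver"},
    {"Person": "B", "Job Title": "Salesman", "Membership Type": "Platinum"},
    {"Person": "C", "Job Title": "Lawyer", "Membership Type": "Gold"},
]

def swap_attributes(rows, attributes, rng=random):
    """Shuffle the values of each listed attribute independently across records."""
    for attr in attributes:
        values = [row[attr] for row in rows]
        rng.shuffle(values)  # intra-attribute permutation
        for row, value in zip(rows, values):
            row[attr] = value
    return rows

swapped = swap_attributes(records, ["Job Title", "Membership Type"])
```

Shuffling each attribute separately (rather than shuffling whole rows) is what detaches the values from their original records.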
12 Data Perturbation

12.1. Description: The values from the original dataset are modified to be slightly different.

12.2. When to use it: For quasi-identifiers (typically numbers and dates) which may potentially be identifying when combined with other data sources, and where slight changes in value are acceptable. This technique should not be used where data accuracy is crucial.

12.3. How to use it: This depends on the exact data perturbation technique used; such techniques include rounding and adding random noise. The example in this section shows base-x rounding.

Other tips:

12.4. The degree of perturbation should be proportionate to the range of values of the attribute. If the base is too small, the anonymisation effect will be weaker; on the other hand, if the base is too large, the end values will be too different from the original, and the utility of the dataset will likely be reduced.

12.5. Note that where computation is performed on attribute values which have been perturbed, the resulting value may experience perturbation to an even larger extent.

12.6. Example: In this example, the dataset contains information to be used for research on a possible linkage between a person's height, weight, age, whether the person smokes, and whether the person has "disease A" and/or "disease B". The person's name has already been pseudonymised.

The following rounding is then applied:

Attribute | Anonymisation technique
Height (in cm) | Base-5 rounding (5 is chosen to be somewhat proportionate to the typical height range of, e.g., 120 to 190 cm)
Weight (in kg) | Base-3 rounding (3 is chosen to be somewhat proportionate to the typical weight range of, e.g., 40 to 100 kg)
Age (in years) | Base-3 rounding (3 is chosen to be somewhat proportionate to the typical age range of, e.g., 10 to 100 years)
(the remaining attributes) | Nil, as they are non-numerical and difficult to modify without substantial change in value

Dataset before anonymisation:

Person | Height (cm) | Weight (kg) | Age (years) | Smokes? | Disease A? | Disease B?
198740 | 160 | 50 | 30 | No | No | No
287402 | 177 | 70 | 36 | No | No | Yes
398747 | 158 | 46 | 20 | Yes | Yes | No
498732 | 173 | 75 | 22 | No | No | No
598772 | 169 | 82 | 44 | Yes | Yes | Yes

Dataset after anonymisation (the Height, Weight and Age columns are the affected attributes):

Person | Height (cm) | Weight (kg) | Age (years) | Smokes? | Disease A? | Disease B?
198740 | 160 | 51 | 30 | No | No | No
287402 | 175 | 69 | 36 | No | No | Yes
398747 | 160 | 45 | 18 | Yes | Yes | No
498732 | 175 | 75 | 21 | No | No | No
598772 | 170 | 81 | 42 | Yes | Yes | Yes

Note: For base-x rounding, the attribute values to be rounded are rounded to the nearest multiple of x.
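A minimal sketch of base-x rounding (Python, standard library), implementing the "nearest multiple of x" rule stated in the note above. (Adding random noise would be an alternative perturbation technique; note also that one or two values in the sample table above deviate slightly from strict nearest-multiple rounding.)

```python
def base_x_round(value: float, base: int) -> int:
    """Round a value to the nearest multiple of `base`."""
    return base * round(value / base)

record = {"Height (cm)": 177, "Weight (kg)": 70, "Age (years)": 36}
bases = {"Height (cm)": 5, "Weight (kg)": 3, "Age (years)": 3}

perturbed = {attr: base_x_round(val, bases[attr]) for attr, val in record.items()}
print(perturbed)  # {'Height (cm)': 175, 'Weight (kg)': 69, 'Age (years)': 36}
```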
13 Synthetic Data

13.1. Description: This technique differs slightly from the other techniques described in this Guide, as it is mainly used to generate synthetic datasets directly and separately from the original data, instead of modifying the original dataset.

13.2. When to use it: Typically, when a large amount of data is required for system testing, but the actual data cannot be used, and yet the data should be "realistic" in certain aspects, such as format, relationships among attributes, etc.

13.3. How to use it: Study the patterns of the original dataset (i.e. the actual data) and apply those patterns when creating the "anonymised" dataset (i.e. the synthetic data). The degree to which the patterns of the original dataset need to be replicated depends on how the anonymised dataset is to be used.

Other tips:

13.4. Depending on the test scope and the administrative controls, fully or partially synthetic data can be generated; e.g. where tests are conducted which need to reference other datasets, those few items being tested need to remain in their original form, but the other information could be synthetic.

13.5. While with the other techniques the anonymised data typically has the same, or about the same, volume as the original data (e.g. when suppression or aggregation is applied), synthetic data can be generated in any volume, as needed.

13.6. When applying this technique, outliers may need additional attention. For testing purposes, outliers are often very valuable, but outliers in the synthetic data may also indicate certain outliers within the original dataset. It is therefore recommended to create outliers in the synthetic data intentionally and independently of the original data.

13.7. This technique is of rather little utility for data analysis, because the data is not "real" and was created based on a pre-conceived model.

13.8. Example: In this example, an office facility which specialises in providing "hot-desking" facilities keeps records of the times at which users start and stop using its facilities. It would like to create a set of synthetic data to perform simulation testing of a new facility allocation algorithm. A detailed discussion of statistical measures is beyond the scope of this Guide; in this example, some possible measures could be the average or median number of users during each hour of the day.

Original dataset:

User | Date | Time in | Time out
User A | 1-Mar-17 | 8:27 | 18:04
User A | 2-Mar-17 | 8:20 | 18:10
User B | 1-Mar-17 | 8:45 | 17:17
User B | 2-Mar-17 | 8:55 | 17:54
User C | 1-Mar-17 | 13:18 | 15:48
User C | 2-Mar-17 | 13:02 | 16:02
User D | 1-Mar-17 | 17:55 | 7:31
User D | 2-Mar-17 | 18:04 | 7:39
(etc.) | (etc.) | (etc.) | (etc.)

Statistics obtained from the original dataset:

Start Time | End Time | Average No. of Users
0:00 | 1:00 | 130
1:00 | 2:00 | 98
2:00 | 3:00 | 102
3:00 | 4:00 | 95
4:00 | 5:00 | 84
5:00 | 6:00 | 72
6:00 | 7:00 | 62
7:00 | 8:00 | 144
8:00 | 9:00 | 450
9:00 | 10:00 | 506
(etc.) | (etc.) | (etc.)
22:00 | 23:00 | 138
23:00 | 0:00 | 132

Synthetic dataset (for 1 day):

User | Date | Time in | Time out
100001 | 3-Apr-17 | 8:25 | 17:53
100002 | 3-Apr-17 | 8:00 | 18:04
100003 | 3-Apr-17 | 8:12 | 18:48
100004 | 3-Apr-17 | 8:49 | 18:02
100005 | 3-Apr-17 | 8:33 | 18:11
100006 | 3-Apr-17 | 8:37 | 18:05
100007 | 3-Apr-17 | 8:55 | 20:05
100008 | 3-Apr-17 | 8:23 | 18:34
100009 | 3-Apr-17 | 13:16 | 15:48
100010 | 3-Apr-17 | 13:03 | 15:11
100011 | 3-Apr-17 | 13:28 | 15:25
100012 | 3-Apr-17 | 13:18 | 15:32
100013 | 3-Apr-17 | 17:55 | 7:38
100014 | 3-Apr-17 | 18:04 | 7:32
100015 | 3-Apr-17 | 17:57 | 7:02
(etc.) | (etc.) | (etc.) | (etc.)

Note: Basically, the synthetic dataset is created based on statistics derived from the original dataset, e.g. the average number of users in the office at different time periods of the day.
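A minimal sketch of generating synthetic check-in/check-out records (Python, standard library). It assumes a deliberately simple model in which arrival times jitter around a typical start time and session lengths around a typical duration; a real generator would instead be fitted to the statistics derived from the original dataset, as described above:

```python
import random
from datetime import datetime, timedelta

def synthetic_day(n_users, date, mean_in="8:30", mean_hours=9.5, rng=None):
    """Generate n synthetic (user, date, time-in, time-out) records for one day."""
    rng = rng or random.Random()
    base = datetime.strptime(f"{date} {mean_in}", "%d-%b-%y %H:%M")
    rows = []
    for i in range(n_users):
        t_in = base + timedelta(minutes=rng.gauss(0, 20))            # arrival jitter
        t_out = t_in + timedelta(hours=max(0.5, rng.gauss(mean_hours, 1)))
        rows.append((100001 + i, date, t_in.strftime("%H:%M"), t_out.strftime("%H:%M")))
    return rows

for row in synthetic_day(5, "3-Apr-17"):
    print(row)
```

Because the records are drawn from a model rather than copied, no synthetic row corresponds to a real user; per 13.6, any outliers should be injected deliberately rather than inherited from the original data.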
14 Data Aggregation

14.1. Description: Converting a dataset from a list of records to summarised values.

14.2. When to use it: When individual records are not required and aggregated data is sufficient for the purpose.

14.3. How to use it: A detailed discussion of statistical measures is beyond the scope of this Guide; typical approaches include using totals or averages, etc. It might also be useful to discuss the expected utility with the data recipient and find a suitable compromise.

Other tips:

14.4. Where applicable, watch out for groups having too few records after performing aggregation. E.g. in the example below, if the aggregated data included a single record in any of the categories, it could be easy for someone with some additional knowledge to identify a donor.

14.5. Hence, aggregation may need to be applied in combination with suppression. Some attributes may need to be removed, as they contain details which cannot be aggregated, and new attributes may need to be added, e.g. to contain the newly computed aggregate values.

14.6. Example: In this example, a charity organisation has records of the donations made, as well as some information about the donors. The charity organisation assessed that aggregated data is sufficient for an external consultant to perform data analysis, and hence performs data aggregation on the original dataset.

Original dataset:

Donor | Monthly Income ($) | Amount donated in 2016 ($)
Donor A | 4000 | 210
Donor B | 4900 | 420
Donor C | 2200 | 150
Donor D | 4200 | 110
Donor E | 5500 | 260
Donor F | 2600 | 40
Donor G | 3300 | 130
Donor H | 5500 | 210
Donor I | 1600 | 380
Donor J | 3200 | 80
Donor K | 2000 | 440
Donor L | 5800 | 400
Donor M | 4600 | 390
Donor N | 1900 | 480
Donor O | 1700 | 320
Donor P | 2400 | 330
Donor Q | 4300 | 390
Donor R | 2300 | 260
Donor S | 3500 | 80
Donor T | 1700 | 290

Anonymised dataset:

Monthly Income ($) | No. of Donations Received (2016) | Sum of Amount Donated in 2016 ($)
1000-1999 | 4 | 1470
2000-2999 | 5 | 1220
3000-3999 | 3 | 290
4000-4999 | 5 | 1520
5000-6000 | 3 | 870
Grand Total | 20 | 5370
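A minimal sketch of the aggregation above (Python, standard library; the income bands are those of the example):

```python
from collections import defaultdict

# (monthly income, amount donated) pairs; abbreviated from the example dataset.
donors = [(4000, 210), (4900, 420), (2200, 150), (4200, 110), (5500, 260)]

def income_band(income: int) -> str:
    if income >= 5000:
        return "5000-6000"              # top-coded band, as in the example
    lower = (income // 1000) * 1000
    return f"{lower}-{lower + 999}"

counts = defaultdict(int)
sums = defaultdict(int)
for income, donated in donors:
    band = income_band(income)
    counts[band] += 1                   # no. of donations per band
    sums[band] += donated               # sum donated per band

for band in sorted(counts):
    print(band, counts[band], sums[band])
```

Per 14.4, before release, each band should be checked for having too few contributing records (and suppressed or merged if necessary).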
PART 4: PUTTING IT TOGETHER

15 Anonymisation Methodology

15.1. While Part 3 of this Guide focussed on the various basic anonymisation techniques, anonymisation requires more than just applying the appropriate technique(s). Part 4 looks at the bigger picture and discusses what else needs to be considered. Please note that this description focusses mainly on non-public release; public release models may need additional and more detailed considerations.

15.2. The following is a suggested methodology for performing anonymisation:

1) Determine the release model. This refers to how the anonymised dataset will be released. "Public" refers to making it available to essentially anyone; "non-public" refers to a controlled release to limited (and often, a fixed number of) known recipients. The public release model poses inherently more challenges for the anonymisation techniques.

2) Determine the acceptable re-identification risk threshold, as well as the utility intended or required. Refer to section 17 for more details. Note that it must be made clear whether the risk threshold set at this stage takes the additional controls into consideration, or reflects only the risk of the data itself.

3) Classify the data attributes. This means classifying the attributes in the dataset as direct identifiers, indirect identifiers or non-identifiers, which affects how the attributes will subsequently be processed.

4) Remove unused data attributes. In the process of anonymisation, most attributes, whether direct or indirect identifiers, usually require processing, or at least consideration, so as to become less identifying. Hence, any attribute that is clearly not required in the anonymised dataset should be suppressed.

5) Anonymise the direct and indirect identifiers. This is done by applying techniques such as those described in this Guide. Different techniques are applicable to different types of identifiers, and some techniques can (and often should) be used in combination. Outlier records should be considered for record suppression.

6) Determine the actual risk and compare it against the threshold. Refer to section 17 for more details.

7) Perform more anonymisation, if necessary. If the actual risk is higher than the threshold, "stronger" anonymisation is required, and steps 5 to 7 should be performed again with the necessary adjustments, until the actual risk is lower than the threshold.

8) Evaluate the solution. This includes examining the anonymised dataset to assess whether the utility meets the target. If the utility is insufficient, the anonymisation process may need to be redesigned, or it may need to be considered whether anonymisation is feasible for this dataset at all.

9) Determine the controls required. Controls include both technical and non-technical (e.g. legal and organisational) measures. Technical controls are further described in section 18.

10) Document the anonymisation process. The details of the anonymisation process, the parameters used and the controls should be clearly recorded for future reference. Such documentation facilitates review, maintenance, fine-tuning and audits. Note that such documentation should be kept securely, as the release of the parameters may facilitate re-identification.

16 K-anonymity – a Measure of Risk

16.1. K-anonymity (and extensions of it such as l-diversity and t-closeness) is sometimes thought of as an anonymisation technique, but it is more of a measure used to ensure that the risk threshold has not been surpassed, as part of the anonymisation methodology (see in particular step 6).

16.2. K-anonymity is not the only measure available, nor is it without limitations, but it is relatively well understood and easy to apply. Alternative methods such as differential privacy [Footnote 9] have emerged over the past few years.

Footnote 9: Differential privacy involves several concepts, including answering queries instead of providing the anonymised dataset, adding randomised noise to protect individual records, providing mathematical guarantees that the pre-defined "privacy budget" is not exceeded, etc.

16.3. Description: The k-anonymity model is used as a guideline before, and for verification after, anonymisation techniques (e.g. generalisation) have been applied, to ensure that any record's direct and/or indirect identifiers are shared by at least k-1 other records. This is the key protection provided by k-anonymity against linking attacks: at least k records are identical in the identifying attributes (thereby forming an "equivalence class" with at least k members), so it is not possible to link or single out an individual's record. An anonymised dataset may have different k-anonymity levels for different sets of indirect identifiers, but to assess the protection against linking, the lowest k is used for comparison against the threshold.

16.4. When to use it: To confirm that the anonymisation measures put in place achieve the desired threshold against linking attacks.

16.5. How to use it: First, decide on the value of k, i.e. the lowest equivalence class size to be achieved among all equivalence classes (k is basically equal to or higher than the inverse of the acceptable re-identification probability; see section 17). Generally, the higher the value of k, the harder it is for data subjects to be identified; however, utility may become lower as k increases, and more records may need to be suppressed. After the other anonymisation techniques have been applied, check that each record has at least k-1 other records with the same values in the attributes addressed by the k-anonymisation. Records in equivalence classes with fewer than k records should be considered for suppression; alternatively, more anonymisation can be done.

Other tips:

16.6. Besides generalisation and suppression, synthetic data can also be created (e.g. near the outliers) to achieve k-anonymity. These techniques (and others) can sometimes be used in combination, but note that the exact approach chosen can affect data utility. Consider the trade-offs between dropping the outliers and inserting synthetic data.

16.7. K-anonymity assumes that each record pertains to a different individual. If the same individual has multiple records (e.g. visits to a hospital on several occasions), then k will need to be higher than the number of repeat records; otherwise the records may not only be linkable, but may, despite seemingly fulfilling k-anonymity, be re-identifiable from the records.
16 K-anonymity – a measure of risk

16.1. K-anonymity (and extensions of it such as L-diversity and T-closeness) is sometimes thought of as an anonymisation technique, but it is better understood as a measure used to verify, as part of the anonymisation methodology (see in particular step 6), that the risk threshold has not been exceeded.

16.2. K-anonymity is not the only measure available, nor is it without limitations, but it is relatively well understood and easy to apply. Alternative approaches such as differential privacy9 have emerged over the past few years.

16.3. Description: The k-anonymity model is used as a guideline before anonymisation, and for verification after anonymisation techniques (e.g. generalisation) have been applied, to ensure that any record's combination of direct and/or indirect identifiers is shared by at least k-1 other records. This is the key protection k-anonymity provides against linking attacks: at least k records are identical in the identifying attributes (and thereby form an "equivalence class" with at least k members), so it is not possible to link or single out an individual's record; there are always at least k records with the same identifying attribute values. An anonymised dataset may have different k-anonymity levels for different sets of indirect identifiers, but to assess the protection against linking, the lowest k is compared against the threshold.

16.4. When to use it: to confirm that the anonymisation measures put in place achieve the desired threshold against linking attacks.

16.5. How to use it: First, decide on a value for k (generally chosen to be at least the inverse of the acceptable re-identification probability); this sets the lowest equivalence class size to be achieved across the dataset. Generally, the higher the value of k, the harder it is for data subjects to be identified; however, utility may decrease as k increases, and more records may need to be suppressed. After the other anonymisation techniques have been applied, check that each record shares the attributes addressed by the k-anonymisation with at least k-1 other records. Records in equivalence classes with fewer than k records should be considered for suppression; alternatively, more anonymisation can be done.

9 Differential privacy involves several concepts, including answering queries instead of providing the anonymised dataset, adding randomised noise to protect individual records, and providing mathematical guarantees that a pre-defined "privacy budget" is not exceeded.

Other tips:

16.6. Besides generalisation and suppression, synthetic data can also be created (e.g. near the outliers) to achieve k-anonymity. These techniques (and others) can sometimes be used in combination, but note that the exact approach chosen can affect data utility. Consider the trade-offs between dropping the outliers and inserting synthetic data.

16.7. K-anonymity assumes that each record pertains to a different individual. If the same individual has multiple records (e.g. visiting the hospital on several occasions), then k will need to be higher than the number of repeat records; otherwise the records may not only be linkable, but may, despite seemingly satisfying k-anonymity, be re-identifiable from the records themselves.

16.8. Example: In this example, the dataset contains information about people taking taxis. K = 2 is used, i.e. after anonymisation, each record should share the same identifying attributes with at least 1 other record. Note: k = 2 is used to simplify the example, but it is probably too low a value for actual data, because it means the risk of identification would be 50%. The following anonymisation techniques are used in combination; the level of granularity shown is one example that achieves the required k level.

Attribute/Step       Anonymisation technique
Age                  Generalisation (10-year intervals)
Occupation           Generalisation – e.g. both "Database Administrator" and "Programmer" are generalised to "IT"
Record suppression   Records that do not meet the 2-anonymity criteria after the anonymisation techniques (here, generalisation) have been applied are removed – e.g. the banker, who is the only such data subject.

Dataset before anonymisation:

Age   Gender   Occupation                 Average No. of Trips per Week
21    Female   Legal Counsel              15
38    Male     Data Privacy Officer       2
25    Female   Banker                     8
44    Female   Database Administrator     3
25    Female   Administrative Assistant   1
31    Male     Data Privacy Officer       5
42    Female   Programmer                 3
22    Female   Administrative Assistant   4
30    Female   Legal Counsel              2

Dataset after anonymisation of age and occupation, and suppression of the outlier (records in the same equivalence class share identical values in the identifying attributes):

Age        Gender   Occupation                 Average No. of Trips per Week
21 to 30   Female   Legal Counsel              15
31 to 40   Male     Data Privacy Officer       2
21 to 30   Female   Banker                     8   (suppressed as outlier)
41 to 50   Female   IT                         3
21 to 30   Female   Administrative Assistant   1
31 to 40   Male     Data Privacy Officer       5
41 to 50   Female   IT                         3
21 to 30   Female   Administrative Assistant   4
21 to 30   Female   Legal Counsel              2

Note: The average number of trips per week is taken here as an example of a non-identifier; there is no need to further anonymise this attribute. Also note that a k-anonymised dataset without non-identifiers or other attributes can be simplified by removing all duplicates and simply indicating the value of k.
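To make the verification concrete, the sketch below (our own illustration of the example above) applies the same generalisation to the taxi records and flags for suppression any record whose equivalence class over (age band, gender, occupation) has fewer than k = 2 members:

    from collections import Counter

    K = 2
    IT_ROLES = {"Database Administrator", "Programmer"}

    records = [  # (age, gender, occupation) from the example dataset
        (21, "Female", "Legal Counsel"), (38, "Male", "Data Privacy Officer"),
        (25, "Female", "Banker"), (44, "Female", "Database Administrator"),
        (25, "Female", "Administrative Assistant"), (31, "Male", "Data Privacy Officer"),
        (42, "Female", "Programmer"), (22, "Female", "Administrative Assistant"),
        (30, "Female", "Legal Counsel"),
    ]

    def generalise(age, gender, occupation):
        low = (age - 1) // 10 * 10 + 1                 # 10-year bands, e.g. "21 to 30"
        job = "IT" if occupation in IT_ROLES else occupation
        return (f"{low} to {low + 9}", gender, job)

    classes = Counter(generalise(*r) for r in records)
    kept = [generalise(*r) for r in records if classes[generalise(*r)] >= K]
    print(f"{len(records) - len(kept)} record(s) suppressed; "
          f"smallest remaining class has {min(Counter(kept).values())} members")

Running this suppresses exactly one record (the banker) and confirms 2-anonymity for the remaining records.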
17 Assessing the Risk of Re-Identification

17.1. There are various ways to assess the risk of re-identification, and these may involve rather complex probability computations. Refer to the reference publications in Annex B for detailed information.

17.2. This section describes a simplified model, using k-anonymity10 and making certain assumptions. The first assumption is that the release model is non-public. The second is that an attack attempts to link an individual to the anonymised dataset. The third is that the content of the anonymised data is not taken into consideration, i.e. the risk is calculated independently of what information the attacker actually has available.

10 The calculations would be different if done using, e.g., differential privacy or traditional statistical disclosure controls.

17.3. First, the risk threshold should be established. This value, reflecting a probability, ranges between 0 and 1 and represents the risk level that the organisation is willing to accept. The main factors affecting it should include the harm that could be caused to the data subject, as well as the harm to the organisation, should re-identification take place; it should also take into consideration what other controls have been put in place to mitigate risk by means other than anonymisation. The higher the potential harm, the lower the risk threshold should be. There are no hard and fast rules as to what risk threshold values should be used; the following are just examples:

Potential Harm   Risk Threshold
Low              0.2
Medium           0.1
High             0.01

17.4. In computing the actual risk, this Guide uses the "Prosecutor Risk", which assumes the adversary knows that a specific person is in the dataset and tries to establish which record in the dataset refers to that person.

17.5. The simple rule for calculating the probability of re-identification of a single record in a dataset is to take the inverse of the size of that record's equivalence class, i.e.

P (link individual to a single record) = 1 / record's equivalence class size

17.6. To compute the probability of re-identification of any record in the entire dataset, again given that there is a re-identification attempt, a conservative approach is to equate it to the maximum probability of re-identification among all records in the dataset:

P (re-ID any record in dataset) = 1 / min. equivalence class size in dataset

Note: if the dataset has been k-anonymised, P (re-ID any record in dataset) <= 1 / k.

17.7. We can consider 3 re-identification attack scenarios: (1) the deliberate insider attack; (2) inadvertent recognition by an acquaintance; and (3) a data breach.

P (re-ID) = P (re-ID | re-ID attempt) x P (re-ID attempt)

where P (re-ID | re-ID attempt) refers to the probability of successful re-identification, given that there is a re-identification attempt. As discussed above, we can take P (re-ID | re-ID attempt) to be 1 / min. equivalence class size in dataset. Therefore,

P (re-ID) = (1 / min. equivalence class size in dataset) x P (re-ID attempt)

17.8. For scenario #1 – the deliberate insider attack – we assume a party receiving the dataset attempts re-identification. To estimate P (re-ID attempt), i.e. the probability of a re-identification attempt, factors that can be considered include the extent of the mitigating controls put in place as well as the motives and capacity of the adversary. The following table presents example values; again, it is for the party anonymising the dataset to decide on suitable values.

P (re-ID attempt) for scenario #1 – the deliberate insider attack

                                Motivation and Resources of Adversary
Extent of Mitigating Controls   Low     Medium   High
High                            0.03    0.05     0.1
Medium                          0.2     0.25     0.3
Low                             0.4     0.5      0.6
None                            1.0     1.0      1.0

Factors affecting the motivation and resources of the adversary may include:
 Willingness to violate a contract (assuming a contract preventing re-identification is in place)
 Financial and time constraints
 Inclusion of high-profile personalities (e.g. celebrities) in the dataset

Factors affecting the extent of mitigating controls include:
 Organisational structures
 Administrative controls (e.g. contracts)
 Technical and physical measures (refer to section 18)
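As a hypothetical worked example of paragraphs 17.5 to 17.8 (the table values are the examples given above, and the chosen cell is an assumption of this sketch):

    # P(re-ID) = P(re-ID | attempt) x P(attempt), per paragraph 17.7.
    ATTEMPT_PROB = {  # (extent of mitigating controls, adversary motivation/resources)
        ("high", "low"): 0.03, ("high", "medium"): 0.05, ("high", "high"): 0.1,
        ("medium", "low"): 0.2, ("medium", "medium"): 0.25, ("medium", "high"): 0.3,
        ("low", "low"): 0.4, ("low", "medium"): 0.5, ("low", "high"): 0.6,
        ("none", "low"): 1.0, ("none", "medium"): 1.0, ("none", "high"): 1.0,
    }

    min_class_size = 2                          # e.g. a 2-anonymised dataset
    p_attempt = ATTEMPT_PROB[("medium", "medium")]
    p_reid = (1 / min_class_size) * p_attempt
    print(f"P(re-ID) = {p_reid}")               # 0.5 x 0.25 = 0.125

With medium controls and a medium-motivation adversary, this 2-anonymised dataset would exceed a "medium harm" threshold of 0.1, signalling that more anonymisation or stronger controls are needed.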
17.9. For scenario #2 – inadvertent recognition by an acquaintance – we assume a party receiving the dataset inadvertently re-identifies a data subject while examining the dataset. This is possible because the party has some additional knowledge about the data subject owing to their relationship (e.g. friend, neighbour, relative, colleague). To estimate P (re-ID attempt), the main factor to consider is the likelihood that the data recipient knows someone in the dataset.

17.10. For scenario #3 – a data breach occurring at the data recipient's ICT system – the probability can be estimated from available statistics on the prevalence of data breaches in the data recipient's industry, on the assumption that attackers who obtain the dataset will attempt re-identification:

P (re-ID attempt) = P (data breach in data recipient's industry)

17.11. The highest probability among the 3 scenarios should be used as P (re-ID attempt):

P (re-ID attempt) = max ( P (deliberate insider attack), P (inadvertent recognition by an acquaintance), P (data breach) )

17.12. Putting everything together:

P (re-ID) = (1 / min. equivalence class size in dataset) x P (re-ID attempt)
          = (1 / k) x P (re-ID attempt) for a k-anonymised dataset

where P (re-ID attempt) = max ( P (deliberate insider attack), P (inadvertent recognition by an acquaintance), P (data breach) ).
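Paragraphs 17.11 and 17.12 can be condensed into a short sketch; the three scenario probabilities below are hypothetical inputs that an organisation would estimate for itself:

    def overall_reid_risk(k, p_insider, p_acquaintance, p_breach):
        """P(re-ID) = (1 / k) x the highest of the three attempt probabilities."""
        return (1 / k) * max(p_insider, p_acquaintance, p_breach)

    risk = overall_reid_risk(k=5, p_insider=0.25, p_acquaintance=0.1, p_breach=0.14)
    threshold = 0.1                 # e.g. "medium" potential harm (paragraph 17.3)
    print(f"P(re-ID) = {risk:.3f}; within threshold: {risk <= threshold}")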
18 Technical Controls

18.1. This section discusses technical controls that can be implemented to further reduce the risk of re-identification after anonymisation. The controls may or may not be suitable, depending on the policy governing access to the anonymised dataset. Where relevant, these controls can generally be implemented in combination with one another. Note that some of them are only effective provided the anonymised dataset with high residual risk is not handed over to the recipient; once that has been done, technical controls are typically no longer possible. Also note that a risk-based approach should be taken; hence, the controls discussed in this section are for consideration and are not mandatory.

18.2. Revocable access – with records of access granted, it may be possible, depending on the type of technical control used, to revoke access where deemed necessary. Typically, this is easier to implement where only online access to the dataset is given.

18.3. Query only – allowing queries to be made instead of providing direct access to the dataset. An even more secure mode is for each query to be vetted by a curator who assesses whether the specific query should be granted at all.

18.4. Limit recipients – this is often done by implementing user authentication and authorisation where online access to the dataset or to queries is given, or password protection or encryption where offline access to the dataset is given.

18.5. Digital Rights Management (DRM) controls – this is usually done by providing online access while implementing additional controls, such as not allowing the user to save or print the data. Note that there are limitations, such as not being able to prevent photographs being taken of the on-screen data.

18.6. On-site access – requiring the user to be physically present at the site where access to the dataset, or access to perform queries, is provided. The additional security comes from being able to control what the user does with the data, e.g. to disallow even photographs being taken of the on-screen data. Additional security measures taken at the site could include providing no network/internet connection, allowing no phones or external computers, CCTV surveillance, etc.

18.7. Providing only a subset of the anonymised dataset – the subset can also be randomly selected and/or perturbed.

18.8. Physical measures – the above measures mostly relate to controlling access to the anonymised data in digital form. Physical measures apply too; examples include restricting physical access to devices or storage media containing, or able to access, anonymised data, as well as restricting access to printouts containing anonymised data.

19 Governance

19.1. The methodology given in section 15 describes the steps required to methodically anonymise a dataset. However, responsible anonymisation does not end there. Note that a risk-based approach should be taken; hence, the suggestions discussed in this section are for consideration and are not mandatory.

19.2. After the release of an anonymised dataset, proper governance relating to the anonymised dataset is required even for non-public release. This may include the following:

 Keeping track of anonymised datasets released by the organisation. Details include the recipients and the method of access (e.g. providing a copy of the anonymised dataset, online access, physical access, query only, etc.). This includes different variants/subsets of datasets, as well as datasets released by different parts of the organisation; in both scenarios, combining different datasets can lead to re-identification.

 Management of keys and mapping tables – some anonymisation techniques, including pseudonymisation, require the use of encryption keys, mapping tables, etc. It is crucial for these to be properly managed and securely kept, as any party that gets hold of them can immediately undo the anonymisation (a brief illustration follows this list).

 Regularly reviewing the re-identification risk and the controls put in place.

 Conducting audits on data recipients, to ensure they are complying with contractual requirements.

 Notifying the relevant parties if a data breach occurs11.

 Keeping track of compliance requirements and best practices regarding data anonymisation.

11 Refer to PDPC's Guide to Managing Data Breaches.
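To illustrate why keys and mapping tables deserve this level of care, the sketch below derives pseudonyms with a keyed hash (HMAC-SHA256 from Python's standard library; the identifier shown is fictitious). Anyone holding the key can regenerate the full mapping, which is precisely why the key must be stored separately from the released dataset:

    import hmac, hashlib, secrets

    secret_key = secrets.token_bytes(32)    # in practice: held in a secure key store,
                                            # never released alongside the dataset

    def pseudonymise(identifier: str) -> str:
        """Keyed-hash pseudonym; reproducible only by a holder of the key."""
        return hmac.new(secret_key, identifier.encode(), hashlib.sha256).hexdigest()[:12]

    print(pseudonymise("S1234567A"))        # fictitious identifier; same input and
                                            # same key always yield the same pseudonym

If reversibility is required (e.g. to update records later), a separately stored mapping table serves the same purpose, with the same need for secure keeping.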
20 Acknowledgements

20.1. In developing this Guide, best practices from personal data protection agencies and other authorities in other countries were considered. Refer to Annex B for the guides that were referenced.

20.2. We would like to express our appreciation to the following organisations for their valuable input in the development of this Guide:

 Agency for Integrated Care (AIC)
 Changi General Hospital (CGH)
 Cyber Security Agency of Singapore (CSA)
 Government Technology Agency (GovTech)
 Institute of Mental Health (IMH)
 Integrated Health Information Systems Pte Ltd (IHIS)
 JurongHealth
 Khoo Teck Puat Hospital
 KK Women's & Children's Hospital (KKH)
 Ministry of Health (MOH)
 MOH Holdings Pte Ltd (MOHH)
 Nanyang Technological University (NTU)
 National Cancer Centre Singapore
 National Dental Centre Singapore
 National Healthcare Group (NHG)
 National Healthcare Group Polyclinics (NHGP)
 National Heart Centre Singapore
 National Neuroscience Institute
 National Skin Centre
 National University Health System (NUHS)
 National University Hospital (NUH)
 National University of Singapore (NUS)
 National University Polyclinics (NUP)
 Privitar Ltd
 Saw Swee Hock School of Public Health
 Sengkang Health
 Singapore Department of Statistics (SingStat)
 Singapore General Hospital (SGH)
 Singapore Health Services (Singhealth HQ)
 Singapore Management University (SMU)
 Singapore National Eye Centre
 Singhealth Polyclinics
 Tan Tock Seng Hospital (TTSH)

END OF DOCUMENT

Annex A: Summary of Anonymisation Techniques

Attribute suppression
  When to use: the attribute is not required in the anonymised dataset.
  Attribute type: all.

Record suppression
  When to use: presence of outlier records.
  Attribute type: N.A. (applies across the entire record, hence all attributes are affected).

Character masking
  When to use: masking some characters in an attribute provides sufficient anonymity.
  Attribute type: direct identifier.

Pseudonymisation
  When to use: records still need to be distinguished from each other in the anonymised dataset, but no part of the original attribute value can be retained.
  Attribute type: direct identifier.

Generalisation
  When to use: attributes can be modified to be less precise but still be useful.
  Attribute type: all.

Swapping
  When to use: no need for analysis of relationships between attributes at the record level.
  Attribute type: all.

Data perturbation
  When to use: slight modifications to the attributes are acceptable.
  Attribute type: indirect identifier.

Synthetic data
  When to use: large amounts of made-up data similar in nature to the original data are required, e.g. for system testing.
  Attribute type: all.

Data aggregation
  When to use: individual records are not required and aggregated data is sufficient.
  Attribute type: indirect identifiers.

Annex B: Main References

 "Advisory Guidelines on Key Concepts in the PDPA" (Chapter 5 – Personal Data). https://www.pdpc.gov.sg/AG. Personal Data Protection Commission (Singapore), revised 27 July 2017.
 "Advisory Guidelines on the PDPA for Selected Topics" (Chapter 3 – Anonymisation). https://www.pdpc.gov.sg/AG. Personal Data Protection Commission (Singapore), revised 28 March 2017.
 "De-identification Guidelines for Structured Data". https://www.ipc.on.ca/wp-content/uploads/2016/08/Deidentification-Guidelines-for-Structured-Data.pdf. Information and Privacy Commissioner of Ontario, June 2016.
 "Guide to Managing Data Breaches". https://www.pdpc.gov.sg/OG. Personal Data Protection Commission (Singapore), 8 May 2015.
 El Emam K. Guide to the De-Identification of Personal Health Information. CRC Press, 2013.
 "Opinion 05/2014 on Anonymisation Techniques". http://ec.europa.eu/justice/data-protection/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf. Article 29 Data Protection Working Party (European Commission), 10 April 2014.
 "Personal Data Protection Act 2012". Government Gazette. https://sso.agc.gov.sg/Act/PDPA2012. Republic of Singapore, 7 December 2012.
 S L Garfinkel. "NISTIR 8053: De-Identification of Personal Information". http://nvlpubs.nist.gov/nistpubs/ir/2015/NIST.IR.8053.pdf. National Institute of Standards and Technology (NIST), October 2015.

Copyright 2018 – Personal Data Protection Commission Singapore (PDPC)

This publication gives a general introduction to basic concepts and techniques of data anonymisation. The contents herein are not intended to be an authoritative statement of the law or a substitute for legal or other professional advice. The PDPC and its members, officers and employees shall not be responsible for any inaccuracy, error or omission in this publication, nor liable for any damage or loss of any kind arising from any use of or reliance on this publication.

The contents of this publication are protected by copyright, trademark or other forms of proprietary rights and may not be reproduced, republished or transmitted in any form or by any means, in whole or in part, without written permission.