
Preserving privacy for publishing time series data with differential privacy


DOCUMENT INFORMATION

Basic information

Title: Preserving privacy for publishing time series data with differential privacy
Author: Lại Trung Minh Đức
Supervisors: Assoc. Professor DANG TRAN KHANH, PhD. LE LAM SON
University: Ho Chi Minh City University of Technology
Major: Computer Science
Document type: Thesis
Year of publication: 2023
City: Ho Chi Minh City
Format
Number of pages: 110
File size: 2.52 MB


Structure

  • CHAPTER 1: OVERVIEW OF THE THESIS
    • 1. Background and Context
    • 2. Data Publishing and Privacy Preserving Data Publishing
    • 3. Challenges of Privacy Preserving Data Publishing (PPDP) for Time-series data
    • 4. Differential Privacy as a powerful player
    • 5. Thesis objectives
    • 6. Thesis contributions
    • 7. Thesis structure
  • CHAPTER 2: PRIVACY MODELS RESEARCH
    • 1. Attack models and notable privacy models
      • 1.1. Record linkage attack and k-Anonymity privacy model
      • 1.2. Attribute linkage attack and l-diversity and t-closeness privacy models
      • 1.3. Table linkage and δ-presence privacy model
      • 1.4. Probabilistic linkage and Differential Privacy model
    • 2. Summary
  • CHAPTER 3: THE INVESTIGATION ON DIFFERENTIAL PRIVACY
    • 1. The need for the Differential Privacy principle
      • 1.1. No need to model the attack model in detail
      • 1.2. Quantifiable privacy loss
      • 1.3. Multiple mechanisms composition
    • 2. The promise (and not promised) of Differential Privacy
      • 2.1. The promise
      • 2.2. The not promised
      • 2.3. Conclusion
    • 3. Formal definition of Differential Privacy
      • 3.1. Terms and notations
      • 3.2. Randomized algorithm
      • 3.3. ε-differential privacy
    • 4. Important concepts of Differential Privacy
      • 4.1. The sensitivity
      • 4.2. Privacy composition
      • 4.3. Post processing
    • 5. Foundation mechanisms of Differential Privacy
      • 5.1. Local Differential Privacy and Global Differential Privacy
      • 5.2. Laplace mechanism
      • 5.3. Exponential mechanism
    • 6. Notable mechanisms for Time-series data
      • 6.1. Laplace mechanism (LPA – Laplace Perturbation Algorithm)
      • 6.2. Discrete Fourier Transform (DFT) with Laplace mechanism (FPA – Fourier Perturbation Algorithm)
      • 6.3. Temporal perturbation mechanism
      • 6.4. STL-DP – Perturbed time-series by applying DFT with Laplace mechanism on trends and seasonality
  • CHAPTER 4: EXPERIMENT DESIGNS
    • 1. Experiment designs
      • 1.1. Case description
      • 1.2. Data structure aligns with data provider
      • 1.3. System alignments
      • 1.4. Concerns and constraints
    • 2. Problem analysis
      • 2.1. Revisit the GDPR related terms for data sharing
      • 2.2. Potential attack models and countermeasures
      • 2.3. Define scope of work
    • 3. Evaluation methodology
      • 3.1. Data utility
      • 3.2. Privacy metrics
      • 3.3. Evaluation process
    • 4. Privacy protection proposal
  • CHAPTER 5: EXPERIMENT IMPLEMENTATIONS
    • 1. Experiment preparation
    • 2. Data exploration analysis (EDA)
      • 2.1. Data overview
      • 2.2. Descriptive Analysis
      • 2.3. Maximum data domain estimation
    • 3. Differential Privacy mechanisms implementation
    • 4. Data perturbed evaluation
      • 4.1. RFM Analysis for dataset
      • 4.2. Forecasting trendline at categories, consumer-groups, and store level
      • 4.3. Privacy evaluation
      • 4.4. Recommendation for using Differential Privacy in data partnership use-cases
  • CHAPTER 6: CONCLUSION AND FUTURE WORKS

Content

OVERVIEW OF THE THESIS

Background and Context

Data privacy, or information privacy, is the essential right of individuals to manage how their personal data is collected, used, shared, and stored by others. This personal data includes any information that identifies or pertains to an individual, such as their name, address, phone number, email, health records, financial transactions, online activities, preferences, and opinions.

Time-series data presents distinct challenges in privacy protection due to its ability to reveal individual or group behaviors over time. This type of data, which includes weblogs, social media posts, GPS traces, and health records, can be utilized for various applications like forecasting, anomaly detection, classification, clustering, and summarization. However, the use of personal time-series data also poses significant risks if privacy breaches happen, highlighting the need for robust privacy measures.

Attackers can potentially identify users by correlating GPS traces or weblogs with publicly available data. Analyzing temporal patterns in sensor readings or social media can also lead to user identification. Additionally, monitoring changes in health records may disclose an individual's preferences and activities. These examples highlight the critical importance of protecting privacy in time-series data to reduce risks related to unauthorized access and misuse of personal information.

In the realm of collaboration, effective data sharing among researchers from various corporations is essential for conducting analyses that benefit society, such as the COVID-19 mobility data shared by Google and Apple. Although data may undergo anonymization prior to publication, there remains a considerable risk of identity re-identification, raising concerns about privacy and security.

In 2006, AOL released a dataset containing 20 million search queries from 650,000 users over three months for research purposes, anonymizing user IDs with random numbers. Despite this, researchers and journalists successfully re-identified some users by correlating their search queries with publicly available information, leading to instances of embarrassment, harassment, and legal issues for those affected.

In 2007, Netflix released an anonymized dataset of 100 million movie ratings from 480,000 users to enhance its recommendation system through a contest. Despite the removal of usernames and identifying information, researchers managed to re-identify some users by cross-referencing their ratings with data from IMDb. This raised privacy concerns, as the exposed movie preferences could potentially disclose users' political views, sexual orientation, or health conditions.

Data Publishing and Privacy Preserving Data Publishing

Data publishing involves sharing information for analysis and decision-making, but it often includes sensitive personal details like identities and health records. Without adequate safeguards, publishing this data can result in privacy violations and jeopardize individuals' rights and interests.

Privacy-preserving data publishing (PPDP) focuses on safeguarding individuals' privacy while ensuring the data remains useful for legitimate purposes. It employs various techniques to anonymize or transform data prior to publication, effectively concealing sensitive information while preserving essential patterns and statistics.

Various data types can be published, including tabular data like census and medical records, as well as graph data such as social networks and web graphs. The choice of privacy-preserving data publishing (PPDP) techniques depends on the specific data type and associated privacy requirements. Common PPDP techniques include the following:

• Generalization: This technique substitutes specific values with broader categories, such as converting exact ages into age ranges or replacing zip codes with larger geographic regions. It decreases data granularity, making it more challenging to pinpoint individual identities based on their characteristics.

• Suppression: This technique removes specific values or records entirely, such as names and phone numbers, and can also exclude outliers. It decreases the overall information within the dataset, making it more challenging to connect individuals across various data sources.

• Perturbation: This technique introduces randomness by altering original values, such as swapping record values, adjusting numerical figures slightly, or flipping binary bits. It complicates the inference of true data values through statistical analysis, enhancing data security.

• Encryption: This technique transforms the data using cryptographic methods, such as hashing, encryption, or homomorphic encryption. Encryption protects the confidentiality of the data and makes it harder to access the original values without a secret key.

Challenges of Privacy Preserving Data Publishing (PPDP) for Time-series

PPDP (Privacy-Preserving Data Publishing) focuses on safeguarding the privacy of individuals and organizations while enabling insightful data analysis on published datasets. Time-series data, which includes records of variable values over time like sensor readings, stock prices, and health records, presents unique challenges for PPDP:

• Correlations: Time-series data frequently display significant temporal or spatial correlations, indicating that the values of certain attributes are influenced by other attributes or prior time points. This interconnectedness poses a risk of privacy violations, as adversaries may leverage these correlations to deduce sensitive information from the released data.

• Dynamics: Time-series data is often updated or streamed continuously, which requires efficient and adaptive PPDP methods that can handle dynamic changes and updates without compromising privacy or utility

Differential Privacy as a powerful player

To address the challenges above, many researchers nowadays accept and explore a principle called Differential Privacy, introduced in 2006 by Professor Cynthia Dwork (adapted from [7], [11], [16], [21], [22]).

Differential privacy ensures that data holders can guarantee individuals that their data will remain confidential, regardless of other available information. This means that using a person's data in research or analysis will not impact their privacy, either positively or negatively. The best differentially private database systems allow for the secure sharing of sensitive data for effective analysis without requiring clean data rooms, data usage agreements, or restricted access.

Differential privacy ensures that conclusions, such as the link between smoking and cancer, remain consistent whether individuals participate in data collection or not. It provides a guarantee that every output sequence, which includes responses to queries, maintains this level of privacy.

Probabilities are determined by random choices made by the privacy mechanism set by the data curator, ensuring that outcomes are essentially equally likely to occur regardless of an individual's presence or absence. The qualifier "essentially" is captured by the privacy parameter ε.

Differential privacy can be achieved by adding carefully calibrated noise to the original data or to the output of a data analysis function.

Thesis objectives

• To review the traditional Privacy Preserving Data Publishing methods and the efforts for time-series data

• To understand the theories and principles of Differential Privacy

• To explore notable mechanisms in Differential Privacy for time-series data

• To explore and solve the privacy use-case in the data partnership with Differential Privacy and other techniques. Then, build the process to apply privacy techniques to business collaboration.

Thesis contributions

• An effort to make differential privacy easier to understand, especially for non-academic and corporate audiences

• A suggested guideline to apply and evaluate differential privacy on time-series data in the use-case of data collaboration between multiple parties.

Thesis structure

Chapter 1: Background and context (this chapter) – summarizes the purpose of data publishing, the purpose of privacy protection for released data, the challenges of time-series data, and briefly introduces the Differential Privacy principle.

Chapter 2: Literature review – summarizes and compares privacy attack models with notable algorithms, briefly analyses their weaknesses and their connection to Differential Privacy.

Chapter 3: Investigation on Differential Privacy – explores the mathematical concepts, theories, foundational mechanisms, and related techniques for time-series data.

Chapter 4: Experiment designs – based on a use-case synthesized from a real-world problem, analyses and proposes a process and solution to protect the privacy of individuals in the data partnership process.

Chapter 5: Experiment implementations – based on the proposal in Chapter 4, utilizes knowledge of Differential Privacy techniques and Data Analytics to implement the related requirements.

Chapter 6: Conclusion and Future works

PRIVACY MODELS RESEARCH

Attack models and notable privacy models

1.1 Record linkage attack and k-Anonymity privacy model

Record linkage attacks are privacy breaches that leverage the connection of records from various data sources to uncover sensitive individual information. These attacks often involve pinpointing records in distinct datasets that relate to the same person and merging them to gain further insights. Typically, these attacks utilize quasi-identifiers (QIDs) like age, gender, or ZIP code to effectively link records associated with the same individual across multiple datasets.

According to research by Sweeney [9], using only a ZIP code, date of birth, and gender, there is an 87% likelihood of uniquely identifying individuals in the United States, indicating a significant risk of personal identification.

The k-anonymity method, proposed by Sweeney, addresses privacy concerns by ensuring that if a record in a dataset has a specific quasi-identifier (QID) value, at least k-1 other records share the same value. This approach makes each record indistinguishable from a minimum of k-1 others regarding the QID, which reduces the likelihood of linking a victim to a particular record to a maximum probability of 1/k. To achieve this, k-anonymity employs data generalization and suppression techniques, ensuring that every combination of values for quasi-identifiers can be matched indistinguishably to at least k individuals within an equivalence class.

K-anonymity effectively safeguards against identity disclosure by ensuring that individuals cannot be uniquely identified within a database. However, it does not address the issue of attribute disclosure, which happens when sensitive attributes of an individual can be revealed without needing to pinpoint their exact record.

Figure 1 A fictitious database with hospital data, taken from [14]

1.2 Attribute linkage attack and l-diversity and t-closeness privacy model

The attribute linkage attack allows an attacker to deduce sensitive information about a victim by leveraging the sensitive values of the group to which the victim belongs, even without direct access to the victim's record. K-anonymized databases are vulnerable to attribute disclosure through two main types of attacks: the homogeneity attack, which occurs when sensitive attributes are not diverse, and the background knowledge attack, which relies on the attacker having partial knowledge about an individual or the distribution of sensitive and non-sensitive attributes within a population.

In 2006, Machanavajjhala introduced the l-diversity method to combat attribute linkage attacks, focusing on maintaining diversity in sensitive attributes across equivalence classes. A database achieves l-diversity when each equivalence class contains at least l distinct values for the sensitive attribute. To compromise l-diversity, an adversary must have (l-1) pieces of background knowledge about the individuals in the database.

A significant limitation of l-diversity is that it can be difficult or unnecessary to achieve in certain scenarios. For instance, in a dataset of 10,000 records where 99% belong to class A and only 1% to class B, the sensitivity of the attribute values differs: individuals in class A (e.g., those tested negative) may not be concerned about disclosure, making l-diversity redundant for equivalence classes containing only A-entries. Achieving 2-diversity in this scenario would also result in at most 100 equivalence classes, leading to substantial information loss.

Both l-diversity and k-anonymity fail to adequately safeguard against attribute disclosure due to potential skewness and similarity attacks. A skewness attack can occur when the distribution of sensitive attributes is uneven, such as a 99% to 1% ratio, increasing the risk of misclassification within an equivalence class. For example, even if a database maintains 2-diversity, the likelihood of an individual being identified as positive could rise from 1% to 50%, jeopardizing their privacy. Similarly, similarity attacks arise when sensitive attribute values, although distinct, are semantically related. For instance, in a hospital setting, an equivalence class containing pulmonary cancer, breast cancer, and brain cancer may lead to the inference that an individual has cancer, despite the differences in specific diagnoses.

To address the limitations of l-diversity, particularly against similarity and skewness attacks, Li et al. introduced t-closeness, a more rigorous privacy model. This model mandates that the distribution of sensitive attribute values within each equivalence class must be t-close to the overall distribution in the dataset. By doing so, it prevents attackers from inferring the sensitive attribute values of individual records based solely on the distribution within their respective equivalence classes.

T-closeness guarantees that the distribution of sensitive attribute values within each equivalence class closely resembles the overall distribution in the dataset, thereby complicating an attacker's ability to deduce the sensitive attribute value of any individual record.

While t-closeness offers superior protection compared to l-diversity, it presents significant drawbacks, including high computational costs and substantial loss of data utility. As a result, in the age of big data, both l-diversity and t-closeness are rarely regarded as best practices.

1.3 Table linkage and δ-presence privacy model

Record linkage and attribute linkage generally operate under the assumption that an attacker knows the victim's record is included in a published dataset. However, this assumption may not hold true in all cases, as the presence or absence of a victim's record can itself expose sensitive information. For example, if a medical facility publishes data related to a specific illness, the inclusion of a victim's record could lead to harmful repercussions. Table linkage occurs when an attacker can accurately determine whether the victim's record is present or absent in the published data.

In 2007, Nergiz introduced the delta-presence (δ-presence) approach to address the table linkage problem, which involves constraining the probability of inferring a victim's data presence within a specific range, δ. This method effectively limits an attacker's confidence to δ%, thereby reducing the likelihood of successfully linking to the victim's record and sensitive attributes. As a result, the δ-presence approach serves as an indirect measure to mitigate risks related to record and attribute linkages.

The δ-presence approach offers a secure privacy model, but it is based on the assumption that the data publisher has access to the same external table as the attacker, a premise that may not hold true in real-world scenarios.

1.4 Probabilistic linkage and Differential Privacy model

A unique category of privacy models aims to shift an attacker's probabilistic beliefs about a victim's sensitive information rather than pinpointing specific records or attributes linked to an individual. These models prioritize the uninformative principle, which seeks to reduce the gap between an attacker's prior and posterior beliefs after accessing published data.

Summary

Data privacy protection principles are essential, yet no single solution can ensure complete privacy. k-Anonymity was created to prevent record linkage, but it falls short against attribute linkage, so l-Diversity was introduced, followed by the stricter t-Closeness. To mitigate risks of exposing an individual's presence in a dataset, δ-presence was developed. Additionally, Cynthia Dwork contributed to privacy solutions through Differential Privacy, a mathematical framework that uses measurable noise to address privacy issues effectively.

Based on extensive research ([11], [21], [25]), Differential Privacy offers remarkable advantages, as summarized below:

Differential Privacy offers robust protection against a variety of attack models without needing intricate modeling of specific threats. It ensures privacy is maintained regardless of the attacker's objectives or prior knowledge, effectively safeguarding sensitive information.

Differential Privacy offers a quantifiable approach to privacy protection by determining the necessary amount of noise to add to data, ensuring a balance between privacy and data utility. This technique builds on long-standing practices like Statistical Disclosure Control, which have been in use since the 1970s.

• Composition of multiple mechanisms: With strong mathematical proofs, the "composition" technique in Differential Privacy enables the combination of multiple privacy algorithms, resulting in a more robust and reasonable solution.

Consequently, the forthcoming chapter will focus on the exploration and investigation of Differential Privacy, including its application in real-world scenarios.

THE INVESTIGATION ON DIFFERENTIAL PRIVACY

The need for Differential Privacy principle

Differential Privacy (DP) addresses the limitations of k-Anonymity and l-Diversity by effectively handling multiple attack types while offering improved performance due to its low computational complexity. This makes DP a compelling option for enhancing data privacy.

1.1 No need to model the attack model in detail

Previous privacy models relied on specific assumptions regarding the attacker's capabilities and goals. To choose the right privacy concept, it was essential to comprehend the attacker's prior knowledge, the additional data they could access, and the precise information they sought to reveal.

Implementing these definitions in practice is challenging and often leads to errors, as the attacker's intentions and capabilities may be unclear. Additionally, unforeseen attack vectors, or unknown unknowns, can threaten privacy. As a result, making definitive statements based on traditional definitions is difficult, as they depend on assumptions that cannot be guaranteed with certainty.

In contrast, the adoption of differential privacy provides two remarkable guarantees:

Differential privacy ensures the protection of all types of individual information, regardless of the attacker's intentions, such as re-identifying targets or inferring sensitive attributes. This robust framework safeguards against various threats, eliminating the necessity to focus on the specific goals of potential attackers.

Differential privacy ensures strong protection against attacks, even when the attacker has specific knowledge about individuals in the database or tries to create fake users.

Privacy guarantees extend to individuals whom the attacker is unfamiliar with, ensuring their protection regardless of the attacker's level of knowledge.

By embracing differential privacy, organizations can confidently safeguard personal information across various scenarios, irrespective of the attacker's goals or knowledge about the data.

Differential privacy offers a flexible numeric parameter that can be adjusted to enhance privacy protection, setting it apart from traditional privacy concepts. For instance, in k-anonymity, the parameter k indicates that each record in the dataset is similar to at least k-1 other records. However, the value of k alone does not clearly define the extent of privacy protection provided.

The connection between the value of k and the true privacy of a dataset is weak and often subjective, lacking formal justification. This challenge is amplified when considering other traditional privacy definitions.

Differential privacy provides a robust framework for quantifying the maximum information an attacker can gain, represented by the parameter ε. For example, if ε equals 1.1, it indicates that an attacker with an initial 50% belief in a target's presence can only increase their certainty to a maximum of about 75%. Although determining the exact value of ε can be challenging, it allows for formal reasoning and interpretation, enhancing the understanding of privacy risks in data sets.
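To make the 75% figure concrete, here is a minimal sketch (not taken from the thesis) of the standard Bayesian bound implied by ε-differential privacy: an attacker's posterior belief can exceed their prior by at most a factor governed by e^ε.

```python
import math

def max_posterior(prior: float, epsilon: float) -> float:
    """Upper bound on an attacker's posterior belief after observing the
    output of an epsilon-DP mechanism, given their prior belief."""
    return (prior * math.exp(epsilon)) / (prior * math.exp(epsilon) + (1 - prior))

# Example from the text: prior belief 50%, epsilon = 1.1
print(round(max_posterior(0.5, 1.1), 3))  # ~0.75, i.e. at most ~75% certainty
```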

Differential privacy offers remarkable flexibility, allowing modifications to its foundational statements. For instance, one can substitute "their target is in the dataset" with any individual-related assertion, and the phrase "no matter what the attacker knows" can be added for enhanced precision. These features collectively strengthen differential privacy compared to previous definitions, particularly in attack modeling and quantifying privacy guarantees.

In a situation where you have a dataset to share with two trusted individuals, Alice and Bob, you can provide them with different versions of the anonymized data while maintaining the same privacy standards. Although you trust both equally, their unique interests in the data necessitate tailored versions of the dataset for each individual.

A significant privacy risk emerges when Alice and Bob collaborate to compare their received data, as this could compromise k-anonymity. In many privacy frameworks, merging two k-anonymous datasets does not ensure the preservation of anonymity, potentially allowing for the reidentification of individuals or even the full reconstruction of original data.

Differential privacy safeguards data even when multiple parties, like Alice and Bob, combine their datasets. By applying a privacy parameter ε to each instance of differentially private data, the overall privacy remains intact, although it becomes weaker, represented by a new parameter of 2ε. While some information can still be extracted from their collaboration, the potential information gain is quantifiable, a characteristic referred to as composition.

The composition scenario, while it may seem unlikely in practice, is essential for organizations that manage various data applications, including publishing statistics, releasing anonymized data, and training machine learning algorithms. It allows organizations to effectively control risk as new use cases and processes develop, establishing a strong framework for privacy management.

The promise (and not promised) of Differential Privacy

Adapted from [6]: Differential Privacy: A Primer for a Non-technical Audience (2017), by Professor Kobbi Nissim and his team.

Researchers conducted a survey to investigate the link between socioeconomic status and medical outcomes in various U.S. cities. Participants, including John, completed a questionnaire regarding their living conditions, finances, and medical histories. John is particularly concerned about the potential re-identification of individuals from de-identified data, fearing that sensitive information, such as his HIV status and income, could be exposed. This breach of privacy could negatively impact his life insurance premiums or mortgage applications in the future.

Differential privacy seeks to safeguard John's privacy in real-world situations by replicating the protections he would have if he opted out of data sharing. Consequently, the insights gained about John through differentially private computations are fundamentally restricted to what could be inferred from the data of others, without incorporating John's own data into the analysis.

The guarantee of differential privacy extends not only to John but to all individuals who contribute their information to the analysis. While a detailed mathematical definition of this guarantee involves complex technical concepts, this document aims to present intuitive examples that clarify various aspects of differential privacy in an accessible manner.

Alice, a friend of John, is aware of his habit of drinking several glasses of red wine with dinner. Upon discovering a medical research study that indicates a positive correlation between red wine consumption and a specific type of cancer, she may deduce that John faces an increased risk of developing cancer due to his drinking habits.

The publication of the medical research study results may initially appear to have caused a privacy breach, as it allowed Alice to deduce John's elevated cancer risk. However, it is important to note that Alice could have inferred this information even if John had not participated in the study, highlighting that this risk is present in both scenarios. Thus, the potential for such inferences exists for everyone, regardless of their participation in the study or the sharing of personal data.

Differential Privacy is a powerful tool for protecting privacy when sharing datasets, yet it is important to recognize that some inference scenarios pose risks even without the published data; these scenarios lie outside the scope of what Differential Privacy addresses. What it does guarantee is that an attacker's beliefs about a target change by (almost) no more because of the target's participation in the dataset, irrespective of the attacker's prior knowledge or access to the data.

Formal definition of Differential Privacy

Data curator: A data curator manages the collected data throughout its life cycle.

Data management encompasses various processes such as data sanitization, annotation, publication, and presentation, all aimed at ensuring reliable data reuse and preservation. In the context of data protection, the data curator plays a crucial role in safeguarding the privacy of individuals represented in the data, ensuring that their confidentiality is maintained.

Adversary: The adversary represents a data analyst who is interested in finding out sensitive information about the individuals represented in the data. In the realm of data privacy, even legitimate users of a database can be considered adversaries, as their analyses may compromise the privacy of individuals within the dataset.

L1-norm: The L1-norm of a database D is denoted by ||D||₁. It measures the size of D (e.g., the number of records it contains) and can be defined as ||D||₁ = Σᵢ |Dᵢ|.

L1-distance: The L1-distance between two databases D₁ and D₂ is ||D₁ − D₂||₁. It measures how many records differ between the two databases and can be defined as ||D₁ − D₂||₁ = Σᵢ |D₁,ᵢ − D₂,ᵢ|.

Neighboring databases: Two databases D₁ and D₂ are called neighboring if they differ in at most ONE element. This can be expressed as ||D₁ − D₂||₁ ≤ 1.

Differential Privacy is an abstract principle that requires a specific mechanism or algorithm to effectively implement its mathematical guarantees. This mechanism is responsible for releasing statistical information about a dataset while ensuring privacy protection.

Differential privacy refers to a characteristic of certain randomized algorithms, particularly those with a discrete probabilistic space. To understand this concept, it is essential to first define what a randomized algorithm is, utilizing the concept of a probability simplex for formalization.

Definition 3.1 (Probability simplex). Given a discrete set B, the probability simplex over B, denoted Δ(B), is the set: Δ(B) = { x ∈ ℝ^|B| : xᵢ ≥ 0 for all i, and Σᵢ xᵢ = 1 }.

Definition 3.2 (Randomized algorithm). A randomized algorithm M with domain A and range B is an algorithm associated with a total map M: A → Δ(B). On input a ∈ A, the algorithm M outputs M(a) = b with probability (M(a))_b for each b ∈ B. The probability space is over the coin flips of the algorithm M.

A randomized algorithm is defined as a deterministic algorithm that utilizes two inputs: a dataset and a string of random bits. This concept is closely linked to differential privacy, which pertains to the probability linked to the algorithm's randomness while the dataset remains unchanged. A key element of this definition is that the probability space is derived from the algorithm M's coin flips, emphasizing its significance as the source of randomness.

Definition 3.3 (ε-differential privacy). Let ε > 0. A randomized function M is ε-differentially private if for all neighboring input datasets D₁ and D₂ differing on at most one element, and for all S ⊆ Range(M), we have:

Pr[M(D₁) ∈ S] ≤ e^ε · Pr[M(D₂) ∈ S],

where the probability is taken over the coin tosses of M.

The definition also implies a lower bound, since D₁ and D₂ can be interchanged: Pr[M(D₁) ∈ S] ≥ e^(−ε) · Pr[M(D₂) ∈ S].

We can also express the constraint using the natural logarithm: |ln( Pr[M(D₁) ∈ S] / Pr[M(D₂) ∈ S] )| ≤ ε.

Working with e^ε, and thus logarithmic probabilities, also has other advantages in practical applications with computers (adapted from [25]):

• Computation speed: The product of two probabilities corresponds to an addition in logarithmic space, and multiplication is computationally more expensive than addition.

• Numerical stability: Using logarithmic probabilities enhances numerical stability when dealing with very small probabilities, as it mitigates rounding errors that can occur with standard probabilities due to computer approximations of real numbers.

• Exponential distributions: Many probability distributions, particularly those from which random noise is drawn, exhibit an exponential form. By applying the logarithm to these distributions, the exponential function is removed, allowing calculations to focus solely on the exponent.

Differential Privacy (DP) requires stringent privacy measures for data protection, but excessive noise addition can significantly reduce the data's informational value. To address this challenge, various relaxed versions of DP have been introduced, with one of the most widely recognized being (ε, δ)-Differential Privacy.

Define a randomized function M to be (ε, δ)-differentially private if for all neighboring input datasets D₁ and D₂ differing on at most one element, and for all S ⊆ Range(M), we have:

Pr[M(D₁) ∈ S] ≤ e^ε · Pr[M(D₂) ∈ S] + δ.

In this case the two parameters ε and δ control the level of privacy.

Pure differential privacy, characterized by δ = 0, represents the strongest form of data protection, while approximate differential privacy, with δ > 0, acknowledges a measurable risk of data leakage. The value of δ reflects the probability of such leaks, which can raise concerns in privacy definitions. However, there is no one-size-fits-all solution for privacy issues; effective risk management is essential. Thus, considering δ as a realistic risk factor allows for better scenario planning and informed decision-making in privacy strategies.

To ensure effective privacy protection, δ should be kept small. This is intuitive, as a large δ can compromise privacy even when perfect secrecy (ε = 0) is achieved; a mechanism that is (0, δ)-differentially private may still breach privacy with a high probability. A widely accepted guideline for selecting δ in a database containing n records is to keep δ smaller than 1/n.

An (ε, δ)-Differential Privacy (DP) mechanism allows for a probability δ of disclosing privacy for each record in the database. However, it is important to note that (ε, δ)-DP offers significantly weaker privacy guarantees compared to (ε, 0)-DP, even when δ is minimal relative to the database size n.

Important concepts of Differential Privacy

4.1 The sensitivity

This property aids in determining the necessary level of noise perturbation within the differential privacy mechanism, measuring how much the output of a function varies when its input is modified.

It captures the maximum difference between the query results on neighboring databases and is used in the differentially private mechanisms. The formal definition of the global sensitivity of a function f is: GS(f) = max over neighboring D₁, D₂ of ||f(D₁) − f(D₂)||₁.

The COUNT function illustrates that adding a row to a dataset can increase the query's output by a maximum of one. If the new row contains the desired attribute, the count increases by one; if it does not, the count remains unchanged. Conversely, removing a row can lead to a decrease in the count by at most one.

So far, we have focused solely on global sensitivity as a metric. Our definition involves comparing any two neighboring datasets, but this method may be excessively cautious. In practice, when applying our differential privacy mechanisms to a real dataset, it is more pertinent to examine the specific neighbors of that dataset.

Local sensitivity is a crucial concept that evaluates a function's response to changes in a specific dataset, taking into account its neighboring datasets. Unlike global sensitivity, which provides a broader perspective, local sensitivity focuses on the particular dataset being queried, making it essential to consider the context of that dataset when discussing its sensitivity.

Formally, the local sensitivity is defined as: LS(f, D) = max over D' neighboring D of ||f(D) − f(D')||₁.

Local sensitivity provides an effective method for determining finite bounds on the sensitivity of specific functions, particularly when establishing their global sensitivity proves difficult. A prime example of this is the mean function.

We calculated differentially private means by splitting the query into two distinct components: a differentially private sum for the numerator and a differentially private count for the denominator. By utilizing sequential composition and post-processing, which will be discussed later, we ensure that the resulting quotient maintains differential privacy.

The effectiveness of a mean query's output is influenced by the size of the dataset, particularly when rows are added or removed. Assuming the worst-case scenario of a dataset with just one entry to bound the global sensitivity of a mean query is overly pessimistic for larger datasets. Therefore, utilizing the "noisy sum over noisy count" methodology proves to be a more beneficial approach.
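As an illustration of the "noisy sum over noisy count" idea, here is a minimal sketch (not taken from the thesis) assuming values are clipped to a known range [0, upper_bound], so the sum's global sensitivity is bounded by upper_bound and the count's sensitivity is 1; the privacy budget is split evenly between the two sub-queries.

```python
import numpy as np

def dp_mean(values: np.ndarray, upper_bound: float, epsilon: float) -> float:
    """Differentially private mean via a noisy sum divided by a noisy count.

    Assumes each value is clipped to [0, upper_bound], so the sum has global
    sensitivity upper_bound and the count has global sensitivity 1.
    The budget epsilon is split evenly across the two Laplace queries.
    """
    clipped = np.clip(values, 0.0, upper_bound)
    eps_half = epsilon / 2.0
    noisy_sum = clipped.sum() + np.random.laplace(scale=upper_bound / eps_half)
    noisy_count = len(clipped) + np.random.laplace(scale=1.0 / eps_half)
    return noisy_sum / max(noisy_count, 1.0)  # post-processing guard against tiny counts

# Example: daily purchase quantities, assumed to lie in [0, 20]
quantities = np.array([3, 5, 2, 8, 1, 4], dtype=float)
print(dp_mean(quantities, upper_bound=20.0, epsilon=1.0))
```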

4.2 Privacy composition

An effective privacy strategy must consider the challenges of composition, which involves executing multiple queries on the same dataset. These queries may be independent, dependent, or interact with each other's results. Differential Privacy (DP) can manage composition within a single DP mechanism or across multiple mechanisms, although the parameters (ε, δ) may deteriorate in the process. Composition can occur in two forms: sequential and parallel.

Figure 2 Visualization of how Sequential and Parallel Composition work – taken from

This helps bound the total privacy cost of releasing multiple results of differentially private mechanisms on the same input data. Formally, the sequential composition theorem for differential privacy states:

• If mechanism K₁(x) satisfies ε₁-differential privacy and mechanism K₂(x) satisfies ε₂-differential privacy,

• Then G(x) = (K₁(x), K₂(x)), which releases both results, satisfies (ε₁ + ε₂)-differential privacy.

Sequential composition is essential for differential privacy as it allows algorithms to access data multiple times. This property is particularly significant when conducting various analyses on a single dataset, as it helps individuals limit the overall privacy cost associated with their participation in these analyses. While sequential composition provides an upper bound on privacy costs, the actual cost of two differentially private releases may be lower, but never exceeds this bound.
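A minimal sketch (assumptions: Laplace counting queries on the same dataset, budgets epsilon1 and epsilon2 chosen by the analyst) showing how two releases compose sequentially into a total budget of epsilon1 + epsilon2:

```python
import numpy as np

def dp_count(data: np.ndarray, predicate, epsilon: float) -> float:
    """epsilon-DP counting query: the global sensitivity of a count is 1."""
    true_count = int(np.sum([predicate(row) for row in data]))
    return true_count + np.random.laplace(scale=1.0 / epsilon)

ages = np.array([23, 35, 41, 29, 52, 64, 37])

# Two sequential releases on the SAME data: budgets add up (sequential composition).
eps1, eps2 = 0.5, 0.3
release_a = dp_count(ages, lambda a: a >= 40, eps1)   # consumes eps1
release_b = dp_count(ages, lambda a: a < 30, eps2)    # consumes eps2
total_budget = eps1 + eps2                            # the pair is (eps1 + eps2)-DP
print(release_a, release_b, total_budget)
```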

Parallel composition serves as an alternative to sequential composition for calculating the total privacy cost of multiple data releases. This method involves dividing the dataset into disjoint chunks, allowing a differentially private mechanism to operate on each chunk independently. Since the chunks are non-overlapping, each individual's data is included in only one chunk, ensuring that even with k chunks and k executions of the mechanism, each person's data is processed just once; a minimal sketch follows the theorem statement below.

• If mechanism K(x) satisfies ε-differential privacy,

• And we split dataset X into k disjoint chunks x₁, …, x_k,

• Then the mechanism which releases all of the results K(x₁), …, K(x_k) satisfies ε-differential privacy.
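A minimal sketch (assuming the dataset is partitioned by a disjoint attribute, here a store identifier) of parallel composition: the same Laplace counting query is run once per chunk, and because every individual appears in exactly one chunk, the whole release still costs only ε.

```python
import numpy as np
from collections import defaultdict

def dp_histogram(records: list[tuple[str, int]], epsilon: float) -> dict[str, float]:
    """Noisy per-store counts. The stores partition the data into disjoint
    chunks, so by parallel composition the full histogram is epsilon-DP."""
    buckets: dict[str, int] = defaultdict(int)
    for store_id, _quantity in records:
        buckets[store_id] += 1
    return {store: count + np.random.laplace(scale=1.0 / epsilon)
            for store, count in buckets.items()}

records = [("store_A", 2), ("store_B", 1), ("store_A", 5), ("store_C", 3)]
print(dp_histogram(records, epsilon=0.5))   # total privacy cost is 0.5, not 1.5
```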

4.3 Post processing

Post-processing in differential privacy ensures that arbitrary computations on the output of a differentially private mechanism are safe, as they do not compromise privacy protection. It is acceptable to enhance the output by reducing noise or refining the signal, such as replacing negative results with zeros for queries that should not yield negative outcomes. Many advanced differentially private algorithms leverage post-processing to enhance result accuracy and minimize noise; a minimal sketch follows the statement below.

• If mechanism K(X) satisfies ε-differential privacy,

• Then for any (deterministic or randomized) function g, g(K(X)) satisfies ε-differential privacy.
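A minimal sketch (reusing a hypothetical noisy count) of the post-processing property: clamping and rounding the released value is "free", i.e. it does not consume any additional privacy budget.

```python
import numpy as np

def dp_count(true_count: int, epsilon: float) -> float:
    """epsilon-DP count via the Laplace mechanism (sensitivity 1)."""
    return true_count + np.random.laplace(scale=1.0 / epsilon)

noisy = dp_count(true_count=3, epsilon=0.5)

# Post-processing: any function of the released value keeps the same epsilon.
cleaned = max(0, round(noisy))   # counts cannot be negative and must be integers
print(noisy, cleaned)
```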

Foundation mechanisms of Differential Privacy

5.1 Local Differential Privacy and Global Differential Privacy

Data privacy mechanisms can be categorized into local and global approaches. In local mechanisms, individuals add noise to their own data before sharing, as there is no trusted data curator. Conversely, global mechanisms involve data perturbation at the output stage, requiring users to place their trust in a centralized data curator.

The local approach prioritizes privacy by introducing significant noise to individual data points, making them less useful in isolation. However, when aggregated in large quantities, this noise can be minimized, allowing for meaningful analysis of the dataset. Conversely, the global approach tends to yield more accurate results since it operates on cleaner data, requiring only minimal noise addition at the conclusion of the analysis process.

Figure 3 Visualization of how Local Privacy and Global Privacy work - taken from [25]

5.2 Laplace mechanism

In this section we introduce one of the most basic mechanisms in differential privacy. The Laplace mechanism adds random noise drawn from the Laplace distribution with mean 0 and scale GS(f)/ε independently to each query response, thus making sure that every query is perturbed appropriately. To analyze the Laplace mechanism, we first need to define the Laplace distribution.

Figure 4 The Laplace mechanism with multiple scales

The Laplace mechanism: Given any function f, the Laplace mechanism is defined as:

M_L(x, f, ε) = f(x) + (Y₁, …, Y_k),

where the Yᵢ are independent and identically distributed random variables drawn from Lap(GS(f)/ε).
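A minimal sketch (assumptions: a numeric query whose global sensitivity is supplied by the caller) of the Laplace mechanism as defined above:

```python
import numpy as np

def laplace_mechanism(query_answer: np.ndarray, sensitivity: float, epsilon: float) -> np.ndarray:
    """Adds i.i.d. Laplace noise with scale GS(f)/epsilon to each coordinate
    of the true query answer, yielding an epsilon-DP release."""
    scale = sensitivity / epsilon
    return query_answer + np.random.laplace(loc=0.0, scale=scale, size=query_answer.shape)

# Example: a histogram query (counts per category) has global sensitivity 1.
true_histogram = np.array([120.0, 45.0, 8.0])
print(laplace_mechanism(true_histogram, sensitivity=1.0, epsilon=0.5))
```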

The Laplace mechanism is effective for data where additive noise has minimal impact on utility, such as in counting queries; however, there are instances where introducing noise can render the results ineffective.

In a digital goods auction scenario, a seller with an unlimited supply of items, such as digital movies, aims to determine the optimal price to maximize profits. With four potential buyers—three willing to pay $1 and one willing to pay $4.01—the seller's profit is $4 at a price of $1, but increases to $4.01 at a price of $4.01. However, setting the price at $4.02 results in zero profit, highlighting that the method for determining the optimal fixed price is sensitive to minor changes in buyers' willingness to pay. This indicates that a single bidder can significantly influence the optimal pricing strategy.

5.3 Exponential mechanism

The traditional Laplace mechanism primarily emphasizes numerical outputs while directly introducing noise to the results. However, to achieve precise answers without added noise while maintaining differential privacy, the exponential mechanism offers a viable solution by enabling the selection of the optimal element from a set while ensuring privacy is preserved.

The analyst determines the "best" element by using a scoring function that assigns scores to each element in a given set. This mechanism ensures differential privacy while aiming to maximize the score of the selected element, which may result in returning an element that does not have the highest score.

The exponential mechanism satisfies ε-differential privacy:

• The analyst selects a set ℛ of possible outputs.

• The analyst specifies a scoring function u: 𝒟 × ℛ → ℝ with global sensitivity Δu.

• The exponential mechanism outputs r ∈ ℛ with probability proportional to: exp( ε·u(x, r) / (2Δu) ).

The exponential mechanism differs significantly from previous methods, such as the Laplace mechanism, as it ensures that its output is always a member of the set ℛ. This characteristic is particularly beneficial when choosing an item from a finite set, where introducing noise could lead to nonsensical results. For instance, when selecting a date for a major meeting, it is crucial to utilize each participant's calendar to maximize attendance while ensuring differential privacy. Adding noise to a date could result in shifting a Friday to a Saturday, potentially causing more conflicts. Therefore, the exponential mechanism is ideal for such scenarios, as it allows for the selection of a date without introducing any noise.

The exponential mechanism offers a versatile approach to defining ε-differentially private mechanisms by selecting an appropriate scoring function u. By analyzing the sensitivity of this scoring function, one can easily establish the proof of differential privacy.
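A minimal sketch (assumptions: a finite candidate set and a scoring function whose global sensitivity delta_u is supplied by the analyst) of the exponential mechanism, using the meeting-date example from the text, where a date's score is the number of participants who are free:

```python
import numpy as np

def exponential_mechanism(candidates, scores, delta_u: float, epsilon: float):
    """Selects one candidate with probability proportional to
    exp(epsilon * score / (2 * delta_u)), which satisfies epsilon-DP."""
    scores = np.asarray(scores, dtype=float)
    weights = np.exp(epsilon * scores / (2.0 * delta_u))
    probabilities = weights / weights.sum()
    return np.random.choice(candidates, p=probabilities)

# Meeting-date example: score = number of attendees available on each date.
dates = ["Mon", "Tue", "Wed", "Thu", "Fri"]
availability = [4, 9, 7, 2, 8]   # one person joining/leaving changes a score by at most 1
print(exponential_mechanism(dates, availability, delta_u=1.0, epsilon=1.0))
```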

The exponential mechanism, while useful for theoretical lower bounds in differential privacy, can lead to looser performance guarantees and is often challenging to implement. Although it demonstrates the existence of differentially private algorithms, practical implementations frequently achieve similar outcomes through alternative methods.

Notable mechanisms for Time-series data

6.1 Laplace mechanism (LPA – Laplace Perturbation Algorithm)

To ensure differential privacy when a trusted server is involved, the algorithm incorporates appropriately selected noise into the true responses. This noise is drawn from the Laplace distribution, denoted Lap(λ), whose probability density function (PDF) is f(z) = (1/(2λ)) · exp(−|z|/λ).

Lap(λ) has mean 0 and variance 2λ². We also denote Lapⁿ(λ) as a vector of n independent Lap(λ) random variables.

The LPA algorithm processes a query sequence Q and utilizes a parameter λ to manage the Laplace noise. Initially, it accurately computes the true answers, denoted as Q(I), and subsequently adds independent Laplace noise, specifically Lap(λ), to each answer in Q(I). Ultimately, the algorithm outputs these perturbed results.

Differential privacy is guaranteed if the parameter λ of the Laplace noise is calibrated according to the L1 sensitivity of Q.
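A minimal sketch of the LPA idea (assumption: the caller supplies the L1 sensitivity of the whole query sequence Q, e.g. n·Δ for n per-timestamp queries that each change by at most Δ when one record changes), adding independent Lap(λ) noise to every timestamp:

```python
import numpy as np

def lpa(true_answers: np.ndarray, l1_sensitivity: float, epsilon: float) -> np.ndarray:
    """Laplace Perturbation Algorithm: independent Lap(lambda) noise per answer,
    with lambda = L1-sensitivity(Q) / epsilon so the whole sequence is epsilon-DP."""
    lam = l1_sensitivity / epsilon
    return true_answers + np.random.laplace(scale=lam, size=true_answers.shape)

# Example: daily counts over a week; assume one individual affects each day's
# count by at most 1, so the sequence's L1 sensitivity is at most 7.
daily_counts = np.array([10.0, 12.0, 9.0, 15.0, 11.0, 14.0, 13.0])
print(lpa(daily_counts, l1_sensitivity=7.0, epsilon=1.0))
```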

Time-series data is characterized by temporal correlations at each timestamp; however, the LPA solution overlooks these correlation characteristics. As a result, there is a significant risk of introducing excessive noise, making it highly unlikely to keep the added noise at a level that maintains data utility.

6.2 Discrete Fourier Transform (DFT) with Laplace mechanism (FPA – Fourier Perturbation Algorithm)

To tackle the correlation characteristic in time-series while protecting privacy, Rastogi [10] proposed the FPA (Fourier Perturbation Algorithm). FPA is a compression-based method that first applies the Discrete Fourier Transform (DFT) to the true query answers, then applies LPA to the Fourier coefficients. The perturbed coefficients undergo the inverse DFT (IDFT) to obtain the resulting perturbed sequence.

Compression methods move the time series into the frequency domain, so the noise added there becomes correlated (rather than independent) once transformed back. Consequently, the Fourier Perturbation Algorithm (FPA) is more effective for perturbing time-series data.

The FPA_k algorithm is designed to provide accurate answers to queries under differential privacy with minimal error. It achieves this by compressing the answers of a query sequence Q(I) through an orthonormal transformation. Essentially, the algorithm identifies a k-length query sequence, denoted as F^k = ⟨F₁^k, …, F_k^k⟩, where k is significantly smaller than n, such that the answers F^k(I) can be used to approximately compute Q(I).

We can then perturb F^k(I) instead of Q(I), applying noise lower by a factor of roughly n/k, while still maintaining differential privacy. This introduces an additional error, because F^k(I) may not perfectly reconstruct Q(I); however, by choosing an appropriate k and F^k, this reconstruction error is significantly lower than the perturbation error caused by adding noise directly to Q(I).

A good F^k can be found using any orthonormal transformation; we use the Discrete Fourier Transform (DFT) in our algorithm.
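A minimal sketch of the FPA_k pipeline (assumptions: numpy's FFT, and a caller-supplied noise scale lam that must be calibrated to the sensitivity of the retained Fourier coefficients as analyzed in [10]): DFT, perturb the first k coefficients with Laplace noise, pad with zeros, and invert.

```python
import numpy as np

def fpa_k(true_answers: np.ndarray, k: int, lam: float) -> np.ndarray:
    """Fourier Perturbation Algorithm sketch: keep the first k DFT coefficients,
    add Laplace noise of scale lam to them, zero out the rest, and apply the
    inverse DFT to obtain a correlated-noise perturbation of the series."""
    n = len(true_answers)
    coefficients = np.fft.fft(true_answers)
    kept = coefficients[:k]
    # Perturb real and imaginary parts of the retained coefficients.
    noisy = kept + np.random.laplace(scale=lam, size=k) \
                 + 1j * np.random.laplace(scale=lam, size=k)
    padded = np.concatenate([noisy, np.zeros(n - k, dtype=complex)])
    return np.real(np.fft.ifft(padded))

daily_counts = np.array([10.0, 12.0, 9.0, 15.0, 11.0, 14.0, 13.0, 16.0])
print(fpa_k(daily_counts, k=3, lam=2.0))
```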

Figure 5 The visualization process of LPA and DFT (or FPA) - taken from [19]

6.3 Temporal perturbation mechanism

Adapted from [3], [4] by the research group of Dr. Qingqing Ye and her team.

While LPA and FPA focus on perturbing the values of the time-series (so-called Value Perturbation mechanisms), there is a field of research called Temporal Perturbation, which focuses on perturbing the timestamps instead of the values. This area supports protecting privacy in value-critical scenarios such as health biosensor and financial time-series data.

In recent research ([3]), the theory of Local Differential Privacy in the Temporal setting (TLDP) has been proposed: given a privacy budget ε, a randomized algorithm A satisfies ε-TLDP if and only if for any two neighboring time series S and S', and any possible output R of A, the following inequality holds:

Pr[A(S) = R] ≤ e^ε · Pr[A(S') = R].

The degree of privacy is controlled by the privacy budget ε. Since the whole time series is released for analysis, the output R of A is simply a perturbed time series.

TLDP and VLDP are distinct privacy models that originate from the concept of Local Differential Privacy (LDP), yet they can be converted into one another under specific circumstances. A proven theorem indicates that any perturbation meeting the criteria for VLDP will also comply with TLDP.

The theorem demonstrates that this conversion comes at the cost of a doubled privacy budget under TLDP. Additionally, it reveals that any temporal perturbation meeting the ε-TLDP criteria for a window of size k can also fulfill the ε/2-VLDP requirements for any value within that window, provided that the skewness of the value distribution remains bounded.

To understand this state-of-the-art theory further, we need to explore how the data utility is quantified, along with three proposed mechanisms:

The cost of TLDP arises from inaccuracies in time series analysis caused by the released time series R, which deviates from S through missing, repeated, empty, and delayed values. We quantify this cost by assessing these four specific factors:

• Missing cost: This occurs when a value Sᵢ is missed, i.e., it does not appear in the time window starting from Rᵢ (R(i+1), R(i+2), …, R(i+k−1)). For simplicity, we assume each missing value bears a unit cost of M.

• Repetition cost: This occurs when a value Sᵢ is repeatedly released in the time window starting from Rᵢ (R(i+1), R(i+2), …, R(i+k−1)). For simplicity, we assume each occurrence of a repeated value bears a unit cost of N.

• Empty cost: This refers to the situation where no value is assigned to Rᵢ at the time it is supposed to be released, namely at timestamp tᵢ. For simplicity, we assume each empty release at a timestamp incurs a unit cost of E.

• Delay cost: This arises when a value Sᵢ is released at a later timestamp tⱼ (where j > i). For simplicity, we assign a unit cost of D for a one-timestamp delay, leading to a delay cost of D·(j − i) for Sᵢ being released at Rⱼ. To prevent double counting, any preceding costs take precedence over the delay cost; for instance, a repeated value incurs no delay cost.

In the Backward Perturbation mechanism (BPA), at each timestamp tᵢ the protocol probabilistically releases a value Rᵢ drawn from S(i−k+1), S(i−k+2), …, Sᵢ, i.e., the values at the k most recent timestamps. To satisfy ε-TLDP, we set the release probabilities accordingly.

Here, we denote the two possible probabilities as p₀ and p₁. It is important to highlight that neighboring time series differ in no more than two timestamps, ensuring compliance with ε-TLDP.

Figure 6 The visualization of the Backward Perturbation mechanism (BPA) - taken from [4]

As opposed to the Backward Perturbation mechanism, which finds for each Rᵢ a previous S(i−j) to dispatch, the Forward Perturbation mechanism dispatches each Sᵢ to one of the R(i+j)'s in the next k timestamps. Like BPA, the method satisfies ε-TLDP.

In this mechanism, multiple values can be dispatched to the same slot R(i+j), leading to the overwriting of values. Consequently, only the last dispatched value is retained at R(i+j), while all previously dispatched values become unavailable.

Figure 7 The visualization of the Forward Perturbation mechanism - taken from [4]

Due to the data utility costs (missing, repetition, empty, and delay) suffered by both BPA and FPA, the researchers also propose a mechanism called the Threshold mechanism, which minimizes these costs.

The cost analysis table helps to explain this in more detail.

Figure 8 The cost analysis table for the BPA and FPA methods - taken from [4]

Figure 9 The pseudo-code for Threshold Mechanism - taken from [4]

The proposed Threshold Mechanism algorithm effectively addresses the missing, repetition, and empty costs. Detailed proofs supporting its effectiveness, although more involved than for the established methods, are provided in the referenced paper [4].

6.4 STL-DP – Perturbed time-series by applying DFT with Laplace mechanism on trends and seasonality

Figure 10 The process of how the STL-DP mechanism works - taken from [27]

STL-DP is a method published in October 2022 by Kyunghee Kim, Minha Kim, and Simon Woo at ACM CIKM 2022 (PAS workshop), aimed at safeguarding privacy in time-series data within the Differential Privacy framework. The key innovation of STL-DP is its incorporation of STL decomposition: the DFT-with-Laplace perturbation is applied to the trend and seasonality components, setting it apart from existing techniques.
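A minimal sketch of the idea described above (assumptions: statsmodels' STL implementation, a naive even treatment of trend and seasonality, and Laplace noise on truncated DFT coefficients whose scale lam would need to be calibrated to the components' sensitivity; the paper's exact algorithm may differ):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

def perturb_dft(component: np.ndarray, k: int, lam: float) -> np.ndarray:
    """FPA-style perturbation: noise the first k DFT coefficients, then invert."""
    coeffs = np.fft.fft(component)
    coeffs[:k] += np.random.laplace(scale=lam, size=k) \
                + 1j * np.random.laplace(scale=lam, size=k)
    coeffs[k:] = 0.0
    return np.real(np.fft.ifft(coeffs))

def stl_dp_sketch(series: pd.Series, period: int, k: int, lam: float) -> pd.Series:
    """Decompose the series with STL, perturb only trend and seasonality,
    keep the residual untouched, and recombine."""
    decomposition = STL(series, period=period).fit()
    noisy_trend = perturb_dft(decomposition.trend.to_numpy(), k, lam)
    noisy_seasonal = perturb_dft(decomposition.seasonal.to_numpy(), k, lam)
    perturbed = noisy_trend + noisy_seasonal + decomposition.resid.to_numpy()
    return pd.Series(perturbed, index=series.index)

# Example: two months of synthetic daily sales with a weekly pattern.
index = pd.date_range("2023-01-01", periods=60, freq="D")
sales = pd.Series(50 + 10 * np.sin(2 * np.pi * np.arange(60) / 7), index=index)
print(stl_dp_sketch(sales, period=7, k=5, lam=2.0).head())
```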

EXPERIMENT DESIGNS

Experiment designs

Company XYZ, a leading FMCG corporation in the EU, is strategically preparing to establish data partnerships with supermarkets and eCommerce platforms in Vietnam to enhance customer service through personalized experiences.

With this data collaboration from these data providers, company XYZ can unlock multiple useful analytics:

• Consumer segmentation for finding good/bad/churn customer groups

• Understand purchasing behavior patterns to give better and more relevant promotions

• Forecast potential quantity of each customer group on each category to prepare better service

Based on those analytics, XYZ can recommend to supermarkets and eCom platforms which customers to target with very specific and relevant promotion programs, to upsell for both sides.

1.2 Data structure aligns with data provider

Table 1 Original Shopping data table structure

Field name | Data type | Remarks
PseudoUserID | String | PseudoID for marking a user and their purchase behavior. It is not a national ID, loyalty ID, or any ID that can identify the user directly
Gender | String | Male/Female/Others
RoyaltyRank | String | Bronze/Silver/Gold/Platinum/Diamond

To address the growing concerns among FMCG competitors regarding consumer product selection, the data provider will only deliver data aggregated at the "PurchaseDate" and "ProductCategory" levels. This still enables insights into user behavior by tracking individual purchases on a daily basis (see the aggregation sketch after the list below):

• if they buy multiple products inside one category, then only one aggregated record gets exported

• if they buy in one store, multiple times per day, then only one aggregated record gets exported
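A minimal sketch (assumptions: pandas, hypothetical column names matching Table 2) of the aggregation rule described above: one record per pseudo-user, store, day, and product category, regardless of how many individual purchases occurred.

```python
import pandas as pd

def aggregate_for_sharing(transactions: pd.DataFrame) -> pd.DataFrame:
    """Collapses raw transactions to one row per (PseudoUserID, LocationID,
    PurchaseDate, ProductCategory), summing quantity and amount."""
    return (transactions
            .groupby(["PseudoUserID", "LocationID", "PurchaseDate", "ProductCategory"],
                     as_index=False)
            .agg(Quantity=("Quantity", "sum"), TotalAmount=("TotalAmount", "sum")))

raw = pd.DataFrame({
    "PseudoUserID": ["u1", "u1", "u2"],
    "LocationID": ["store_A", "store_A", "store_A"],
    "PurchaseDate": ["2023-05-01", "2023-05-01", "2023-05-01"],
    "ProductCategory": ["Ice cream", "Ice cream", "Shampoo"],
    "Quantity": [2, 1, 1],
    "TotalAmount": [60000, 30000, 45000],
})
print(aggregate_for_sharing(raw))   # u1's two ice-cream purchases collapse into one row
```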

Concerns regarding user privacy have led data providers to consider removing sensitive information such as "Gender," "Year of Birth," and "Loyalty Rank." In the following section, we will analyze the implications of retaining or omitting this data.

Table 2 Shopping Data table structure for the use-case

Column name | Data type | Note

The LocationID field in the shared dataset transparently conveys location information, while the PseudoUserID serves to anonymize real user IDs within the system. Additionally, the ProductCategory is designed to fulfill the specified purpose effectively.

Quantity | Integer | Aggregate volume in each category
TotalAmount | Integer | Aggregate amount in each category (sum of Quantity * Price)

Each month, data providers such as supermarkets and eCommerce platforms will transmit data to XYZ company's collaborative data platform. This system operates independently from XYZ's internal systems to ensure compliance with data governance standards and regulations, including the EU's General Data Protection Regulation (GDPR) and Vietnam's Personal Data Protection Law.

Concerns about individual privacy emerged during discussions of the data partnership, particularly the risk of re-identification. As a result, the partners have opted to postpone execution until an appropriate solution is in place to address these issues.

An attacker could identify a target victim within the dataset by leveraging knowledge of their purchasing behavior, such as weekend shopping habits, weekly food purchases, or consistent ice-cream orders. Such background information, combined with the shared data, significantly increases the likelihood of locating the victim.

A further concern is the potential linking of customer data across platforms, for example between supermarket ABC and eCommerce platform DEF. This would allow attackers to identify unique patterns associated with specific users, turning the data platform into a target for linkage attacks.

In addition to privacy issues, there is a significant risk associated with sharing product details on invoices, as competitors can gain insights into market share through this data partnership strategy.

Beyond privacy, any proposed solution must be lightweight and capable of handling the immense volume of daily transactions generated by the data providers. For instance, District 7 of Ho Chi Minh City, Vietnam has approximately 500,000 citizens, 125,000 families, and 15 large supermarkets, so each supermarket can potentially serve around 10,000 customers daily.

When addressing data privacy on the platform, it is crucial to focus on key security concerns such as unauthorized access, data manipulation, system hacking, and system availability.

2. Problem analysis

2.1 Revisit the GDPR related terms for data sharing

As of November 2022, Vietnam's Law on Personal Data Protection had not yet been fully established, so this case is reviewed under GDPR compliance. This requires a careful examination of data collaboration processes that involve potential Personally Identifiable Information (PII). Additional terms and conditions may apply, but the focus here is on those that directly pertain to the analysis of the issue at hand.

Article 5: Principles relating to processing of personal data:

• Article 5(1)(a) emphasizes the principle of lawfulness, fairness, and transparency in processing personal data

• Article 5(1)(b) emphasizes the principle of purpose limitation, asserting that personal data must be collected for clear, specific, and legitimate purposes, and prohibiting any further processing that contradicts these established purposes.

Article 6: Lawfulness of processing:

• Article 6(1)(a) specifies that data processing is lawful if the data subject has given consent to the processing of their personal data for one or more specific purposes.

• Article 6(1)(b) establishes that processing is lawful when it is necessary for the performance of a contract to which the data subject is party, or to take steps at the request of the data subject prior to entering into a contract.

• Article 6(1)(f) permits processing when it is necessary for the legitimate interests of the data controller or a third party, provided those interests are not overridden by the fundamental rights and freedoms of the data subject.

Article 9: Processing of special categories of personal data:

• Article 9(2)(a) restricts the processing of sensitive personal data unless explicit consent is obtained from the data subject or the processing is essential for specific purposes, including the establishment, exercise, or defense of legal claims.

Article 13: Information to be provided where personal data is collected from the data subject:

• Article 13(1)(e) requires data controllers to inform data subjects about the recipients or categories of recipients of their personal data

Article 28: Processor:

• Article 28(3)(e) requires that a data processing agreement oblige the data processor to assist the data controller in meeting its GDPR obligations, particularly with respect to the rights of data subjects and the handling of data transfers.

From these regulations, the following actions can be inferred based on my own understanding (formed while taking a Legal and Compliance course, without consulting legal professionals):

• Ensure that the data sharing is based on a legitimate legal basis, such as obtaining explicit consent from the consumers or fulfilling a contractual obligation

• Clearly define the specific purpose of sharing the data

• Clearly communicate to consumers how their data will be shared, the purpose of the data sharing, and any potential implications

• Share only the necessary data required for conducting data collaboration

• Exclude any personally identifiable information (PII) or sensitive data that is not directly relevant to the analysis purpose

Therefore, in the current use-case, I can infer that:

• Removing personally identifiable information (PII) such as usernames, emails, phone numbers, and IDs makes data sharing considerably smoother, since data providers can then share the information without requiring explicit consent from every consumer.

• To mitigate the risk of re-identification after PII removal and to stay compliant with GDPR, XYZ Company and its partners must specify concrete analytical requirements rather than simply requesting "more" analysis.

• To further strengthen privacy, advanced anonymization and perturbation techniques should be applied so that potential attackers and malicious analysts cannot easily compromise consumer information.

2.2 Potential attack models and countermeasures

Based on the theories discussed in the literature review, a couple of attack models could occur in the original case (the Table 1 data structure):

• Record linkage: Using the quasi-identifier combination of LocationID, Gender, YearOfBirth, and LoyaltyRank, an attacker can narrow the search considerably, especially when targeting individuals in small subgroups, such as those with gender 'Others' or a Platinum loyalty rank in a specific region. Reducing the number of viable candidates in this way increases the chance of successful re-identification.

• Attribute linkage: An attacker can combine the transactional data with supplementary information to identify individuals. For instance, knowing that a victim consistently buys ice cream on Saturdays at a specific supermarket, together with distinctive products visible in the victim's publicly available Facebook photos, allows the attacker to narrow the search and substantially raises the likelihood of identifying the target.

The Table 2 data structure mitigates these risks by aggregating away individual product details, so attackers cannot access item-level information. Nonetheless, some risks remain:

• If an attacker obtains an individual's receipts for a span of 1-2 weeks, they can aggregate them into the same categories and look for matching patterns or sequences in the shared data, which can reveal the person's behavior over an entire year, including spending habits and routines.

• If an attacker knows a consistent pattern in an individual's behavior, such as buying one ice cream every weekend for several months, they can use this information to pinpoint the victim's records.

While k-Anonymity is effective against record linkage, it falls short in addressing attribute linkage. The l-Diversity model appears, at first glance, to offer a solution for this use case. However, the literature provides little guidance on applying l-Diversity to time-series data, and applying it to high-dimensional datasets is notoriously difficult. In addition, the record-insertion activity it requires introduces redundancy and confusion, which can compromise accuracy and degrade database performance on the data providers' side.

3. Evaluation methodology

As mentioned, data utility after applying Differential Privacy can be understood as the usefulness of the resulting analytics. Therefore, the utility measures to consider are:

• Number of correct consumers in the segmentation before and after the perturbation

• The accuracy of the forecast model on the original data versus the perturbed data (forecast method: Simple Linear Regression)

I will use RMSE (Root Mean Squared Error) to measure accuracy, one of the most common error metrics in the data science community:
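For n evaluation points, where $y_i$ is the value obtained on the original data and $\hat{y}_i$ the corresponding value on the perturbed data:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$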

To audit the privacy level of the perturbed dataset, there are a couple of approaches:

• Privacy loss: evaluating the common epsilon value range and measuring the noise in the output

• Simulation evaluation: re-attacking the perturbed dataset with data analysis techniques to see whether the victim (or a distinctive group) can still be found

For experimental purposes, I will conduct the evaluation following this process (a minimal sketch of the loop is given after the list):

• For each epsilon, generate the perturbed output

• Calculate the RMSE of the perturbed output to assess data utility

• Attempt to re-attack the perturbed output
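The sketch below is self-contained: it uses a toy Poisson series in place of the real aggregated data and plain Laplace perturbation (LPA) in place of the full set of mechanisms, so everything except the loop structure is an assumption.

```python
import numpy as np

def rmse(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))

def lpa_perturb(series, epsilon, sensitivity=1.0):
    # Laplace Perturbation Algorithm; FPA or STL-DP would be swapped in here.
    return series + np.random.laplace(0.0, sensitivity / epsilon, size=len(series))

# Toy daily aggregate standing in for one consumer group's purchase volume.
series = np.random.poisson(lam=50, size=365).astype(float)

# Epsilon grid combining the values suggested by Dwork and Boenisch.
for eps in [0.01, 0.1, np.log(2), np.log(3), 5, 10]:
    perturbed = lpa_perturb(series, epsilon=eps)
    print(f"epsilon={eps:.3f}  RMSE={rmse(series, perturbed):.2f}")
    # Re-attack step (not shown): rerun the segmentation / pattern search on
    # `perturbed` and check whether the target individual is still identifiable.
```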

This evaluation aims to demonstrate the effectiveness of Differential Privacy mechanisms in concealing individual identities while identifying the optimal DP properties that strike a balance between data privacy and utility.

4. Privacy protection proposal

To address the requirements above, I will apply three perturbation techniques: the Laplace mechanism (LPA), the Fourier Perturbation Algorithm (FPA), and STL-DP. The Threshold mechanism and the other temporal methods are excluded from this use case, as value perturbation is considered adequate given the sales volumes involved.

To apply these algorithms, it is necessary to plan out the prerequisite components:

• Epsilon: values recommended by Dwork include {0.01, 0.1, ln 2, ln 3}, while Boenisch suggests {5, 10}. This parameter determines the level of privacy protection: smaller epsilon values give stronger privacy guarantees.

• Sensitivity: the sensitivity of the data is calculated as $M\sqrt{Tk}$

Here M denotes the maximum bound of the data domain, T the length of the time series, and k the number of coefficients kept in the Fast Fourier Transform (FFT). Sensitivity quantifies how much an algorithm's output can change when the input data changes.

To ensure privacy protection while maintaining data utility, it is essential to select suitable values for epsilon and sensitivity before applying the Laplace mechanisms, FPA, and STL-DP to perturb time-series data.
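As a quick sanity check on these choices, the sketch below computes the Laplace noise scale implied by each epsilon for an assumed bound M, series length T, and coefficient count k (all three values are illustrative, not taken from the real data).

```python
import numpy as np

M = 100   # assumed upper bound on a single aggregated value
T = 365   # assumed length of the time series (days)
k = 20    # assumed number of DFT coefficients kept by FPA / STL-DP

sensitivity = M * np.sqrt(T * k)          # sensitivity M * sqrt(T * k) for the Fourier-based mechanisms

for epsilon in [0.01, 0.1, np.log(2), np.log(3), 5, 10]:
    scale = sensitivity / epsilon         # Laplace scale: smaller epsilon -> more noise
    print(f"epsilon={epsilon:.3f}  Laplace scale={scale:,.1f}")
```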

CHAPTER 5: EXPERIMENT IMPLEMENTATIONS


References
[3] Q. Ye et al., "Stateful Switch: Optimized Time Series Release with Local Differential Privacy," presented at the IEEE International Conference on Computer Communications (INFOCOM), New York, USA, 2023.

[6] K. Nissim et al., "Differential privacy: A primer for a non-technical audience," Vanderbilt Journal of Entertainment & Technology Law, vol. 21, no. 1, pp. 209-275, 2018.

[7] R. C. Wong and A. W. Fu, Privacy-Preserving Data Publishing: An Overview, Morgan and Claypool Publishers, 2010.

[10] V. Rastogi and S. Nath, "Differentially private aggregation of distributed time-series with transformation and encryption," in ACM SIGMOD International Conference on Management of Data, Indiana, USA, 2010, pp. 735-746.

[11] J. P. Near and C. Abuah, Programming Differential Privacy, 2022. [Online]. Available: https://uvm-plaid.github.io/programming-dp/

[12] F. Natasha, "Differential Privacy for Metric Spaces: Information-Theoretic Models for Privacy and Utility with New Applications to Metric Domains," Ph.D. dissertation, Institut Polytechnique de Paris / Macquarie University, Sydney, Australia, 2021.

[13] A. Narayanan and V. Shmatikov, "Robust De-anonymization of Large Sparse Datasets," in 2008 IEEE Symposium on Security and Privacy (SP 2008), Oakland, CA, USA, 2008, pp. 111-125, doi: 10.1109/SP.2008.33.

[16] B. C. M. Fung, K. Wang, R. Chen and P. S. Yu, "Privacy-Preserving Data Publishing: A Survey of Recent Developments," ACM Computing Surveys, vol. 42, no. 4, pp. 1-53, 2010.

[18] L. Fan and L. Xiong, "Real-time aggregate monitoring with differential privacy," in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, Maui, Hawaii, USA, 2012, pp. 2169-2173.

[19] L. Fan and L. Xiong, "Adaptively Sharing Time-Series with Differential Privacy," ArXiv, vol. abs/1202.3461, 2012.

[20] C. Dwork and A. Roth, "The Algorithmic Foundations of Differential Privacy," Foundations and Trends in Theoretical Computer Science, vol. 9, no. 3-4, pp. 211-407, 2014.

[21] C. Dwork, A. Smith and J. Ullman, "Exposed! A Survey of Attacks on Private Data," Annual Review of Statistics and Its Application, vol. 4, pp. 61-84, 2016.

[22] C. Dwork, "Differential Privacy: A Survey of Results," in Theory and Applications of Models of Computation (TAMC 2008), Lecture Notes in Computer Science, vol. 4978, 2008.

[23] C. Dwork, F. McSherry, K. Nissim and A. Smith, "Calibrating noise to sensitivity in private data analysis," in Proceedings of the Third Conference on Theory of Cryptography, New York, USA, 2006, pp. 265-284.

[25] F. Boenisch, "Differential Privacy: General Survey and Analysis of Practicability in the Context of Machine Learning," M.Sc. thesis, Freie Universität Berlin, 2019.

[1] M. Barbaro and T. Zeller, "A Face Is Exposed for AOL Searcher No. 4417749," The New York Times, Aug. 9, 2006. [Online]. Available: https://www.nytimes.com/2006/08/09/technology/09aol.html. [Accessed: Jun. 2023].

[14] A. Machanavajjhala, J. Gehrke, D. Kifer and M. Venkitasubramaniam, "L-diversity: privacy beyond k-anonymity," in 22nd International Conference on Data Engineering (ICDE'06), Atlanta, GA, USA, 2006, pp. 24-24, doi: 10.1109/ICDE.2006.1.

[15] N. Li, T. Li and S. Venkatasubramanian, "t-Closeness: Privacy Beyond k-Anonymity and l-Diversity," in 2007 IEEE 23rd International Conference on Data Engineering, Istanbul, Turkey, 2007, pp. 106-115, doi: 10.1109/ICDE.2007.367856.

[24] C. Dwork, G. N. Rothblum and S. Vadhan, "Boosting and Differential Privacy," in 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, Las Vegas, NV, USA, 2010, pp. 51-60, doi: 10.1109/FOCS.2010.12.

[29] G. Wright, "RFM analysis (recency, frequency, monetary)," TechTarget. [Online]. Available: https://www.techtarget.com/searchdatamanagement/definition/RFM-analysis
