
KC-Slice: A dynamic privacy-preserving data publishing technique for multisensitive attributes

Onashoga, S. A.; Bamiro, B. A.; Akinwale, A. T.; Oguntuase, J. A.
Information Security Journal: A Global Perspective, Vol. 26 (2017), pp. 121–135. DOI: 10.1080/19393555.2017.1319522

Privacy preservation methods for anonymizing multiple sensitive attribute (MSA) data in the field of privacy-preserving data publishing (PPDP) mostly enforce the ℓ-diversity privacy model on the MSAs, coupled with quasi-identifier (QID) generalization and tuple suppression, resulting in high data degradation of the published releases. Most existing work also produces static releases rather than dynamic, web-based ones. In this article, we propose KC-Slice, a modified LKC-privacy model combined with the slicing technique, for anonymizing MSA data dynamically. KC-Slice produces releases that protect the dataset content against most attack models and reduce data degradation through cell suppression and random permutation of the QIDs. Experimental results and evaluation using data metrics and information entropy show a remarkable reduction in data degradation and suppression ratio.

Keywords: Anonymization; privacy-preserving; data utility; data publishing; multi-sensitive attributes

1. Introduction

Government regulations, synergies among corporations, and public interests necessitate the publishing and sharing of huge repositories of digital information. In situations where this information contains personally identifiable information, the privacy of the real data owners may be violated. Bertino, Lin, and Jiang ([3]) gave two definitions of privacy: "The right of an individual to be secured from unauthorized disclosure of information about oneself that is contained in an electronic repository" and "The right of an entity to be secured from unauthorized disclosure of sensible information that are contained in an electronic repository or that can be derived as aggregate and complex information from data stored in an electronic repository." Fung, Wang, Fu, and Yu ([9]) gave a vivid privacy violation scenario from 2007, when two computer disks containing names, addresses, birth dates, and national insurance numbers for 25 million people went missing while being sent from one British government department to another. Xu, Ma, Tang, and Tian ([27]) reported that 87% of the population of the United States can be uniquely identified from a given dataset published to the public, which starkly reflects the privacy risk in the publishing scenario.

The public is deeply concerned about their privacy and the consequences of sharing and publishing their personal information (Fung et al., [9]). Though there are laws prohibiting privacy disclosures, these laws may not effectively curb individuals who are not law-abiding or situations where data with sensitive information gets into the wrong hands. To provide technological support, researchers in the field of privacy-preserving data publishing (PPDP) have been developing methods and tools for publishing and sharing data while ensuring privacy preservation of the data content, which transitively protects the real data owners. A typical PPDP model, shown in Figure 1, has two phases: a data collection phase and a data publishing phase.

Graph: Figure 1. A simple model of PPDP.

In the data collection phase the data holders (who could be government establishments and institutions, corporate organizations, private companies, etc.) keep records of their clients and customers (the data owners) in an electronic repository due to some business or social interactions. The data holder seeks the service of the data publisher in the publishing phase to anonymize the data before releasing it to data recipients who may then conduct data analytics and mining on the published release.

The taxonomy of PPDP derived from the literature review, shown in Figure 2, organizes the anonymization procedures for preserving privacy by the different privacy models and techniques through which they are carried out.

Graph: Figure 2. Taxonomy of privacy-preserving data publishing methodologies.

Privacy models reported in the literature include, among others, k-anonymity, proposed by Samarati and Sweeney ([25]) to thwart record and attribute linkage attacks. Machanavajjhala, Kifer, and Gehrke ([19]) proposed ℓ-diversity to thwart homogeneity and background knowledge attacks. t-Closeness was proposed by Li, Li, and Venkatasubramanian ([13]) to prevent skewness and similarity attacks. ε-Differential privacy was proposed by Dwork ([7]). Mohammed, Fung, Hung, and Lee ([22]) proposed LKC-privacy for anonymizing high-dimensional data. The anonymization techniques used to enforce these privacy models include generalization and suppression and their variants, reported by Chen ([5]). Bucketization was reported by Fung, Wang, Chen, and Yu ([8]). The slicing technique was proposed by Li, Li, Zhang, and Molloy ([15]).

The application of these models and techniques requires that the attributes of the dataset be classified based on sensitivity. Four classifications of dataset attributes have been adopted by researchers, namely, explicit identifier, quasi-identifier (QID), sensitive attributes (SA), and nonsensitive attributes. Explicit identifiers such as full name, social security number, and passport number, which explicitly or unambiguously identify record owners, are removed from the dataset (Fung et al., [8]; Mohan, Phanindra, & Prasad, [23]). Quasi-identifiers such as age, sex, height, zip code, and state, which indirectly identify a person or could potentially identify a record owner are anonymized because these attributes when taken together or linked with external information could re-identify individual record owners (Albouna, Clifton, & Malluhi, [2]; Mogre, Agarwal, & Patil, [21]; Singh & Parihar, [26]). Sensitive attributes (SAs) such as diseases, salary, drug name, disability status, religion, and political affiliation, are any class of information whose unauthorized disclosure could be embarrassing or detrimental to the individual (Minelli, Chambers, & Dhiraj, [20]) and must not be linkable to him or her (Albouna et al., [2]). Nonsensitive attributes such as country and town are usually ignored as posing no privacy threat. Most anonymization operations are performed on the QIDs and the sensitive attributes to thwart different attack models.
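For concreteness, the minimal sketch below records this four-way classification for the example dataset of Section 3.3 as a plain Python mapping. The explicit-identifier and nonsensitive entries are placeholders assumed for completeness; they are not attributes of the published tables.

```python
# Roles follow the four-way classification above; the QID and SA names mirror the
# illustration in Section 3.3 (Table 2). Explicit identifiers are dropped outright,
# and nonsensitive attributes are left untouched.
ATTRIBUTE_ROLES = {
    "explicit_identifier": ["full_name", "passport_number"],        # removed before release
    "quasi_identifier": ["hrs", "age", "gender", "salary", "zip_code", "state"],
    "sensitive": ["education", "disease", "relationship", "occupation"],
    "nonsensitive": ["country", "town"],                             # ignored
}
```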

Han, Luo, Lu, and Peng ([11]) and Gal, Chen, and Gangopadhyay ([10]) observed that most work on PPDP that applied the aforementioned privacy models and techniques to thwart variants of attack models focused on single sensitive attribute (SSA) dataset anonymization, and that very little work exists on datasets with multiple sensitive attributes (MSAs). This work focuses on privacy preservation of categorical MSAs while reducing data degradation and the suppression ratio by proposing KC-Slice, a model that combines features of the LKC-privacy model and the slicing technique.

The remainder of this article is organized as follows: Section 2 reviews related work, Section 3 describes the KC-Slice technique, Section 4 presents the results and discussion, and Sections 5 and 6 present the contributions and recommendations of this work, respectively.

2. Related work

To preserve privacy in datasets with MSAs, Ye, Liu, Lv, and Feng ([28]) applied decomposition, which selects one of the MSAs as the primary sensitive attribute (PSA), subject to an ℓ-diversity privacy model enforced through noise addition. Das and Bhattacharyyu ([6]) observed that decomposition is not a dynamic publishing scenario, degrades data utility through noise addition, and enforces diversity only on the PSA. To address these drawbacks, Das and Bhattacharyyu ([6]) proposed decomposition+, which is dynamic and degrades data utility less. This technique is, however, not suitable for high-dimensional datasets, which are known to suffer from the curse of dimensionality. Liu, Jia, and Han ([17]) used a new k-anonymity algorithm based on ℓ-diversity, in which each k-anonymized QID record is linked with k sensitive attribute values. The cost of privacy preservation was a high suppression ratio resulting from the diversity enforcement. Han et al. ([11]) applied the SLicing On Multiple Sensitive attributes (SLOMS) and MSB-KACA algorithms based on ℓ-diversity for privacy preservation of the MSAs of a dataset. The quasi-identifier values are generalized based on the k-anonymity principle, and the sensitive values are sliced and bucketized to satisfy the ℓ-diversity requirement. This approach may lead to a large suppression ratio and information loss due to tuple suppression of sensitive attributes to enforce ℓ-diversity on the one hand and the generalization of quasi-identifier attributes on the other; high data degradation may be the resultant trade-off for privacy preservation. Liu, Shen, and Sang ([18]) used the MNSACM method, based on clustering and multisensitive bucketization, for anonymizing numerical multisensitive attributes of a dataset. The numerical sensitive attributes are placed in multiple groups such that every sensitive attribute corresponds to a single dimension of the multidimensional bucket. This approach has not been implemented on a real dataset, and no algorithm for it has been proposed. Onashoga, Bamiro, Akinwale, and Oguntuase ([24]) observed that most PPDP approaches anonymize datasets with categorical MSAs by enforcing the ℓ-diversity privacy model and its variants using generalization and suppression, leading to a high suppression ratio and distortion, thereby degrading the utility and quality of the released version.

Table 1 gives a summary of the strengths and weaknesses of related work.

Table 1. Summary of related MSA algorithms.

Algorithm/authors | Methods | Privacy model | Type and number of SAs | Strengths | Weaknesses
Decomposition (Ye, Liu, Lv, & Feng, 2009) | Partitioning and noise addition | ℓ-diversity | Categorical sensitive attributes (2) | 1. QID attributes not generalized. 2. SAs not suppressed. 3. QIDs share unions of SAs. 4. Enhanced data utility and privacy. | 1. Data distortion by random noise addition (Das & Bhattacharyyu, 2010). 2. Unsuitable for dynamic releases (Das & Bhattacharyyu, 2010). 3. Unequal priority among SAs based on diversity.
Decomposition+ (Das & Bhattacharyyu, 2010) | Partitioning and noise addition | ℓ-diversity | Categorical sensitive attributes (2) | 1. Suitable for continuous release. 2. Improved noise addition. 3. Enhanced data utility and privacy. | 1. Data distortion by noise addition. 2. Unequal priority among SAs based on diversity. 3. Susceptible to similarity and homogeneity attacks.
New K-anonymity algorithm (Liu et al., 2012) | Generalization and suppression | k-anonymity and ℓ-diversity | Categorical sensitive attributes (3) | 1. Considers the sensitivity of the sensitive attributes. 2. Low time complexity. 3. Entropy of SAs within equivalence classes. 4. High degree of security. 5. SAs are untouched. | 1. Applicable to static datasets, not large datasets. 2. High suppression ratio on QIDs. 3. Susceptible to similarity, homogeneity, and probabilistic attacks.
SLOMS (Han et al., 2013) | Slicing and generalization | k-anonymity and ℓ-diversity | Categorical sensitive attributes (4) | 1. Anonymizes a higher number of sensitive attributes in a dataset. 2. Improved data utility and privacy. | 1. High suppression ratio due to QID generalization and ℓ-diversity enforcement on SAs. 2. High cost of execution time. 3. Does not consider the sensitivity of the SAs.
MNSACM method (Liu et al., 2015) | Clustering and multisensitive bucketization | Not available | Numerical sensitive attributes (2) | 1. Anonymization of numerical sensitive attributes. | 1. Has not been implemented on real datasets.

3. Preliminaries

This section describes the KC-Slice approach, an illustration of its operations, the implementation algorithms, and the evaluation metrics.

3.1. Implementation environment and dataset

This research work was carried out on a 32-bit Windows 7 OS running on a 1.6-GHz Intel processor, with 2.0 GB RAM and 280 GB hard disk storage. KC-Slice was implemented using PHP, JavaScript, Ajax, MySQL, and Apache Server technologies. The Adult dataset from the University of California, Irvine (UCI) Machine Learning Repository (Center for Machine Learning and Intelligent Systems; Blake & Merz, [4]) was used for implementation and experiments. Twelve attributes of this dataset were used.

3.2. KC-Slice

This work introduces KC-Slice as a new approach for anonymizing datasets with categorical MSAs. It combines modified features of the LKC-privacy model with the slicing technique and cell suppression, and it performs the anonymization process in phases.

The first phase classifies the attributes of the input dataset into sensitive attributes (SAs) and quasi-identifier attributes (QIDs). Four attributes were classified as SAs and eight others as QIDs. One attribute value within each SA column was designated as the high sensitive attribute (HSA) and the others as low sensitive attributes.

The second phase partitions the dataset horizontally into buckets of size K, each containing K distinct tuples. The frequency distributions of the SAs are considered in the creation of the buckets, and buckets that are skewed on the SAs are merged with other buckets. Privacy violation checks are then performed on the HSA within each SA column based on the privacy threshold C and the bucket size K. Any set of HSA cells within an SA column that violates this threshold is suppressed to the degree of its violation of the stated threshold. The resulting attribute values within each SA column are grouped into a cell with their counts, a single sensitive index (SID) number is generated for the cell, and the SA cells are then merged.
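A minimal sketch of the threshold check follows, assuming C is expressed as a fraction of the bucket size K; the exact rule is not spelled out in the text beyond "suppressed to the degree of its violation", so the helper below is only one plausible reading.

```python
import math

def cells_to_suppress(hsa_count: int, k: int, c: float) -> int:
    """How many occurrences of an HSA must be cell-suppressed in one SA column of a
    bucket of size k, assuming at most c*k occurrences may remain visible."""
    allowed = math.floor(c * k)
    return max(0, hsa_count - allowed)

# With k = 100 and c = 0.05, at most 5 visible occurrences are allowed, so an HSA
# appearing 12 times would have 7 of its cells suppressed.
print(cells_to_suppress(12, 100, 0.05))  # 7
```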

The third phase performs correlation operations on the QIDs. Sets of two highly correlated QIDs are concatenated together, with one SID appended to each QID column. The QID columns are then randomly permuted to break the links between the QID attributes.
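A sketch of this phase is shown below, assuming the correlated QID pairs have already been identified by the correlation test; the function and column names are illustrative, not the authors' implementation.

```python
import random

def concatenate_and_permute(rows, correlated_pairs):
    """Build the sliced QID columns: concatenate each highly correlated QID pair into
    one column, keep the remaining QIDs as single columns, then shuffle every column
    independently so that values in different columns can no longer be linked row by row."""
    columns = {}
    for a, b in correlated_pairs:                           # e.g. ("hrs", "salary")
        columns[f"{a}/{b}"] = [f"{r[a]}, {r[b]}" for r in rows]
    paired = {name for pair in correlated_pairs for name in pair}
    for name in rows[0]:
        if name not in paired:
            columns[name] = [r[name] for r in rows]
    for values in columns.values():
        random.shuffle(values)                              # random permutation per column
    return columns

# Toy rows mirroring Table 3:
rows = [
    {"hrs": 40, "salary": 90000, "age": 40, "gender": "M", "zip": 12411, "state": "Lagos"},
    {"hrs": 40, "salary": 120000, "age": 30, "gender": "F", "zip": 22311, "state": "Ogun"},
]
print(concatenate_and_permute(rows, [("hrs", "salary"), ("zip", "state")]))
```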

For each created bucket, KC-Slice publishes the outputs of the second and third phases: the grouped SA table and the randomly permuted QID table.

3.3. Anonymization illustration

KC-Slice operations are illustrated here with the following tables. Table 2 shows the original dataset, where education, disease, relationship, and occupation are taken as SAs. The HSAs for each SA are some-college, HIV, not-in-family, and Exec-managerial, respectively. The correlated QIDs are concatenated as shown in Table 3.

Table 2. Original data.

Hrs | Age | Gender | Salary | Zip code | State | Education | Disease | Relationship | Occupation
40 | 40 | M | 90,000 | 12411 | Lagos | Some-college | Flu | Other-relative | Other-service
40 | 30 | F | 120,000 | 22311 | Ogun | Preschool | HIV | Wife | Transport-moving
40 | 25 | M | 60,000 | 32243 | Oyo | Some-college | Malaria | Husband | Exec-managerial
40 | 50 | M | 200,000 | 42411 | Osun | HS-grad | Typhoid | Not-in-family | Adm-clerical
40 | 47 | M | 50,000 | 32278 | Oyo | Some-college | HIV | Husband | Adm-clerical
40 | 38 | F | 250,000 | 42344 | Osun | Bachelors | HIV | Not-in-family | Exec-managerial
40 | 43 | F | 125,000 | 52451 | Ekiti | Bachelors | HIV | Unmarried | Adm-clerical
42 | 60 | M | 70,000 | 32266 | Edo | Preschool | FLU | Not-in-family | Exec-managerial
30 | 55 | F | 300,000 | 12523 | Lagos | HS-grad | Diabetes | Husband | Other-service
10 | 49 | M | 58,000 | 22629 | Ogun | Some-college | HIV | Own-child | Exec-managerial

Table 3. Correlated quasi-attributes.

Hrs/salary | Age | Gender | Zip code/state
40, 90,000 | 40 | M | 12411, Lagos
40, 120,000 | 30 | F | 22311, Ogun
40, 60,000 | 25 | M | 32243, Oyo
40, 200,000 | 50 | M | 42411, Osun
40, 50,000 | 47 | M | 32278, Oyo
40, 250,000 | 38 | F | 42344, Osun
40, 125,000 | 43 | F | 52451, Ekiti
42, 70,000 | 60 | M | 32266, Oyo
30, 300,000 | 55 | F | 12523, Lagos
10, 58,000 | 49 | M | 22629, Ogun

The resulting SAs with SIDs after cell suppression of HSAs that violated the privacy threshold are shown in Table 4. Table 5 shows the concatenation of QIDs and SIDs. Table 6 is the randomly permuted version of Table 5. KC-Slice will publish Table 4 and Table 6.

Table 4. Clustered SAs after enforcement of privacy threshold.

SID | SAs
011 | #####(1), Bachelors(2), HS-grad(2), Preschool(2), Some-college(3)
021 | #####(2), Diabetes(1), FLU(2), HIV(3), Malaria(1), Typhoid(1)
031 | Other-relative(1), Wife(1), Husband(3), Not-in-family(3), Unmarried(1), Own-child(1)
041 | #####(1), Adm-clerical(3), Exec-managerial(3), Other-service(2), Transport-moving(1)

Table 5. Concatenation of correlated QIDs and SIDs of SAs.

Hrs/salary | Age | Gender | Zip code/state
40, 90000, 011 | 40, 021 | M, 031 | 12411, Lagos, 041
40, 120000, 011 | 30, 021 | F, 031 | 22311, Ogun, 041
40, 60000, 011 | 25, 021 | M, 031 | 32243, Oyo, 041
40, 200000, 011 | 50, 021 | M, 031 | 42411, Osun, 041
40, 50000, 011 | 47, 021 | M, 031 | 32278, Oyo, 041
40, 250000, 011 | 38, 021 | F, 031 | 42344, Osun, 041
40, 125000, 011 | 43, 021 | F, 031 | 52451, Ekiti, 041
42, 70000, 011 | 60, 021 | M, 031 | 32266, Oyo, 041
30, 30000, 011 | 55, 021 | F, 031 | 12523, Lagos, 041
10, 58000, 011 | 49, 021 | M, 031 | 22629, Ogun, 041

Table 6. Permuted QIDs and SIDs of SAs.

Hrs/salary | Age | Gender | Zip code/state
30, 30000, 011 | 55, 021 | F, 031 | 12523, Lagos, 041
40, 120000, 011 | 30, 021 | M, 031 | 22311, Ogun, 041
40, 250000, 011 | 38, 021 | M, 031 | 42344, Osun, 041
40, 125000, 011 | 43, 021 | M, 031 | 52451, Ekiti, 041
10, 58000, 011 | 49, 021 | M, 031 | 22629, Ogun, 041
40, 200000, 011 | 50, 021 | F, 031 | 42411, Osun, 041
40, 50000, 011 | 47, 021 | M, 031 | 32278, Oyo, 041
40, 60000, 011 | 25, 021 | M, 031 | 32243, Oyo, 041
40, 90000, 011 | 40, 021 | F, 031 | 12411, Lagos, 041
42, 70000, 011 | 60, 021 | F, 031 | 32266, Oyo, 041

3.4. KC-Slice algorithms

Two algorithms were used for implementing KC-Slice: the Bucket Creator and Privacy Violation Checker algorithm and the Bucket Partitioning and Slicing algorithm.

Algorithm 1 shows how the buckets are created from the distinct sensitive attribute values in all the sensitive attribute columns (lines 1–16). The created buckets are checked for privacy violation, and any HSAs that violate the threshold specified in the KC-Slice implementation model are suppressed by the amount of the excess over that threshold (lines 17–34).

  • Algorithm 1. Bucket Creator and Privacy Violation Checker
  • Input: T, C, s, h
  • Output: the created buckets B1, B2, ..., Bn

[Pseudocode listing (lines 1–34) not reproduced: lines 1–16 create the buckets from the distinct sensitive attribute values in all SA columns, and lines 17–34 check each SA column of each bucket against the privacy threshold and suppress the violating HSA cells.]
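Because the listing itself is not legible here, the following Python sketch only mirrors the prose description of Algorithm 1 under stated assumptions; the sequential fill-and-merge bucketing is one plausible reading, not the authors' exact procedure.

```python
def bucket_creator_and_privacy_checker(records, sa_columns, hsa_values, k, c, symbol="#####"):
    """Sketch of Algorithm 1 as described in the text: partition the dataset into
    buckets of about k tuples, then suppress HSA cells that exceed the threshold c
    within each SA column of each bucket."""
    # Lines 1-16: create buckets of k tuples; merge an undersized remainder into the
    # previous bucket (a stand-in for the merging of SA-skewed buckets).
    buckets = [records[i:i + k] for i in range(0, len(records), k)]
    if len(buckets) > 1 and len(buckets[-1]) < k:
        buckets[-2].extend(buckets.pop())

    # Lines 17-34: privacy-violation check and cell suppression per SA column.
    for bucket in buckets:
        for col in sa_columns:
            hsa = hsa_values[col]
            excess = max(0, sum(r[col] == hsa for r in bucket) - int(c * len(bucket)))
            for r in bucket:
                if excess == 0:
                    break
                if r[col] == hsa:
                    r[col] = symbol      # cell suppression only; the tuple is kept
                    excess -= 1
    return buckets
```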

Algorithm 2 shows how the SAs in each SA column of each bucket are separately partitioned vertically and the data values grouped with frequency counts. Sensitive index (SID) numbers are generated for each group of each sensitive attribute column. The sensitive attribute groups are then merged into a table with distinct SIDs, as shown in lines 1–14.

The QIDs are correlated using a correlation function. Highly correlated QID attributes are concatenated and subsequently permuted randomly, as depicted in lines 15–38. The resulting table is linked with the SIDs of the combined SA group table. The final anonymized and published release consists of two tables for each bucket: the grouped sensitive attribute table and the QID attribute table.


  • Algorithm 2. Bucket Partitioning and Slicing
  • Input: the buckets B1, B2, ..., Bn
  • Output: the anonymized release (a grouped SA table and a permuted QID table per bucket)

[Pseudocode listing (lines 1–38) not reproduced: lines 1–14 vertically partition the SA columns, group the values with their counts, and generate SIDs; lines 15–38 correlate and concatenate the QID columns, attach the SIDs, and randomly permute the columns before publishing.]
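The listing for Algorithm 2 is likewise not legible, so the sketch below follows the prose and the pattern of Tables 4–6. The SID scheme (column index plus a bucket digit, e.g. 011, 021, ...) and the assignment of one SID per QID column are inferred from those tables and are assumptions, not the authors' exact rules.

```python
import random
from collections import Counter

def bucket_partition_and_slice(bucket, sa_columns, qid_groups, bucket_no=1):
    """Produce the two published tables for one bucket: the grouped SA table
    (value counts per SA column, one SID per column) and the permuted QID table
    (correlated QIDs concatenated, tagged with an SID, columns shuffled)."""
    # Lines 1-14: vertical partition of the SA columns, grouping with counts and SIDs.
    sa_table, sids = [], []
    for i, col in enumerate(sa_columns, start=1):
        counts = Counter(r[col] for r in bucket)
        sid = f"{i:02d}{bucket_no}"
        sids.append(sid)
        sa_table.append((sid, ", ".join(f"{v}({n})" for v, n in counts.items())))

    # Lines 15-38: concatenate correlated QIDs, attach an SID to each QID column,
    # then permute every column independently.
    qid_table = {}
    for j, group in enumerate(qid_groups):                  # e.g. ("hrs", "salary") or ("age",)
        sid = sids[j % len(sids)]
        col = [", ".join(str(r[a]) for a in group) + f", {sid}" for r in bucket]
        random.shuffle(col)
        qid_table["/".join(group)] = col

    return sa_table, qid_table
```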

3.5. Analysis of the algorithm

The steps of the algorithm for anonymizing MSAs based on the input data are analyzed below.

Step 1. The algorithm partitions the dataset into buckets of size k.

Step 2. The algorithm checks each sensitive attribute column for privacy violation by its high sensitive attributes, comparing the percentage privacy threshold with each HSA's percentage frequency. The HSAs are suppressed by their excess over the specified threshold in each sensitive attribute column to enforce privacy. The process is iterated over all created buckets.

Step 3. The algorithm selects each sensitive attribute column from each bucket, groups its values and generates an SID for each group, and iterates this operation for every bucket.

Step 4. The algorithm performs a correlation test on pairs of quasi-identifier attributes drawn from the n quasi-identifier attributes of each bucket. Correlated attributes are concatenated into single columns until all the attributes are exhausted, and the correlated columns are then concatenated with the SIDs. This operation is performed for each bucket.

Step 5. The algorithm permutes the correlated columns and publishes the anonymized release.

The overall time complexity of the system algorithm is the sum of the costs of these five steps.

3.6. Evaluation metrics

Protecting privacy is undoubtedly a critical element in data publishing, but it is also important to preserve the utility of the published data. Li and Li ([14]) argued that privacy is an individual concept whereas utility is an aggregate concept, and that a framework for the privacy–utility trade-off in data publishing is still lacking. Bertino et al. ([3]) corroborated this, observing that there is no metric that is widely accepted by the research community. The metrics used in this work are described below.

3.6.1. Minimal distortion (MD)

This metric is a penalty-based approach for measuring distortion after performing generalization on a dataset based on the dataset taxonomy tree. Distortion count is incremented with each generalization or suppression operation instance on a record. It is a single-attribute measure used as a data metric and search metric.

3.6.2. Loss metric (LM)

LM is defined in terms of a normalized loss for each attribute of every tuple. For a tuple t and a categorical attribute A, suppose the value t[A] has been generalized to x. If |A| represents the size of the domain of attribute A and M represents the number of values in this domain that could have been generalized to x, then the loss for t[A] is

(1) loss(t[A]) = (M − 1) / (|A| − 1)

The loss for attribute A is defined as the average of loss(t[A]) over all tuples t. This metric was proposed to quantify information loss by measuring the ambiguity introduced by generalization and suppression, as shown in Liu ([18]).
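A small executable illustration follows, assuming Equation (1) takes the standard loss-metric form (M − 1)/(|A| − 1) implied by the definitions above; the function name and figures are purely illustrative.

```python
def lm_cell_loss(domain_size: int, covered: int) -> float:
    """Normalized loss of one generalized cell: (M - 1) / (|A| - 1), where |A| is the
    attribute's domain size and M is the number of domain values covered by the
    generalized value. A fully suppressed cell covers the whole domain and scores 1."""
    return (covered - 1) / (domain_size - 1)

# Example: a state generalized to a region covering 3 of 6 states loses 0.4;
# suppressing it outright (covering all 6 states) loses 1.0.
print(lm_cell_loss(6, 3), lm_cell_loss(6, 6))  # 0.4 1.0
```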

3.6.3. Weighted hierarchical distance (WHD)

This metric is suitable for measuring cell generalization, tuple generalization, and table generalization. Li, Wong, Fu, and Pei ([12]) defined this metric and its application as follows.

Definition 1 (Distortion of generalization of cells). Let h be the height of a domain hierarchy, and let 1, 2, ..., h be the domain levels from the most general to the most specific, respectively. Let the weight between domain levels j and j − 1 be predefined, denoted by w_{j,j−1}, where 2 ≤ j ≤ h. When a cell is generalized from level p to level q, where q < p, the weighted hierarchical distance of this generalization is defined as

(2) WHD(p, q) = (Σ_{j=q+1..p} w_{j,j−1}) / (Σ_{j=2..h} w_{j,j−1})

where w_{j,j−1} is the weight, which can be defined to enforce a priority in generalization.

For a uniform weight, w_{j,j−1} = 1 for every 2 ≤ j ≤ h, so that WHD(p, q) = (p − q) / (h − 1).

For a height weight, w_{j,j−1} = 1/(j − 1)^β, where 2 ≤ j ≤ h and β is a real number provided by the user.

Definition 2 (Distortions of generalization of tuples). Let t = (v_1, v_2, ..., v_m) be a tuple and t′ = (v′_1, v′_2, ..., v′_m) be a generalized tuple of t. Let level(v_j) be the domain level of v_j in its attribute hierarchy. The distortion of this generalization is defined as

(3) distortion(t, t′) = Σ_{j=1..m} WHD(level(v_j), level(v′_j))

Definition 3 (Distortions of generalization of tables). Let the view D′ be generalized from table D, let t_i be the ith tuple in D, and let t′_i be the ith tuple in D′. The distortion of this generalization is defined as

(4) distortion(D, D′) = Σ_{i=1..N} distortion(t_i, t′_i)

where N is the number of tuples in D.
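As a small worked illustration, assuming uniform weights (every w_{j,j−1} = 1): for a hierarchy of height h = 4, generalizing a cell by two levels gives

WHD = (1 + 1) / (1 + 1 + 1) = 2/3 ≈ 0.67.

A tuple in which only this cell is generalized contributes 0.67 to Equation (3), and Equation (4) simply accumulates such contributions over all tuples in the table.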

3.6.4. Differential entropy

Agrawal and Aggarwal ([1]) proposed a privacy measure based on the differential entropy of a random variable. The differential entropy h(A) of a random variable A with density f_A over domain Ω_A is defined as

(5) h(A) = −∫_{Ω_A} f_A(a) log₂ f_A(a) da
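The article's experiments report "information entropy" without specifying the estimator, so the snippet below is only an illustrative stand-in that computes the discrete Shannon entropy of a sensitive-attribute column; the sa_entropy helper is hypothetical, not from the paper.

```python
import math
from collections import Counter

def sa_entropy(values):
    """Discrete Shannon entropy (in bits) of a sensitive-attribute column."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Disease column of Table 2:
print(round(sa_entropy(["Flu", "HIV", "Malaria", "Typhoid", "HIV",
                        "HIV", "HIV", "FLU", "Diabetes", "HIV"]), 2))
```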

3.6.5. Trade-off metrics

The idea behind a trade-off metric is to consider both the privacy and information requirements at every anonymization operation and to determine an optimal trade-off between the two requirements. Fung et al. ([9]) proposed a search metric based on information/privacy trade-off. The metric is depicted below:

(6) Score(a) = IG(a) / (PL(a) + 1)

where a denotes the anonymization operation such as suppression, generalization, etc.; IG(a) and PL(a) denote information gain and privacy loss, respectively, consequent upon the anonymization operation.

The minimal distortion metric and loss metric were used to measure the suppression ratio and data utility. The trade-off metric was used to evaluate the information gain and the privacy loss. The weighted hierarchical distance metric was used to measure the distortion ratio of the release. These metrics were used because they can be adapted to measure cell suppression used with slicing in this work.

4. Results and discussion

4.1. Cell suppression

The measures of cell suppression shown in Figures 3 and 4, using MD and LM, show a sharp decline in suppressed HSA values as the privacy threshold is increased.

Graph: Figure 3. Cell suppression of all HSAs using MD.

Graph: Figure 4. Cell suppression of all HSAs using LM.

4.2. Suppression ratio

The suppression ratios of all HSAs with regard to the total number of HSAs, the number of records, and the number of entity cells are given in Figure 5. The graph clearly indicates that the suppression ratios of HSAs to the total number of HSAs, to the total records, and to all cells are insignificant, as shown by the superimposed lines of the graph.

Graph: Figure 5. Suppression ratio for all HSAs.

4.2.1. Utility gain

The utility gain with regard to suppressed and unsuppressed HSAs for each SA is shown in Figures 6, 7, 8, and 9. The utility gain increases as the gap widens between the suppressed and the unsuppressed HSAs for different privacy thresholds.

Graph: Figure 6. Utility gain of education HSA.

Graph: Figure 7. Utility gain of relationship HSA.

Graph: Figure 8. Utility gain of occupation HSA.

Graph: Figure 9. Utility gain of workplace HSA.

4.3. Percentage of suppressed to unsuppressed HSAs

The percentages of suppressed to unsuppressed HSAs across all buckets are displayed in Figure 10. The graph shows that this anonymization method reduces data degradation and improves data utility. The not-in-family HSA of the relationship sensitive attribute recorded only about 32% suppression to achieve privacy across all buckets at a confidence bounding parameter of 5%. With a higher parameter, the percentage of suppressed to unsuppressed HSAs declines sharply, according to their frequency distributions. It is worth noting that self-emp-not-inc, the HSA for workplace, is not suppressed at a confidence bounding parameter of 15% or above, because its frequency does not violate the privacy parameter. This reveals selective suppression based on whether an HSA's frequency violates the stated privacy parameter. The suppression percentages decline as the confidence bound increases, though to varying degrees across the HSAs. The unsuppressed HSAs improve the overall data utility and usefulness.

Graph: Figure 10. Percentage suppressed and unsuppressed HSAs.

4.4. Percentage of suppressed HSAs to total records

Figure 11 gives deeper insight into the number of HSA cells suppressed, under the specified privacy parameters, relative to the size of the entire dataset. The results show that the maximum cell suppression recorded for any single HSA is less than 8% of the entire data record. With a privacy confidence bounding parameter of 5%, the aggregate cell suppression for all the HSAs is less than 17%, and this percentage diminishes further if every entity in the dataset is taken as a cell. This observation reveals the gains of anonymizing multisensitive-attribute datasets with this approach, as it enhances data utility while preserving the privacy of the data owners.

Graph: Figure 11. Suppressed HSAs to total records in dataset.

4.5. Privacy protection against attack models

  • Linkage attacks such as record linkage and attribute linkage, if launched on the system, will not be successful. The entropy of suppressed HSAs, random permutation of the position of the QIDs, and the clustering of HSAs with a single SID will effectively protect the system from these attack models. Table linkage attack can be thwarted based partly on the size of the privacy threshold and partly due to the earlier protective approach which thwarts other linkages.
  • Probabilistic, skewness, and homogeneity attacks will be thwarted: each sliced bucket was created so that a sensitive attribute cannot be confidently inferred under the user's preferred privacy parameter on the HSA frequencies; observed data skewness in sliced buckets is reduced through bucket merging; and the size of the sliced buckets leads to a high entropy that reduces probabilistic inference. The system cannot thwart similarity attacks, owing to the choice of one HSA in each SA column.
  • Background knowledge attack and minimality attack will be effectively thwarted. The dynamic nature of the system with regard to the choice of HSAs, the confidence bounding privacy parameter, the size of the sliced buckets coupled with the uniform suppression symbol adopted for cell suppression of all HSAs that violated the privacy threshold will undoubtedly prevent these variants of attack.
  • Correspondence attack will be thwarted, for new updates are placed in new buckets subject to no violation of the privacy standard. Different releases more often than not will not have the same parameters.
5. Contribution

The contributions of this work include the introduction of a new approach to anonymizing MSAs through the combination of the LKC-privacy model, the slicing technique, and cell suppression; the enhancement of MSA anonymization through dynamic and web-based features; and the anonymization of MSAs with improved utility gain and reduced data degradation.

6. Recommendation

This research work proposed an approach for privacy-preserving publication of MSA data with improved data utility of the anonymized release. The following are recommended for future work.

  • Enhancement of the PPDP system that can combine both numerical and categorical multiple sensitive attributes
  • Web-based data publishing of MSA systems with secured access and authorization
  • Privacy-preserving data mining of datasets with MSAs
  • Necessity of standardized metrics for utility and privacy accuracy measurements
Footnotes

1. Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/uiss.

References

1. Agrawal, D., & Aggarwal, C. C. (2001). On the design and quantification of privacy preserving data mining algorithms. In Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (pp. 247–255). New York, NY: ACM.
2. Albouna, B., Clifton, C., & Malluhi, Q. (2015, March). Efficient sanitation of unsafe data correlations. In Workshop Proceedings of the EDBT/ICDT 2015 Joint Conference, Brussels, Belgium.
3. Bertino, E., Lin, D., & Jiang, W. (2006). A survey of quantification of privacy preserving data mining algorithms. In Privacy-preserving data mining (Advances in Database Systems, Vol. 34). Dordrecht, The Netherlands: Kluwer Academic.
4. Blake, C. L., & Merz, C. J. (2006). UCI repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Sciences.
5. Chen, R. (2012). Toward privacy in high-dimensional data publishing. Retrieved from http://www.spectrum.library.concordia.ca/974691/4/Chen%5fPhD%5fF2012.pdf
6. Das, D., & Bhattacharyyu, D. K. (2010). Decomposition+: Improving ℓ-diversity for multiple sensitive attributes. Retrieved from https://devayon.files.wordpress.com/2012/05/decomposition-plus-ddas-dkb.pdf
7. Dwork, C. (2006, July). Differential privacy. In Proceedings of the 33rd International Colloquium on Automata, Languages and Programming (ICALP), Venice, Italy.
8. Fung, B. C. M., Wang, K., Chen, R., & Yu, P. S. (2010). Privacy-preserving data publishing: A survey of recent developments. ACM Computing Surveys, 42(4), Article 14. doi:10.1145/1749603.1749605
9. Fung, B. C. M., Wang, K., Fu, A. W., & Yu, P. S. (2011). Introduction to privacy-preserving data publishing: Concepts and techniques. Boca Raton, FL: Chapman & Hall/CRC Press, Taylor & Francis Group.
10. Gal, T. S., Chen, Z., & Gangopadhyay, A. (2008). A privacy protection model for patient data with multiple sensitive attributes. International Journal of Information Security and Privacy, 2(3), 28–44. doi:10.4018/IJISP
11. Han, J., Luo, F., Lu, J., & Peng, H. (2013). SLOMS: A privacy preserving data publishing method for multiple sensitive attributes microdata. Journal of Software, 8(12). doi:10.4304/jsw.8.12.3096-3104
12. Li, J., Wong, R. C., Fu, A. W., & Pei, J. (2006). Achieving k-anonymity by clustering in attribute hierarchical structures. Data Warehousing and Knowledge Discovery, 4081, 405–416.
13. Li, N., Li, T., & Venkatasubramanian, S. (2007, April). t-Closeness: Privacy beyond k-anonymity and ℓ-diversity. In Proceedings of the IEEE 23rd International Conference on Data Engineering, Istanbul, Turkey.
14. Li, T., & Li, N. (2009). On the tradeoff between privacy and utility in data publishing. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) (pp. 517–526). New York, NY: ACM.
15. Li, T., Li, N., Zhang, J., & Molloy, I. (2009). Slicing: A new approach to privacy preserving data publishing. Retrieved from http://www.javab4u.com/project/basepapers/java/Slicing%20A%20New%20Approach%20to%20Privacy%20Preserving.pdf
16. Liu, J. (2010). Enhancing utility in privacy preserving data publishing (PhD dissertation). School of Computer Sciences, Simon Fraser University, Burnaby, Canada.
17. Liu, F., Jia, Y., & Han, W. (2012, October). A new K-anonymity algorithm towards multiple-sensitive attributes. In Proceedings of the 2012 IEEE 12th International Conference on Computer and Information Technology, Rome, Italy.
18. Liu, Q., Shen, H., & Sang, Y. (2015). Privacy-preserving data publishing for multiple numerical sensitive attributes. Tsinghua Science and Technology, 20(3), 246–254. doi:10.1109/TST.2015.7128936
19. Machanavajjhala, A., Kifer, D., & Gehrke, J. (2006). ℓ-Diversity: Privacy beyond k-anonymity. Retrieved from http://www.cse.psu.edu/~dkifer/papers/ldiversityTKDDdraft.pdf
20. Minelli, M., Chambers, M., & Dhiraj, A. (2013). Big data, big analytics: Emerging business intelligence and analytic trends for today's businesses. Hoboken, NJ: John Wiley & Sons.
21. Mogre, N. V., Agarwal, G., & Patil, P. (2013). Privacy preserving for high-dimensional data using anonymization technique. International Journal of Advanced Research in Computer Science and Software Engineering, 3(6). Retrieved from www.ijarcsse.com
22. Mohammed, N., Fung, B. C. M., Hung, P. C. K., & Lee, C. K. (2009). Anonymizing healthcare data: A case study on the blood transfusion service. In Proceedings of the ACM SIGKDD 2009 (pp. 1285–1293). New York, NY: ACM.
23. Mohan, A. K., Phanindra, M. A., & Prasad, M. K. (2012). Anonymization technique for data publishing using multiple sensitive attributes. IJCST, 3(4). Retrieved from www.ijcst.com
24. Onashoga, S. A., Bamiro, B. A., Akinwale, A. F., & Oguntuase, J. A. (2015). Privacy preserving data publishing of multiple sensitive attributes: A taxonomic review. Presented at the International Conference on Applied Information Technology, Federal University of Agriculture, Abeokuta, Ogun State, Nigeria.
25. Samarati, P., & Sweeney, L. (1998). Generalizing data to provide anonymity when disclosing information. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS '98), Seattle, WA.
26. Singh, A. P., & Parihar, D. (2013). A review of privacy preserving data publishing technique. International Journal of Emerging Research in Management & Technology, 2(6), 32–38.
27. Xu, Y., Ma, T., Tang, M., & Tian, W. (2014). A survey of privacy preserving data publishing using generalization and suppression. Applied Mathematics & Information Sciences, 8(3), 1103–1116. doi:10.12785/amis/080321
28. Ye, Y., Liu, Y., Lv, D., & Feng, J. (2009). Decomposition: Privacy preservation for multiple sensitive attributes. Retrieved from http://www.cs.columbia.edu/~yeyang/decompose.pdf

By S. A. Onashoga; B. A. Bamiro; A. T. Akinwale and J. A. Oguntuase


Prof. James Adedayo Oguntuase is a Professor of Mathematics in the Department of Mathematical Sciences, College of Physical Sciences, Federal University of Agriculture Abeokuta (FUNAAB), Nigeria. His research interests are Mathematical Analysis and the Theory of Inequalities. He has published widely in international journals.

Prof. Adio Taofiki Akinwale is a Professor of Computer Science in the Department of Computer Sciences, College of Physical Sciences, Federal University of Agriculture Abeokuta (FUNAAB), Nigeria. His research interests are in Database Management System and Query Algorithms Optimization. He has published widely in international journals.

Dr. (Mrs.) Saidat Adebukola Onashoga is an Associate Professor in the Department of Computer Sciences, College of Physical Sciences, Federal University of Agriculture Abeokuta (FUNAAB), Nigeria. Her research interests are Information Security, Data Mining, Information System, and currently in Privacy Preservation Algorithm in Mobile, Social and Cloud Computing.

Bamiro Bashiru Adejugba, M.Sc., is a research student in the Department of Computer Sciences, College of Physical Sciences, Federal University of Agriculture Abeokuta (FUNAAB), Nigeria. His current research interest is in Privacy Preserving Data Publishing.
