The proliferation of location-based service applications has led to a substantial surge in the amount of life trajectory data produced by mobile devices. And these data frequently contain confidential personal details. Simultaneously, the corresponding relatively lagging privacy protection technology and the improper trajectory data handling method will make tremendous problems with privacy breach. Therefore, this paper presents a trajectory privacy data publishing scheme, denoted as LORDP, which is based on local optimisation and R-tree. The proposed scheme aims to handle sensitive data while improves trajectory protection effectiveness. Firstly, the scheme combines the LKC-privacy model requirement to filter out the minimum violating sequences set, to reduce data sensitivity and the amount of injected noise. Secondly, R-tree is constructed based on trajectory similarity. Finally, Laplacian noise is added to the R-tree's leaf nodes constrained by differential privacy. The experiments show that the proposed LORDP algorithm significantly enhances the utility of data compared to other algorithms, and reduces the loss rate of about approximately 2% for per trajectory data, which shows that the present algorithm is extremely effective.
Keywords: The minimum violating sequence; local optimisation; R-tree; differential privacy
With the continuous upgrading of Internet technology, accurate location-based services are highly sought after. There are all types of mobile terminals that also generate a large amount of data information (Bozkurt et al., [
However, trajectory data that contains rich personal privacy is vulnerable to network hackers' attacks. The leakage of such sensitive information, such as home address, economic status, occupation and identity, etc. (Liu et al., [
Therefore, this paper proposes a trajectory privacy data publishing scheme based on local optimisation and R-tree for current technologies and requirements. The scheme consists of two parts. The first part, using local optimisation is to reduce the sensitivity value of the trajectory data to be published. And the second part is an R-tree combined with the current popular differential privacy technology. The advantages of this scheme are as follows.
- In order to trade-off data usability and privacy preservation, local optimisation is performed firstly before injecting noise, to reduce the sensitivity of the trajectory data.
- To make the local optimisation better and more realistic, this paper combines current urban social information as the reference range and the spatio-temporal characteristics of the trajectory for optimisation.
- The scheme builds an R-tree on basis of the similarity of the trajectory data and injects noise to its leaf nodes, to prevent inference attacks on non-sensitive information and reduce the risk of privacy leakage.
The various sections of this paper are organised as follows: Section 2 presents the related study on trajectory privacy protection. Section 3 proposes a trajectory privacy data publishing scheme based on local optimisation and R-tree. Section 4 mainly introduces the publishing scheme and algorithm analysis of trajectory data. Section 5 describes the results of the experimental evaluation. Finally, Section 6 summarises our work and gives future research interests.
Nowadays, the trajectory data privacy protection has become one of the widely researched academic subjects, such as domestic and international politics, economy and so on. As a result, privacy-preserving approaches have emerged based on various techniques, such as distortion, anonymisation and differential privacy techniques. Qiu et al. ([
In Addition, a series of specific models have emerged from the contemporary social background. In 2010, Mohammed et al. ([
For example, Wang et al. ([
Graph
Therefore, this paper proposed a differential privacy protection scheme based on local optimisation and R-trees. Combining the local community's location data sheet with advanced optimisation largely reduces the amount of injected noise and protects the usability of the data. Furthermore, the use of an R-tree indexing structure has the advantage of making differential privacy more resistant to background knowledge attacks.
(Trajectory T): Trajectory is a series of continuous points generated by mobile users in time sequence and space. The data sequence T is an ordered combination of data, which can be expressed as:
Graph
where |T| denotes the number of a trajectory data sequence, and p
Graph
Graph
(Trajectory dataset D): The trajectory dataset D is a collection of multiple trajectory data, denoted as:
Graph
where n denotes the total number of trajectories contained in the data set D, which can also be expressed as
Graph
Graph
Graph
(Location Data Sheet R): The Location Data Sheet is a collection of landmarks in the territory where the trajectory data is to be published, which contains the latitude and longitude of the building, social classification and special identification, etc.
(Inference attack on non-sensitive information): Inference attack on non-sensitive information is an attack in which an attacker infers sensitive information from non-sensitive information by multiple queries. As shown in Table 1, it is assumed that the attacker knows some users' partial information and trajectories, denoted as
Graph
Table 1. Partial trajectory data sheet R.
User Trajectory coordinate Road Network Info T1 1a, 3b, 4c, 5d, 6e mall, school, supermarket, library T1 1c, 2a, 3b, 5f, 6e, 7g school, gallery, library T3 2a, 4d, 5f, 6e, 7g, 9h restaurant, library, gymnasium T4 1b, 3b, 4f, 5d, 6c, 8h school, supermarket, gymnasium T5 2e, 3b, 4a, 5d, 6e, 8d school, mall, library, supermarket
(LKC-privacy model) (Mohammed et al., [
Graph
•
Graph
•
Graph
(Violation dataset Q): Suppose the length of the sequence P on the trajectory dataset satisfies
Graph
(ε-differential privacy) (Dwork, [
(
Graph
algorithm M is said to provide ε-differential privacy protection. Where
Graph
(Global sensitivity): For the adjacent data sets D
Graph
Graph
(
Graph
where R represents the real number domain of the mapping, and
Graph
Graph
Graph
(Laplace mechanism): Given a data set D and query function
Graph
(
Graph
then the randomised algorithm M satisfies
Graph
Graph
Graph
Graph
This section provides a detailed description of the proposed trajectory data publishing protection scheme in this paper. Where sub-section 4.1 presents an overview of the scheme. Sub-sections 4.2 and 4.3 elaborate on the specific algorithm design. Sub-sections 4.4 and 4.5 provide rigorous analyses of the privacy protection level and complexity of the proposed algorithm, respectively.
In this paper, we propose a trajectory privacy data publishing scheme based on local optimisation and R-tree as follows.
- Local optimised trajectory data
Firstly, the method constructs a replaceable location data sheet. Then the trajectory data, to be published and not satisfying the LKC-model constraints, are filtered and optimised. The process of optimisation is to replace sensitive trajectory data by combining spatio-temporal and the location data sheet to ensure data authenticity.
- Establishing R-tree and differential privacy data protection
The trajectory data is divided into trajectory segments according to the spatial location, and then the trajectory segments are inserted into the tree by combining the construction principle of R-tree. And finally the Laplace noise is added to the generated leaf nodes to protect the data.
To ensure the local optimisation more reasonable and effective, the optimised data was chosen from the location data sheet which was created by combining the local social information about the trajectory data. The optimisation method performs frequency statistics on the previously collected trajectory database. And then the suitable location data is filtered from the location data sheet for replacement. The construction process of the location data sheet is shown in Figure 1.
Graph: Figure 1. Location data sheet creation process.
First, the trajectory data to be published that does not meet the requirements of the LKC-privacy model are filtered out and then the order for optimisation is determined by the score of the scoring function Score (p).
Graph
where
Graph
Graph
Graph
Graph
Graph
As for local optimisation of the violated trajectory data, the spatio-temporal characteristics of the trajectory data are taken into account. Therefore, it is handled in two parts. When the trajectory data time span is large or the position change is small, the number of trajectory data needs to be partitioned and replaced. And when the trajectory data time span is small or the position is updated frequently, the timestamp of trajectory data needs to be partitioned and replaced gradually.
For example, a user's trajectory
Graph
Graph
Graph
Graph
Graph
Graph
Graph
Graph
It is simple to circle the range of replaceable positions on the extracted trajectory data
Graph
Graph
Graph: Figure 2. Filtering available locations.
When filtering optimal positions in the location data sheet, the most suitable points need to be selected for replacement based on the filtering results. The filtering calculation process involves the direction of the trajectory before optimisation, the distance and the classification of other information about the location data sheet. As denoted in Eq.(
Graph
where
Graph
Graph
Graph
Graph
Graph: Figure 3. The filtered trajectory path.
Algorithm 1 is the pseudocode for local optimisation. Line 1 is the location data sheet obtained by filtering the local social map. Lines 4–20 loop the whole data set D to get the violation dataset Q, where lines 6–12 judge whether the current trajectory segment is the violation trajectory sequence. Line 21 scores the violation trajectory sequence to determine the order of optimisation. Lines 22–27 optimise the trajectory data. Finally, line 28 returns the locally optimised trajectory data set New-D (Table 2).
Table 2. Pseudocode for local optimisation.
R-tree is a structure that serves as a multidimensional data storage space, and it can perform multiple simultaneous operations with high efficiency, such as additions and deletions. Currently, R-tree deals with coordinating in real map spaces or emerging virtual worlds on the web, such as constructing streets, drawing coastlines and buildings, etc. In addition, R-tree has high research value and significance in processing data operations with spatio-temporal logic. And it can preserve the logical structure of spatial location points, and has the function of counting queries.
First, when building the R-tree structure, trajectory is classified according to the user's information on trajectory dataset. Then the R-tree is constructed by comparing the similarity between mobile user trajectories one by one, i.e. considering the similarity percentage of trajectory sequences between different trajectories. The formula for calculating the similarity of trajectory sequences is as follows.
Graph
where
Graph
Graph
Graph
where
Graph
Graph
The steps for building the R-tree and adding noise are described as follows.
- Initialisation of R-tree structure: the data structure of R-tree is
Graph
Graph
Graph
Graph
Graph
Graph
- Node selection: When inserting new mobile user trajectory data, the trajectory data needs to be inserted into the appropriate node. Specifically, starting from the root node, select the appropriate node from top to bottom. If it is added to a leaf node, the node is returned and the insertion is completed. However, when the node is not a leaf node, it is necessary to compare the similarity degree of the trajectory sequence corresponding to the index in the node and select the larger branch for insertion. Then compare the Frechet distance of the trajectory when the similarity degree of the same trajectory sequence is 0 or the same, the node with smaller average Frechet distance is selected. Then repeat the above steps until the insertion into the node.
- Splitting of nodes: If the number of indexes of the node into which the new trajectory is inserted has reached M, this node will need to be split. Assuming their similarity is not zero between the sequence of traces to be inserted and the nodes to be split, and the node indexes with lower similarity are divided into different nodes. Then the other index records are also split under the corresponding nodes and updated upward to the corresponding parent nodes. But if the similarity is zero, the Frechet distance is calculated between different trajectories. The two trajectories with larger distance are divided into two nodes, and the other trajectories are also divided under the corresponding nodes and update their parents.
- Update parent node information: When inserting new mobile user trajectory data, the index record parent node needs to be updated in its corresponding. Assuming that the index number in the parent node has not yet reached the maximum value when a node is inserting new track data, the new index record is inserted directly and the corresponding parent node is updated. Conversely, the parent node of the split and the parent node of the new leaf node are also updated after the split. Repeat the above operation until the new track is completely inserted.
- Adding Laplace noise: Each insertion of new trajectory data forms a new R-tree and generates new leaf nodes, which are continuously refreshed. Finally, Laplace noise is injected into the leaf nodes. In addition, the privacy budget is allocated by dividing
Graph
Algorithm 2 is the algorithm for building R-tree and data perturbation. Lines 1–19 are to construct the R-tree based on the trajectory data. Where lines 2–12 are the process of selecting appropriate leaf nodes and lines 14–18 are the process of node splitting. Line 20 is to distribute the privacy budget equally according to the generated number of leaf nodes. Lines 21–27 are to determine whether the data are location information, other types of data information or injected noise respectively. Finally, line 28 returns the protected dataset
Graph
Table 3. Pseudocode for constructing R-tree and adding noise.
This section analyses the privacy protection degree of the LORDP algorithm. The privacy protection degree of the algorithm depends on the privacy budget. When the privacy budget is smaller, the privacy protection degree of the algorithm is higher, which shows a negative correlation. In this paper, a differential privacy-preserving algorithm is constructed through the data structure of the LORDP-tree, so we need to prove that the LORDP algorithm satisfies ε-differential privacy. Each layer of the LORDP-tree contains mobile users. The mobile users in the layers are disjoint, and the privacy budget allocated to nodes of each layer is
Graph
Graph
The algorithm Mi for adding noise at the i-th leaf node of the LORDP-tree satisfies ε-differential privacy. The proof is as follows.
Assuming that
Graph
Graph
Graph
where z ∈ Z, o ∈ O. Z obeys Laplace distribution, so
Graph
From the above proof process, we can see that algorithm Mi satisfies Theorem 7, so algorithm Mi satisfies ε-differential privacy. From the sequential combination feature of differential privacy, the algorithm Mi of each leaf node in the LORDP-tree satisfies
Graph
Graph
The algorithm consists of two parts in the paper, which optimise the trajectory data locally and protect the data with differential privacy.
In the first part, given the original trajectory dataset D, violating trajectory sequences are identified and optimised with the location data sheet A from the dataset D. The time complexity of the optimisation process is O(|D|·|p| + |Q|), where |D| is the total number of trajectories in the dataset D. |p| is the number of trajectory points in each trajectory. And |Q| denotes the number of violating trajectory data to be optimised. The purpose of this part is to pre-optimise the data, reduce its sensitivity, and improve data security.
In the second part, R-tree and differential techniques are combined. R-tree is built top-down based on trajectory similarity, and then privacy budget is assigned at the leaf nodes of R-tree. The time complexity is O(|T|·h), with |T| is the length of a trajectory data and h is the height of R-tree. The time complexity is consistent with the cost of constructing R-tree and is influenced by the number and length of trajectories.
Compared with other algorithms, the pre-optimised processing has proven to be highly effective in reducing the loss of data caused by the injecting of Laplace noise. By carefully considering the spatio-temporal characteristics of trajectory data, the optimisation process ensures that the resulting data is more accurate and representative of the real trajectory patterns. Furthermore, the algorithm leverages the advantages of differential privacy to enhance data security, providing additional protection against unwanted data disclosures. Overall, the approach offers a highly optimised and secure method for publishing trajectory data while preserving individual privacy.
This experiment is conducted in Python environment with Intel(R) Core(TM) i3-10110U CPU 2.59 GHz and 16.0 GB of RAM. In this chapter, the dataset used for the experiments is the Landmark real dataset, which is a landmark consisting of geographic coordinates of 48 large states in the US provided by the infochimps big data website with about 880k data points, where one kind of data is used as sensitive values.
Among the measures of trajectory data publishing, calculating data loss is one of the important reference indicators to measure the availability of trajectory data. The experimental part of this paper is divided into two parts. The first part is to measure the data availability by comparing the loss rate of the trajectory location points before and after optimisation. The loss rate of trajectory data points is calculated by the following formula.
Graph
where
Graph
Graph
The second part is to calculate the average error for the trajectory data before and after adding noise in the context of the same privacy budget allocation, which is used to measure the privacy data protection. Assuming that the query count is Q, the average error is calculated as follows.
Graph
where
Graph
Graph
To make the above measure more convincing, it is necessary to further measure the usability of the trajectory data before and after injecting noise through the loss of quality of the average data points. The formula for calculating the average data point quality loss is as follows.
Graph
where
Graph
Graph
To verify the effectiveness of trajectory privacy data publishing scheme based on local optimisation and R-trees (LORDP), this paper's experiments are mainly divided into two parts: before and after optimisation, and after adding noise.
In the part before and after optimisation, we compared with the TOS algorithm and LSUP algorithm proposed by Bai et al. ([
In the part after adding noise, we compared with the TLDP algorithm proposed by Zhao et al. and the LPPM algorithm proposed by Cunha et al. ([
The experimental procedure was as follows.
To verify the data availability improvement before and after the optimisation algorithm, the LORDP algorithm in this paper compared with the TOS algorithm in reference (Bai et al., [
Graph: Figure 4. Effect of different K values on the loss rate of trajectory data point.
Graph: Figure 5. Effect of different C values on the loss rate of trajectory data points.
From the experimental results of Figures 4 and 5, it can be seen that the loss rate of trajectory data points increases when the value of anonymity K increases gradually, i.e. the data loss rate decreases when the data to be optimised decreases. When the confidence threshold C is gradually increased, the trajectory data loss rate gradually decreases. Since the increase of confidence level reduces the number of trajectory data points to be optimised, the loss rated decreases. Combining the experimental results of the other two algorithms, it appears that the local optimisation algorithm in this paper, combined with the location data sheet, is able to consistently control the reduction in data points loss rate by approximately 2% to 5% for the same K or C values. This indicates that the local optimisation algorithm can effectively reduce data sensitivity and reduce data loss by a certain degree.
After a pre-optimised algorithm, global privacy protection measures are applied to the data. Therefore, the experiment compares the LORDP algorithm of this paper with the literature (Zhao et al., [
Graph: Figure 6. Comparison of the average error of different algorithms.
Graph: Figure 7. Comparison of the average data point quality loss results of different algorithms.
The experimental results of Figure 6 show that the average error of the data decreases gradually with the gradual increase of the privacy budget. The experimental results of Figure 7 show that the error results are averaged to each trajectory data, which can further illustrate that the average trajectory point loss rate of this paper's algorithm is smaller than other algorithms. The comparison of the other two algorithms shows that the LORDP algorithm can reduce the loss rate of 0.002% to 0.008% for each trajectory data onto the same privacy budget allocation, and can reduce the loss rate of up to 0.7% for the whole data set.
The sensitivities present in longer trajectory data may be more complex. Thus, the processing of local optimisation and R-tree construction is prone to injure of data availability. The better methods in the above two experiments is chosen to compare with the proposed algorithm. Thus, the experiment compares the LORDP algorithm with the TOS algorithm and TLDP algorithm. The privacy budget for this experiment is set to ε = 0.5 and ε = 1 respectively, and the experimental results are shown in Figures 8 and 9.
Graph: Figure 8. When ε = 0.5, effect of trajectory data length on different scenarios.
Graph: Figure 9. When ε = 1, effect of trajectory data length on different scenarios.
The experimental results show that the average error of data increases gradually with the trajectory length under the same privacy budget conditions. But the average error rates of the LORDP algorithm are all lower than those of the other algorithms. And the comparison between Figures 8 and 9 shows that privacy budget increases, while the average error of the trajectory data will be smaller and the privacy protection better. In particular, when the length of the trajectory data tends to 40–50, the protection effect is better compared to the other effects.
Therefore, the LORDP algorithm avoids adding too much Laplacian noise to the trajectory data after processing by local optimisation, which prevents data distortion as much as possible. And it is also more effective against reducing the mean error of the algorithm, i.e. improving the availability and safety of the trajectory data.
In this paper, we propose a trajectory privacy data publishing scheme based on local optimisation and R-tree. The scheme makes the release of raw trajectory data less likely to expose risks. It combines the latest local location data sheet for local optimisation and uses differential privacy technique to protect sensitive and non-sensitive information. Compared to other algorithms, the local optimisation of LORDP algorithm in this paper can better protect the availability, greatly improve data utilisation, and reduce the risk of privacy leakage. However, with the popularity of mobile devices and Internet of Things, the generation and collection of trajectory data are becoming more and more common, and the demand for real-time processing and analysis is also increasing. Therefore, considering real-time processing is very important in the research of trajectory data processing and analysis in the future. Additionally, it is also necessary to further improve and optimise the privacy budget allocation strategy based on the different characteristics and application scenarios of the data, achieving better trade-off between data utility and privacy protection.
No potential conflict of interest was reported by the author(s).
By Peiqian Liu; Duoduo Wu; Zihao Shen and Hui Wang
Reported by Author; Author; Author; Author