Trajectory privacy data publishing scheme based on local optimisation and R-tree

Liu, Peiqian ; Wu, Duoduo ; et al.

In: Connection Science, Jg. 35 (2023-12-01), Heft 1

Online academicJournal

Zugriff:

Volltext (PDF)

The proliferation of location-based service applications has led to a substantial surge in the amount of life trajectory data produced by mobile devices. And these data frequently contain confidential personal details. Simultaneously, the corresponding relatively lagging privacy protection technology and the improper trajectory data handling method will make tremendous problems with privacy breach. Therefore, this paper presents a trajectory privacy data publishing scheme, denoted as LORDP, which is based on local optimisation and R-tree. The proposed scheme aims to handle sensitive data while improves trajectory protection effectiveness. Firstly, the scheme combines the LKC-privacy model requirement to filter out the minimum violating sequences set, to reduce data sensitivity and the amount of injected noise. Secondly, R-tree is constructed based on trajectory similarity. Finally, Laplacian noise is added to the R-tree’s leaf nodes constrained by differential privacy. The experiments show that the proposed LORDP algorithm significantly enhances the utility of data compared to other algorithms, and reduces the loss rate of about approximately 2% for per trajectory data, which shows that the present algorithm is extremely effective.

Trajectory privacy data publishing scheme based on local optimisation and R-tree

The proliferation of location-based service applications has led to a substantial surge in the amount of life trajectory data produced by mobile devices. And these data frequently contain confidential personal details. Simultaneously, the corresponding relatively lagging privacy protection technology and the improper trajectory data handling method will make tremendous problems with privacy breach. Therefore, this paper presents a trajectory privacy data publishing scheme, denoted as LORDP, which is based on local optimisation and R-tree. The proposed scheme aims to handle sensitive data while improves trajectory protection effectiveness. Firstly, the scheme combines the LKC-privacy model requirement to filter out the minimum violating sequences set, to reduce data sensitivity and the amount of injected noise. Secondly, R-tree is constructed based on trajectory similarity. Finally, Laplacian noise is added to the R-tree's leaf nodes constrained by differential privacy. The experiments show that the proposed LORDP algorithm significantly enhances the utility of data compared to other algorithms, and reduces the loss rate of about approximately 2% for per trajectory data, which shows that the present algorithm is extremely effective.

Keywords: The minimum violating sequence; local optimisation; R-tree; differential privacy

1. Introduction

With the continuous upgrading of Internet technology, accurate location-based services are highly sought after. There are all types of mobile terminals that also generate a large amount of data information (Bozkurt et al., [2]; Wu et al., [19]). These data not only cover many fields, but also are of great scientific research value. For example, the analysis of trajectory data (a branch of many fields) is used to support the foundation for investigating population commuting and supply impacts (Hu et al., [7]; Niu et al., [13]; Tu et al., [17]). In addition, this kind of research can improve efficiency and quality of services (Li et al., [8]). What's more, in the context of today's complex environment, the legitimate collection and processing of trajectory data play a regulatory role in emergency management (Liu et al., [9]; Wang et al., [18]). In addition, for ordinary people, they hope that more convenient network services can be obtained. And when seeking web services, data uploading ensures that one's privacy and mobility are protected. For professionals such as social scientists, helping society towards a better direction is the main point of publishing trajectory data (Ghosh et al., [6]; Tedjopurnomo et al., [15]). For example, the planning of national roads, social services and the laying of transport facilities (Liu et al., [11]). Consequently, the processing of the data before publication requires a reasonable desensitisation process, and its result should protect the availability of trajectory data as soon as possible.

However, trajectory data that contains rich personal privacy is vulnerable to network hackers' attacks. The leakage of such sensitive information, such as home address, economic status, occupation and identity, etc. (Liu et al., [10] ; Zhang et al., [21], [22]). Therefore, in order to address the above issues, our proposed solution takes into full account the spatio-temporal characteristics of trajectory data and the environmental conditions of actual scenarios, while also offering appropriate protective measures for non-sensitive information to strike a balance between the usability and security of the trajectory data.

Therefore, this paper proposes a trajectory privacy data publishing scheme based on local optimisation and R-tree for current technologies and requirements. The scheme consists of two parts. The first part, using local optimisation is to reduce the sensitivity value of the trajectory data to be published. And the second part is an R-tree combined with the current popular differential privacy technology. The advantages of this scheme are as follows.

In order to trade-off data usability and privacy preservation, local optimisation is performed firstly before injecting noise, to reduce the sensitivity of the trajectory data.

To make the local optimisation better and more realistic, this paper combines current urban social information as the reference range and the spatio-temporal characteristics of the trajectory for optimisation.

The scheme builds an R-tree on basis of the similarity of the trajectory data and injects noise to its leaf nodes, to prevent inference attacks on non-sensitive information and reduce the risk of privacy leakage.

The various sections of this paper are organised as follows: Section 2 presents the related study on trajectory privacy protection. Section 3 proposes a trajectory privacy data publishing scheme based on local optimisation and R-tree. Section 4 mainly introduces the publishing scheme and algorithm analysis of trajectory data. Section 5 describes the results of the experimental evaluation. Finally, Section 6 summarises our work and gives future research interests.

2. Relevant work

Nowadays, the trajectory data privacy protection has become one of the widely researched academic subjects, such as domestic and international politics, economy and so on. As a result, privacy-preserving approaches have emerged based on various techniques, such as distortion, anonymisation and differential privacy techniques. Qiu et al. ([14]) proposed a novel approach to trajectory privacy protection against predictive attacks by generating indistinguishable perturbed locations. And these locations could replace a user's real location when submitted to an untrusted server. Zuo et al. ([24]) presented a spatio-temporal correlation location privacy protection method with semantic information. This method leverages users' historical semantic trajectories and location semantic information to generate artificial trajectories that comply with user behaviour patterns according to extracted temporal and probabilistic factors, thereby blurring the distinction between true and false trajectories. Cheng et al. ([3]) introduced an optimal differential privacy mechanism for personalised privacy trajectories. The method establishes a probabilistic mobility model for trajectories and proposes a privacy-level assignment method based on dwelling points and frequent sub-trajectories. These studies consider different aspects of data protection at a certain level.

In Addition, a series of specific models have emerged from the contemporary social background. In 2010, Mohammed et al. ([12]) proposed an anonymisation algorithm to implement the LKC-privacy model, which was initially applied to RFID data. The algorithm identifies the minimum violating series of all trajectory data and forms a set of violating sequences, which are the trajectory sequences that do not satisfy the LKC-privacy model. Then the set of violating sequences is globally suppressed to minimise the generation of larger violating sequences. In 2006, Dwork ([5]) proposed the differential privacy protection model which aimed to redefine privacy regarding the issue of database leakage. The model assumes that even with background knowledge of some data records in the database, an attacker cannot deduce the existence of a particular data record through analysis such as querying or statistics on the database information. The technique of differential privacy also has strict and standardised mathematical theoretical proofs and evaluation criteria. In follow-up studies, the LKC-privacy model and differential privacy technique are also gradually applied to trajectory data research.

For example, Wang et al. ([18]) introduced a novel privacy-preserving framework that combines a mixer with differential privacy to enhance data availability by

Graph

$O (\sqrt{n})$ times for localised differential privacy model. And it is close to centralised differential privacy but without relying on the trusted third party. Nevertheless, the framework has some limitations in terms of its application scenarios due to its lack of comprehensive considerations. Zhao et al. ([23]) proposed a differential privacy trajectory data protection method based on prefix trees. The PDML method is proposed on the basis of the DML principle. It combines Dijkstra's algorithm and Markov chains to segment and protect trajectory segments respectively, and finally uses differential privacy techniques to ensure user data security. Although this method can resist certain specific attacks, it has a certain cost in terms of time and space because it uses the Dijkstra algorithm and Markov chain. Yuan et al. ([20]) proposed a differential privacy trajectory data protection scheme based on R-trees, which combines differential privacy technology and DPTS trees on the basis of the R-trees index structure. However, this method relies on the similarity of trajectory data. If there are many sensitive points, the construction efficiency of the DPTS tree will be reduced, and the level of noise added will reduce the data usability. Therefore, further optimisation of the solution is needed to improve efficiency and data utility.

Therefore, this paper proposed a differential privacy protection scheme based on local optimisation and R-trees. Combining the local community's location data sheet with advanced optimisation largely reduces the amount of injected noise and protects the usability of the data. Furthermore, the use of an R-tree indexing structure has the advantage of making differential privacy more resistant to background knowledge attacks.

3. Problem statement

Definition 3.1

(Trajectory T): Trajectory is a series of continuous points generated by mobile users in time sequence and space. The data sequence T is an ordered combination of data, which can be expressed as:

Graph

$T = p_{1} \to p_{2} \to p_{3} \to p_{4} \to \dots \to p_{n}$

where |T| denotes the number of a trajectory data sequence, and pi is composed of a two-dimensional position coordinate tri=(xi, yi) and a time sequence

Graph

$t_{i}$ , which is denoted as

Graph

$p_{i} = (t_{i}, t r_{i})$ . Obviously, trajectory data has spatio-temporal features. Therefore, when dealing with the trajectory data of mobile users, it is necessary to consider not only the spatial transformation, but also the inherent logical relationship of the location.

Definition 3.2

(Trajectory dataset D): The trajectory dataset D is a collection of multiple trajectory data, denoted as:

Graph

$D = (T_{1}, T_{2}, T_{3}, \dots, T_{n}) : (o t h_{1}, o t h_{2}, o t h_{3}, \dots, o t h_{n})$

where n denotes the total number of trajectories contained in the data set D, which can also be expressed as

Graph

$| D | = n$ .

Graph

$T_{i}$ denotes the trajectory data of the i-th mobile user, and

Graph

$o t h_{i}$ denotes the personal information corresponding to the same user.

Definition 3.3

(Location Data Sheet R): The Location Data Sheet is a collection of landmarks in the territory where the trajectory data is to be published, which contains the latitude and longitude of the building, social classification and special identification, etc.

Definition 3.4

(Inference attack on non-sensitive information): Inference attack on non-sensitive information is an attack in which an attacker infers sensitive information from non-sensitive information by multiple queries. As shown in Table 1, it is assumed that the attacker knows some users' partial information and trajectories, denoted as

Graph

$a . D' = (T_{1}^{'}, T_{2}^{'}, \dots T_{k}^{'}) : (o t h_{1}^{'}, o t h_{2}^{'}, ... o t h_{k}^{'}), k < n$ . Where a.D' means that the attacker knows partial background knowledge. For example, user 1 and user 2 are classmates, and there are trajectories of school (3b) and library (6e) in their trajectory data. Therefore, from the same timestamp, we can see that they are most likely classmates with user 5, which can expose the trajectories of user 5 completely.

Table 1. Partial trajectory data sheet R.

User	Trajectory coordinate	Road Network Info
T1	1a, 3b, 4c, 5d, 6e	mall, school, supermarket, library
T1	1c, 2a, 3b, 5f, 6e, 7g	school, gallery, library
T3	2a, 4d, 5f, 6e, 7g, 9h	restaurant, library, gymnasium
T4	1b, 3b, 4f, 5d, 6c, 8h	school, supermarket, gymnasium
T5	2e, 3b, 4a, 5d, 6e, 8d	school, mall, library, supermarket

Definition 3.5

(LKC-privacy model) (Mohammed et al., [12]): Suppose L is the maximum length of the trajectory data in the possession of the attacker, and T denotes the set of all trajectory data. S is a sensitive attribute in the trajectory data, and P is an arbitrary subsequence of T. The condition that made T satisfy the model of LKC-privacy is that when and only when, any subsequence p in T satisfies the following conditions with

Graph

$0 < | P | \leq L$ .

•

Graph

$| T (P) | \geq K$ , T(P) is the user including p in trajectory.

•

Graph

$C o n f (s | T (P)) \leq C, C o n f (s | T (P)) = | T (P \cup s) | / | T (P) |, 0 \leq C \leq 1, s \in S$ , Conf is the abbreviation of confidence. C is the confidence threshold of the anonymous set, which can flexibly adjust the degree of anonymity according to the demand.

Definition 3.6

(Violation dataset Q): Suppose the length of the sequence P on the trajectory dataset satisfies

Graph

$0 < | P | \leq L$ . If the sequence P does not satisfy any condition defined by the LKC-privacy model, it is called a violation sequence and its set is defined as the violation dataset Q.

Definition 3.7

(ε-differential privacy) (Dwork, [5]): Given a randomised algorithm M, let Y be the set of all output results of M, O be a subset of the set Y. For adjacent data sets D1 and D2 that differ by at most one record, if algorithm M satisfies Equation (1),

(1)

Graph

$\begin{aligned} \frac{Pr (M (D_{1}) = 0)}{Pr (M (D_{2}) = 0)} \leq e^{ϵ} \end{aligned}$

algorithm M is said to provide ε-differential privacy protection. Where

Graph

$ϵ$ denotes the privacy budget and Pr (M(Di)=O) denotes the probability of the algorithm, as determined by the algorithm M.

Definition 3.8

(Global sensitivity): For the adjacent data sets D1 and D2, given any query function f, the global sensitivity

Graph

$f : D \to R^{d}$ , the global sensitivity of

Graph

$f$ is:

(2)

Graph

$\begin{aligned} Δ f = max_{D_{1}, D_{2}} | | f (D_{1}) - f (D_{2}) | |_{1} \end{aligned}$

where R represents the real number domain of the mapping, and

Graph

$| | \cdot | |_{1}$ represents the 1-order distance between

Graph

$f (D_{1})$ and

Graph

$f (D_{2})$ .

Definition 3.9

(Laplace mechanism): Given a data set D and query function

Graph

$f : D \to R^{d}$ , if the output result of algorithm M is:

(3)

Graph

$\begin{aligned} M (D) = f (D) + Z \end{aligned}$

then the randomised algorithm M satisfies

Graph

$ϵ$ -differential privacy, where noise

Graph

$Z \to L a p (Δ f / ϵ)$ . The Laplace mechanism adds noise obeying the Laplace distribution into the query result of the randomised function, eliminating the influence of individual records on the query result, and the amount of noise is proportional to

Graph

$Δ f$ and inversely proportional to

Graph

$ϵ$ .

4. Proposed approach

This section provides a detailed description of the proposed trajectory data publishing protection scheme in this paper. Where sub-section 4.1 presents an overview of the scheme. Sub-sections 4.2 and 4.3 elaborate on the specific algorithm design. Sub-sections 4.4 and 4.5 provide rigorous analyses of the privacy protection level and complexity of the proposed algorithm, respectively.

4.1. The basic idea of the algorithm

In this paper, we propose a trajectory privacy data publishing scheme based on local optimisation and R-tree as follows.

Local optimised trajectory data

Firstly, the method constructs a replaceable location data sheet. Then the trajectory data, to be published and not satisfying the LKC-model constraints, are filtered and optimised. The process of optimisation is to replace sensitive trajectory data by combining spatio-temporal and the location data sheet to ensure data authenticity.

Establishing R-tree and differential privacy data protection

The trajectory data is divided into trajectory segments according to the spatial location, and then the trajectory segments are inserted into the tree by combining the construction principle of R-tree. And finally the Laplace noise is added to the generated leaf nodes to protect the data.

4.2. Locally optimised trajectory data (LOTD)

To ensure the local optimisation more reasonable and effective, the optimised data was chosen from the location data sheet which was created by combining the local social information about the trajectory data. The optimisation method performs frequency statistics on the previously collected trajectory database. And then the suitable location data is filtered from the location data sheet for replacement. The construction process of the location data sheet is shown in Figure 1.

Graph: Figure 1. Location data sheet creation process.

First, the trajectory data to be published that does not meet the requirements of the LKC-privacy model are filtered out and then the order for optimisation is determined by the score of the scoring function Score (p).

Graph

$\begin{aligned} S c o r e (p) = \frac{O p t i m i s e (p)}{L o s s (p)} \end{aligned}$ (4)

where

Graph

$p$ denotes the violation trajectory sequence that needs to be optimised.

Graph

$O p t i m i s e (p)$ denotes the violating trajectory sequence number that can be eliminated by optimising

Graph

$p$ . And

Graph

$L o s s (p)$ denotes the usability loss due to optimising

Graph

$p$ .

As for local optimisation of the violated trajectory data, the spatio-temporal characteristics of the trajectory data are taken into account. Therefore, it is handled in two parts. When the trajectory data time span is large or the position change is small, the number of trajectory data needs to be partitioned and replaced. And when the trajectory data time span is small or the position is updated frequently, the timestamp of trajectory data needs to be partitioned and replaced gradually.

For example, a user's trajectory

Graph

$T = (p_{1}, p_{2}, \dots, p_{i}, \dots, p_{j}, \dots, p_{n})$ needs to be optimised by some of his partially violating trajectory

Graph

$P = (p_{i + 1}, p_{j - 1})$ , i.e. the judgment criteria are

Graph

$j - i$ and

Graph

$t_{j} - t_{i}$ respectively. First, the coordinates of the two trajectories before and after the violated sequence are extracted from the trajectory data set D, the coordinates of

Graph

$p_{i} = (t_{i}, x_{i}, y_{i})$ and

Graph

$p_{j} = (t_{j}, x_{j}, y_{j})$ respectively. And set the time threshold E and the trajectory number thresholds V. When

Graph

$t_{j} - t_{i} > E$ or

Graph

$j - i > V$ , the trajectory data is divided into two sub-trajectories equally according to the time span or trajectory number, until the conditional judgment is lower than the set threshold. Then the next step is to optimise data.

It is simple to circle the range of replaceable positions on the extracted trajectory data

Graph

$p_{i}$ and

Graph

$p_{j}$ as the diameter, as shown in Figure 2. The location points of the dashed circle can be used as alternative optimisation points.

Graph: Figure 2. Filtering available locations.

When filtering optimal positions in the location data sheet, the most suitable points need to be selected for replacement based on the filtering results. The filtering calculation process involves the direction of the trajectory before optimisation, the distance and the classification of other information about the location data sheet. As denoted in Eq.(5).

Graph

$\begin{aligned} ω = ω_{1} \cdot α + ω_{2} \cdot d + ω_{3} \cdot o t h \end{aligned}$ (5)

where

Graph

$ω$ is the weight value assigned to each data information, and

Graph

$α$ is the difference of direction angle between the old and new position points, and

Graph

$d$ is the difference of distance from the new position point to the previous trajectory point, and

Graph

$o t h$ is the comparison difference between the old and new trajectory data. The new trajectory path is illustrated in Figure 3.

Graph: Figure 3. The filtered trajectory path.

Algorithm 1 is the pseudocode for local optimisation. Line 1 is the location data sheet obtained by filtering the local social map. Lines 4–20 loop the whole data set D to get the violation dataset Q, where lines 6–12 judge whether the current trajectory segment is the violation trajectory sequence. Line 21 scores the violation trajectory sequence to determine the order of optimisation. Lines 22–27 optimise the trajectory data. Finally, line 28 returns the locally optimised trajectory data set New-D (Table 2).

Table 2. Pseudocode for local optimisation.

Input	Location data sheet R, trajectory data set D, upper limit of trajectory length in the possession of the attacker L, anonymity number K, confidence threshold C, time threshold E, trajectory number threshold V
Output	Violation dataset Q, locally optimised trajectory dataset New-D
1	R←Location data sheet of road network diagram;
2	X₁←set of all distinct pairs in D;
3	i = 1,j = 0;
4	While i ≤ L and X_i ≠φ do
5	Scan D to compute \|T(P)\| and Conf(s\|T(P));
6	for $\forall$ p ∈P where \|T (P)\| > 0 do
7	if \|T(P)\| < K or Conf(s\|T(P)) > C then
8	Add P to Q_i;
9	else
10	Add P to W_i;
11	end if
12	end for
13	X_{i + 1} ← W_i $⋈$ W_i;
14	for $\forall$ P ∈ X_{i + 1} do
15	if P is a super sequence of any v ∈ Q_i then
16	Remove P from X_i+ 1;
17	end if
18	end for
19	i++;
20	end while
21	Build the score sheet by Score(Q) = Optimise(p_j) /Loss(p_j);
22	Q←Choose the highest point;
23	If Q(t) > E or Q(tr) > V do
24	Q←Split Q into sub tracks;
25	else
26	Q'←Replace the position of Q with Location data sheet;
27	end if
28	Return New-D←D $⋈$ Q'.

4.3. Establishing R-tree and differential privacy protection (RDP)

R-tree is a structure that serves as a multidimensional data storage space, and it can perform multiple simultaneous operations with high efficiency, such as additions and deletions. Currently, R-tree deals with coordinating in real map spaces or emerging virtual worlds on the web, such as constructing streets, drawing coastlines and buildings, etc. In addition, R-tree has high research value and significance in processing data operations with spatio-temporal logic. And it can preserve the logical structure of spatial location points, and has the function of counting queries.

First, when building the R-tree structure, trajectory is classified according to the user's information on trajectory dataset. Then the R-tree is constructed by comparing the similarity between mobile user trajectories one by one, i.e. considering the similarity percentage of trajectory sequences between different trajectories. The formula for calculating the similarity of trajectory sequences is as follows.

Graph

$\begin{aligned} Λ (T_{i}, T_{j}) = \frac{T_{i} (p) \cap T_{j} (p)}{T_{i} (p) \cup T_{j} (p)} \end{aligned}$ (6)

where

Graph

$T_{i}$ and

Graph

$T_{j}$ denote the trajectory data of the i-th and j-th users respectively, and p = (p1, p2, p3, ... , pn) represents the trajectory sequence. Also, the Frechet distance between different trajectories is considered to be calculated. The closer trajectories in terms of Frechet distance is partitioned into the same area. Frechet distance is calculated as follows.

Graph

$\begin{aligned} F (T_{i}, T_{j}) = \underset{α, β \in [0, 1]}{inf max} {d (T_{i} (α (t)), T_{j} (β (t)))} \end{aligned}$ (7)

where

Graph

$α$ and

Graph

$β$ denote the reparameterisation function for the unit interval, t denotes time, and d denotes the Euclidean distance.

The steps for building the R-tree and adding noise are described as follows.

Initialisation of R-tree structure: the data structure of R-tree is

Graph

${T_{i} (p), o t h_{i}, c (T_{i}), c (o t h_{i})}$ , where

Graph

$T_{i} (p)$ is the sequence of trajectories.

Graph

$o t h_{i}$ represents other information.

Graph

$c (T_{i})$ and

Graph

$c (o t h_{i})$ are the number of all mobile users and other information on the corresponding sequence of trajectories. If the node is not the root node, the number of indexes is between m and M,

Graph

$m \leq M / 2$ . Then the R-tree is constructed by randomly selecting the trajectory and calling the corresponding inserted function.

Node selection: When inserting new mobile user trajectory data, the trajectory data needs to be inserted into the appropriate node. Specifically, starting from the root node, select the appropriate node from top to bottom. If it is added to a leaf node, the node is returned and the insertion is completed. However, when the node is not a leaf node, it is necessary to compare the similarity degree of the trajectory sequence corresponding to the index in the node and select the larger branch for insertion. Then compare the Frechet distance of the trajectory when the similarity degree of the same trajectory sequence is 0 or the same, the node with smaller average Frechet distance is selected. Then repeat the above steps until the insertion into the node.

Splitting of nodes: If the number of indexes of the node into which the new trajectory is inserted has reached M, this node will need to be split. Assuming their similarity is not zero between the sequence of traces to be inserted and the nodes to be split, and the node indexes with lower similarity are divided into different nodes. Then the other index records are also split under the corresponding nodes and updated upward to the corresponding parent nodes. But if the similarity is zero, the Frechet distance is calculated between different trajectories. The two trajectories with larger distance are divided into two nodes, and the other trajectories are also divided under the corresponding nodes and update their parents.

Update parent node information: When inserting new mobile user trajectory data, the index record parent node needs to be updated in its corresponding. Assuming that the index number in the parent node has not yet reached the maximum value when a node is inserting new track data, the new index record is inserted directly and the corresponding parent node is updated. Conversely, the parent node of the split and the parent node of the new leaf node are also updated after the split. Repeat the above operation until the new track is completely inserted.

Adding Laplace noise: Each insertion of new trajectory data forms a new R-tree and generates new leaf nodes, which are continuously refreshed. Finally, Laplace noise is injected into the leaf nodes. In addition, the privacy budget is allocated by dividing

Graph

$ϵ$ equally according to the number of leaf nodes, and the allocation is performed by injecting noise to the trajectory information and other information separately.

Algorithm 2 is the algorithm for building R-tree and data perturbation. Lines 1–19 are to construct the R-tree based on the trajectory data. Where lines 2–12 are the process of selecting appropriate leaf nodes and lines 14–18 are the process of node splitting. Line 20 is to distribute the privacy budget equally according to the generated number of leaf nodes. Lines 21–27 are to determine whether the data are location information, other types of data information or injected noise respectively. Finally, line 28 returns the protected dataset

Graph

$D^{'}$ (Table 3).

Table 3. Pseudocode for constructing R-tree and adding noise.

Input	Locally optimised trajectory dataset New-D, Privacy Budget ε
Output	Protected Dataset D′
1	Suppose the root node is r; the node to be inserted into the trajectory is v;
2	If v is a leaf node
3	Return v;
4	Else if v is not a leaf node
5	Calculate the trajectory sequence similarity according to formula(6);
6	m←Node with high similarity;
7	If the similarity of the trajectory sequence is the same or 0
8	Calculate the average Fréchet distance according to formula (7);
9	m←Node with small average Fréchet distance;
10	End if
11	v←m;
12	End if
13	P←v;
14	If the number of index records of P reaches the maximum
15	Split node;
16	Update parent node information;
17	End if
18	Update parent node information;
19	Return R-Tree(D);
20	ε'=ε/\|P\|;
21	For i = 1 to p do
22	If data type is location data
23	T′(P)←T(P) + Noisy(ε'/2);
24	Else
25	c′(oth)←c(oth) + Noisy(ε'/2);
26	End if
27	End for
28	Return _D'

4.4. Privacy protection degree analysis

This section analyses the privacy protection degree of the LORDP algorithm. The privacy protection degree of the algorithm depends on the privacy budget. When the privacy budget is smaller, the privacy protection degree of the algorithm is higher, which shows a negative correlation. In this paper, a differential privacy-preserving algorithm is constructed through the data structure of the LORDP-tree, so we need to prove that the LORDP algorithm satisfies ε-differential privacy. Each layer of the LORDP-tree contains mobile users. The mobile users in the layers are disjoint, and the privacy budget allocated to nodes of each layer is

Graph

$ϵ^{'} = ϵ / p$ . The random noisy Z injected into the original data obeys the Laplace distribution of Theorem 9:

Graph

$Z \sim L a p l a c e (\frac{Δ f}{ϵ})$

The algorithm Mi for adding noise at the i-th leaf node of the LORDP-tree satisfies ε-differential privacy. The proof is as follows.

Assuming that

Graph

$p_{1}$ is the probability density function of the original data set D, and

Graph

$p_{2}$ is the probability density function of random noise Z, then

Graph

$\frac{Pr [M (D) \in O]}{Pr [M (D^{'}) \in O]} = \frac{Pr [d + z = o]}{Pr [d^{'} + z = o]} = \frac{p_{2} [o - p_{1} (d)]}{p_{2} [o - p_{1} (d^{'})]}$

where z ∈ Z, o ∈ O. Z obeys Laplace distribution, so

Graph

$\frac{p_{2} [o - p_{1} (d)]}{p_{2} [o - p_{1} (d^{'})]} \leq e^{\frac{ϵ}{f} | d - d^{'} |} = e^{\frac{ϵ}{f} \cdot \frac{f}{p}} = e^{\bar{ϵ}}$

From the above proof process, we can see that algorithm Mi satisfies Theorem 7, so algorithm Mi satisfies ε-differential privacy. From the sequential combination feature of differential privacy, the algorithm Mi of each leaf node in the LORDP-tree satisfies

Graph

$\bar{ϵ}$ -differential privacy, and the privacy budget of all nodes satisfies

Graph

$ϵ = \sum_{i = 1}^{p} \bar{ϵ} = \sum_{i = 1}^{p} \frac{ϵ}{p}$ , so the constructed LORDP-tree satisfies ε-differential privacy.

4.5. Complexity analysis

The algorithm consists of two parts in the paper, which optimise the trajectory data locally and protect the data with differential privacy.

In the first part, given the original trajectory dataset D, violating trajectory sequences are identified and optimised with the location data sheet A from the dataset D. The time complexity of the optimisation process is O(|D|·|p| + |Q|), where |D| is the total number of trajectories in the dataset D. |p| is the number of trajectory points in each trajectory. And |Q| denotes the number of violating trajectory data to be optimised. The purpose of this part is to pre-optimise the data, reduce its sensitivity, and improve data security.

In the second part, R-tree and differential techniques are combined. R-tree is built top-down based on trajectory similarity, and then privacy budget is assigned at the leaf nodes of R-tree. The time complexity is O(|T|·h), with |T| is the length of a trajectory data and h is the height of R-tree. The time complexity is consistent with the cost of constructing R-tree and is influenced by the number and length of trajectories.

Compared with other algorithms, the pre-optimised processing has proven to be highly effective in reducing the loss of data caused by the injecting of Laplace noise. By carefully considering the spatio-temporal characteristics of trajectory data, the optimisation process ensures that the resulting data is more accurate and representative of the real trajectory patterns. Furthermore, the algorithm leverages the advantages of differential privacy to enhance data security, providing additional protection against unwanted data disclosures. Overall, the approach offers a highly optimised and secure method for publishing trajectory data while preserving individual privacy.

5. Experiment

5.1. Experiment data and environment

This experiment is conducted in Python environment with Intel(R) Core(TM) i3-10110U CPU 2.59 GHz and 16.0 GB of RAM. In this chapter, the dataset used for the experiments is the Landmark real dataset, which is a landmark consisting of geographic coordinates of 48 large states in the US provided by the infochimps big data website with about 880k data points, where one kind of data is used as sensitive values.

5.2. Measurement criteria

Among the measures of trajectory data publishing, calculating data loss is one of the important reference indicators to measure the availability of trajectory data. The experimental part of this paper is divided into two parts. The first part is to measure the data availability by comparing the loss rate of the trajectory location points before and after optimisation. The loss rate of trajectory data points is calculated by the following formula.

Graph

$\begin{aligned} P L o s s = \frac{P (T) \cap P (T^{'})}{P (T)} \end{aligned}$ (8)

where

Graph

$P (T)$ is the number of location points of the trajectory data that were optimised in the original trajectory dataset, and

Graph

$P (T^{'})$ is the number of location points that were altered after local optimisation.

The second part is to calculate the average error for the trajectory data before and after adding noise in the context of the same privacy budget allocation, which is used to measure the privacy data protection. Assuming that the query count is Q, the average error is calculated as follows.

Graph

$\begin{aligned} A v e r a g e E r r o r = \frac{\sum_{i = 1}^{n} | Q (T_{i}) - Q (T_{i}^{'}) |}{n} \end{aligned}$ (9)

where

Graph

$| . |$ denotes the absolute error generated between the optimised and protected trajectories after query algorithm Q.

Graph

$n$ denotes the total number of trajectories contained in the database.

To make the above measure more convincing, it is necessary to further measure the usability of the trajectory data before and after injecting noise through the loss of quality of the average data points. The formula for calculating the average data point quality loss is as follows.

Graph

$\begin{aligned} A v e r a g e P L o s s = \frac{\sum_{i = 1}^{n} | D (P) - D^{'} (P) |}{n} \end{aligned}$ (10)

where

Graph

$| . |$ denotes the absolute error between the data sets before and after injecting noise.

Graph

$n$ denotes the total number of trajectories contained in the database.

5.3. Results of the experiment

To verify the effectiveness of trajectory privacy data publishing scheme based on local optimisation and R-trees (LORDP), this paper's experiments are mainly divided into two parts: before and after optimisation, and after adding noise.

In the part before and after optimisation, we compared with the TOS algorithm and LSUP algorithm proposed by Bai et al. ([1]) and Terrovitis et al. ([16]), respectively. The former reduces sensitivity by suppressing local trajectory data, and the latter prevents privacy leakage by suppressing and segmenting trajectories. However, these two algorithms ignore the spatial characteristics between trajectory data, resulting in extreme loss of data utility. In contrast, the LORDP algorithm considers the spatio-temporal relationships between trajectory data, thereby better protecting privacy while maintaining data usability.

In the part after adding noise, we compared with the TLDP algorithm proposed by Zhao et al. and the LPPM algorithm proposed by Cunha et al. ([4]). These two schemes use prefix trees and clustering geo-indistinguishability to achieve differential privacy. Although they can reduce the risk of specific attacks, adding a large amount of noise to handle sensitive locations will reduce data usability. In brief, the LORDP algorithm based on local optimisation and R-tree can better balance data usability and security by masking and protecting sensitive trajectory data.

The experimental procedure was as follows.

5.3.1. Effect of different k and c values on data

To verify the data availability improvement before and after the optimisation algorithm, the LORDP algorithm in this paper compared with the TOS algorithm in reference (Bai et al., [1]) and the LSUP algorithm in literature (Terrovitis et al., [16]) that also include the step of data pre-processing. Using the control variables method, a single variable was changed for the study. Figure 4 shows the scale of the controlled anonymity k value from 20 to 45. And Figure 5 shows the size of the varying confidence threshold C from 0.2 to 1.0.

Graph: Figure 4. Effect of different K values on the loss rate of trajectory data point.

Graph: Figure 5. Effect of different C values on the loss rate of trajectory data points.

From the experimental results of Figures 4 and 5, it can be seen that the loss rate of trajectory data points increases when the value of anonymity K increases gradually, i.e. the data loss rate decreases when the data to be optimised decreases. When the confidence threshold C is gradually increased, the trajectory data loss rate gradually decreases. Since the increase of confidence level reduces the number of trajectory data points to be optimised, the loss rated decreases. Combining the experimental results of the other two algorithms, it appears that the local optimisation algorithm in this paper, combined with the location data sheet, is able to consistently control the reduction in data points loss rate by approximately 2% to 5% for the same K or C values. This indicates that the local optimisation algorithm can effectively reduce data sensitivity and reduce data loss by a certain degree.

5.3.2. Effect of adding noise on data

After a pre-optimised algorithm, global privacy protection measures are applied to the data. Therefore, the experiment compares the LORDP algorithm of this paper with the literature (Zhao et al., [23]) and reference (Cunha et al., [4], October). The comparison principle is that the average error and average loss results of different algorithms are compared for different privacy budget scenarios. Figure 6 shows a range of privacy budgets from 0.1 to 0.8 selected for the experiments. Figure 7 refines the privacy budget to each data point on average, selecting a range of privacy budgets from 0.11 to 0.18.

Graph: Figure 6. Comparison of the average error of different algorithms.

Graph: Figure 7. Comparison of the average data point quality loss results of different algorithms.

The experimental results of Figure 6 show that the average error of the data decreases gradually with the gradual increase of the privacy budget. The experimental results of Figure 7 show that the error results are averaged to each trajectory data, which can further illustrate that the average trajectory point loss rate of this paper's algorithm is smaller than other algorithms. The comparison of the other two algorithms shows that the LORDP algorithm can reduce the loss rate of 0.002% to 0.008% for each trajectory data onto the same privacy budget allocation, and can reduce the loss rate of up to 0.7% for the whole data set.

5.3.3. Effect of the data set length on the overall scheme

The sensitivities present in longer trajectory data may be more complex. Thus, the processing of local optimisation and R-tree construction is prone to injure of data availability. The better methods in the above two experiments is chosen to compare with the proposed algorithm. Thus, the experiment compares the LORDP algorithm with the TOS algorithm and TLDP algorithm. The privacy budget for this experiment is set to ε = 0.5 and ε = 1 respectively, and the experimental results are shown in Figures 8 and 9.

Graph: Figure 8. When ε = 0.5, effect of trajectory data length on different scenarios.

Graph: Figure 9. When ε = 1, effect of trajectory data length on different scenarios.

The experimental results show that the average error of data increases gradually with the trajectory length under the same privacy budget conditions. But the average error rates of the LORDP algorithm are all lower than those of the other algorithms. And the comparison between Figures 8 and 9 shows that privacy budget increases, while the average error of the trajectory data will be smaller and the privacy protection better. In particular, when the length of the trajectory data tends to 40–50, the protection effect is better compared to the other effects.

Therefore, the LORDP algorithm avoids adding too much Laplacian noise to the trajectory data after processing by local optimisation, which prevents data distortion as much as possible. And it is also more effective against reducing the mean error of the algorithm, i.e. improving the availability and safety of the trajectory data.

6. Summary and prospect

In this paper, we propose a trajectory privacy data publishing scheme based on local optimisation and R-tree. The scheme makes the release of raw trajectory data less likely to expose risks. It combines the latest local location data sheet for local optimisation and uses differential privacy technique to protect sensitive and non-sensitive information. Compared to other algorithms, the local optimisation of LORDP algorithm in this paper can better protect the availability, greatly improve data utilisation, and reduce the risk of privacy leakage. However, with the popularity of mobile devices and Internet of Things, the generation and collection of trajectory data are becoming more and more common, and the demand for real-time processing and analysis is also increasing. Therefore, considering real-time processing is very important in the research of trajectory data processing and analysis in the future. Additionally, it is also necessary to further improve and optimise the privacy budget allocation strategy based on the different characteristics and application scenarios of the data, achieving better trade-off between data utility and privacy protection.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References 1 Bai, Y., Li, X., Chen, C., Li, X., & Wang, Y. (2021). Research on optimal suppression differential privacy protection for trajectory data publishing. Journal of Chinese Computer Systems, 42 (8), 6. https://doi.org/10.3969/j.issn.1000-1220.2021.08.034 2 Bozkurt, Y., Braun, R., Rossmann, A., & Hertweck, D. (2020). The advent of smart cities: status quo and future research directions. https://dx.doi.org/10.33965/ict_csc_wbc_2020_202008l014 3 Cheng, W., Wen, R., Huang, H., Miao, W., & Wang, C. (2022). OPTDP: Towards optimal personalized trajectory differential privacy for trajectory data publishing. Neurocomputing, 472, 201 – 211. https://doi.org/10.1016/j.neucom.2021.04.137 4 Cunha, M., Mendes, R., & Vilela, J. P. (2019, October). Clustering geo-indistinguishability for privacy of continuous location traces. In 2019 4th International Conference on Computing, Communications and Security (ICCCS) (pp. 1–8). IEEE. https://doi.org/10.1109/CCCS.2019.8888111 5 Dwork, C. (2008, April). Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation (pp. 1–19). Springer. 6 Ghosh, S., Ghosh, S. K., & Buyya, R. (2020). MARIO: A spatio-temporal data mining framework on google cloud to explore mobility dynamics from taxi trajectories. Journal of Network and Computer Applications, 164, Article 102692. https://doi.org/10.1016/j.jnca.2020.102692 7 Hu, S., Gao, S., Wu, L., Xu, Y., Zhang, Z., Cui, H., & Gong, X. (2021). Urban function classification at road segment level using taxi trajectory data: A graph convolutional neural network approach. Computers, Environment and Urban Systems, 87, Article 101619. https://doi.org/10.1016/j.compenvurbsys.2021.101619 8 Li, H., Xue, X., Li, Z., Li, L., & Xiong, J. (2021). Location privacy protection scheme for LBS in IoT. Wireless Communications and Mobile Computing. https://doi.org/10.1155/2021/9948543 9 Liu, M., Zhang, Z., Chai, W., Chai, W., & Wang, B. (2022). Privacy-preserving COVID-19 contact tracing solution based on blockchain. Computer Standards & Interfaces, 83, Article 103643. https://doi.org/10.1016/j.csi.2022.103643 Liu, X., Xie, L., Wang, Y., Zou, J., Xiong, J., Ying, Z., & Vasilakos, A. V. (2020a). Privacy and security issues in deep learning: A survey. IEEE Access, 9, 4566 – 4593. https://doi.org/10.1109/ACCESS.2020.3045078 Liu, Y., Feng, T., Peng, M., Jiang, Z., Xu, Z., Guan, J., & Yao, S. (2020b). COMP: Online control mechanism for profit maximization in privacy-preserving crowdsensing. IEEE Journal on Selected Areas in Communications, 38 (7), 1614 – 1628. https://doi.org/10.1109/JSAC.2020.2999697 Mohammed, N., Fung, B., & Debbabi, M. (2010). Preserving privacy and utility in RFID data publishing. Niu, T., Qing, L., Han, L., Long, Y., Hou, J., Li, L., Tang, W., & Teng, Q. (2022). Small public space vitality analysis and evaluation based on human trajectory modeling using video data. Building and Environment, 225, Article 109563. https://doi.org/10.1016/j.buildenv.2022.109563 Qiu, S. Y., Pi, D., Wang, Y. X., & Liu, Y. F. (2023). Novel trajectory privacy protection method against prediction attacks. Expert Systems with Applications, 213, Article 118870. https://doi.org/10.1016/j.eswa.2022.118870 Tedjopurnomo, D. A., Li, X., Bao, Z., Cong, G., Choudhury, F., & Qin, A. K. (2021). Similar trajectory search with spatio-temporal deep representation learning. ACM Transactions on Intelligent Systems and Technology, 12 (6), 1 – 26. https://doi.org/10.1145/3466687 Terrovitis, M., Poulis, G., Mamoulis, N., & Skiadopoulos, S. (2017). Local suppression and splitting techniques for privacy preserving publication of trajectories. IEEE Transactions on Knowledge and Data Engineering, 29 (7), 1466 – 1479. https://doi.org/10.1109/TKDE.2017.2675420 Tu, W., Zhu, T., Xia, J., Zhou, Y., Lai, Y., Jiang, J., & Li, Q. (2020). Portraying the spatial dynamics of urban vibrancy using multisource urban big data. Computers, Environment and Urban Systems, 80, Article 101428. https://doi.org/10.1016/j.compenvurbsys.2019.101428 Wang, S., Bao, Z., Culpepper, J. S., & Cong, G. (2022). A survey on trajectory data management, analytics, and learning. ACM Computing Surveys, 54 (2), 1 – 36. https://doi.org/10.1145/3440207 Wu, W., Niu, X., & Li, M. (2021). Influence of built environment on street vitality: A case study of west Nanjing road in Shanghai based on mobile location data. Sustainability, 13. https://doi.org/10.3390/su13041840 Yuan, S., Pi, D., Zhao, X., & Xu, M. (2021). Differential privacy trajectory data protection scheme based on R-tree. Expert Systems with Applications, 182, Article 115215. https://doi.org/10.1016/j.eswa.2021.115215 Zhang, S., Yao, T., Sandor, V. K. A., Weng, T. H., & Su, J. (2021). A novel blockchain-based privacy-preserving framework for online social networks. Connection Science, 33, 555 – 575. https://doi.org/10.1080/09540091.2020.1854181 Zhang, X., Chen, C., Xie, Y., Chen, X., Zhang, J., & Xiang, Y. (2022). A survey on privacy inference attacks and defenses in cloud-based deep neural network. Computer Standards & Interfaces, 83, Article 103672. https://doi.org/10.1016/j.csi.2022.103672 Zhao, X., Pi, D., & Chen, J. (2020). Novel trajectory privacy-preserving method based on prefix tree using differential privacy. Knowledge-Based Systems, 198, 105940. https://doi.org/10.1016/j.knosys.2020.105940 Zuo, K., Liu, R., Zhao, J., Shen, Z., & Chen, F., (2022). Method for the protection of spatiotemporal correlation location privacy with semantic information. Journal of Xidian University, 49 (1), 67 – 77. https://doi.org/10.19665/j.issn1001-2400.2022.01.007

By Peiqian Liu; Duoduo Wu; Zihao Shen and Hui Wang

Reported by Author; Author; Author; Author

Titel:	Trajectory privacy data publishing scheme based on local optimisation and R-tree
Autor/in / Beteiligte Person:	Liu, Peiqian ; Wu, Duoduo ; Shen, Zihao ; Wang, Hui
Link:	Volltext (PDF) View record in DOAJ (Volltext) https://doaj.org/toc/0954-0091 https://doaj.org/toc/1360-0494
Zeitschrift:	Connection Science, Jg. 35 (2023-12-01), Heft 1
Veröffentlichung:	Taylor & Francis Group, 2023
Medientyp:	academicJournal
ISSN:	0954-0091 (print) ; 1360-0494 (print)
DOI:	10.1080/09540091.2023.2203880
Schlagwort:	the minimum violating sequence local optimisation r-tree differential privacy Electronic computers. Computer science QA75.5-76.95
Sonstiges:	Nachgewiesen in: Directory of Open Access Journals Sprachen: English Collection: LCC:Electronic computers. Computer science Document Type: article File Description: electronic resource Language: English

Klicken Sie ein Format an und speichern Sie dann die Daten oder geben Sie eine Empfänger-Adresse ein und lassen Sie sich per Email zusenden.

BibTeX Citavi, JabRef, u.a.
(Literaturverwaltung)

PDF kein Volltext!
(Merkzettel, Notizen)

RIS Endnote, Citavi u.a.
(Literaturverwaltung)

MODS
(XML zur Weiterverarbeitung)

oder

Wählen Sie das für Sie passende Zitationsformat und kopieren Sie es dann in die Zwischenablage, lassen es sich per Mail zusenden oder speichern es als PDF-Datei.

Gewünschter Zitations-Stil:

oder

Bitte prüfen Sie, ob die Zitation formal korrekt ist, bevor Sie sie in einer Arbeit verwenden. Benutzen Sie gegebenenfalls den "Exportieren"-Dialog, wenn Sie ein Literaturverwaltungsprogramm verwenden und die Zitat-Angaben selbst formatieren wollen.