In recent years, image processing, especially for remote sensing, has developed rapidly, and the efficiency of processing remote sensing images has become a research hotspot in the field. However, remote sensing data raises problems when processed by a distributed framework such as Spark, and the key obstacles to higher execution efficiency are data skew and data reuse. Therefore, in this paper, a parallel acceleration strategy for a typical deep learning algorithm, the deep belief network (DBN), is proposed to improve the execution efficiency of the DBN algorithm in Spark. First, a re-partition algorithm based on a tag set is proposed to relieve the data skew problem. Second, a cache replacement algorithm based on RDD characteristics is proposed to automatically cache frequently used resilient distributed datasets (RDDs). Caching reduces the re-computation time of frequently reused RDDs, which decreases the total computation time of the job. Numerical results and analysis verify the effectiveness of the strategy.
Deep learning; DBN; Acceleration strategy; Data skew; RDD cache
With the improvement of observation ability, remote sensing has entered the field of big data [1-3].
At present, as various fields and remote sensing technology become more and more closely combined, the demand for using remote sensing images to accurately distinguish geographical information is highlighted. Many deep learning models are used in remote sensing image processing; for example, Zhang et al. [4] surveyed the state of the art of deep learning for remote sensing data.
However, deep learning algorithms have high computational complexity and require multiple iterations for the parameters to converge to an optimal value. This results in low efficiency and high time overhead when classifying data with a deep learning algorithm. Existing research pays little attention to data skew and cache replacement during task assignment on an in-memory framework, which leads to increased computation time.
In order to improve the efficiency of DBN, a parallel acceleration strategy for DBN (PA_DBN) is proposed, which includes a re-partition algorithm and an RDD (resilient distributed dataset) cache algorithm based on reuse frequency and RDD size. The re-partition algorithm addresses the data skew problem, and the RDD cache algorithm addresses the data reuse problem. The effectiveness of the strategy is then verified on remote sensing data processing.
The core architecture of the DBN algorithm is the restricted Boltzmann machine (RBM). The RBM simplifies the links between the visible layer and the hidden layer of the Boltzmann machine.
Definition 1 (Joint distribution function of RBM). Suppose there are $m$ nodes in the visible layer, in which the $i$th node is represented by $v_i$, and $n$ nodes in the hidden layer, in which the $j$th node is represented by $h_j$. The energy of a joint configuration $(v, h)$ is
$$E(v, h \mid \theta) = -\sum_{i=1}^{m} a_i v_i - \sum_{j=1}^{n} b_j h_j - \sum_{i=1}^{m}\sum_{j=1}^{n} v_i w_{ij} h_j,$$
and the joint distribution is
$$P(v, h \mid \theta) = \frac{e^{-E(v, h \mid \theta)}}{Z(\theta)}, \qquad Z(\theta) = \sum_{v, h} e^{-E(v, h \mid \theta)}.$$
Among them, $\theta = \{w_{ij}, a_i, b_j\}$ are the model parameters, where $w_{ij}$ is the weight between $v_i$ and $h_j$, $a_i$ and $b_j$ are the biases of the visible and hidden nodes, and $Z(\theta)$ is the normalization (partition) function.
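For a tiny RBM, the energy function and joint distribution of Definition 1 can be checked numerically by brute-force enumeration of the partition function (all parameter values below are illustrative):

```python
import itertools
import math

def energy(v, h, W, a, b):
    # E(v, h | theta) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i w_ij h_j
    return (
        -sum(a[i] * v[i] for i in range(len(v)))
        - sum(b[j] * h[j] for j in range(len(h)))
        - sum(v[i] * W[i][j] * h[j] for i in range(len(v)) for j in range(len(h)))
    )

def joint_probability(v, h, W, a, b):
    # P(v, h | theta) = exp(-E(v, h | theta)) / Z(theta), where the partition
    # function Z sums exp(-E) over every binary (v, h) configuration.
    m, n = len(a), len(b)
    Z = sum(
        math.exp(-energy(vs, hs, W, a, b))
        for vs in itertools.product([0, 1], repeat=m)
        for hs in itertools.product([0, 1], repeat=n)
    )
    return math.exp(-energy(v, h, W, a, b)) / Z

# Tiny illustrative RBM: m = 2 visible units, n = 2 hidden units.
W = [[0.1, -0.2], [0.3, 0.05]]
a = [0.0, 0.1]
b = [-0.1, 0.2]

# The joint probabilities over all 2^m * 2^n states must sum to 1.
total = sum(
    joint_probability(vs, hs, W, a, b)
    for vs in itertools.product([0, 1], repeat=2)
    for hs in itertools.product([0, 1], repeat=2)
)
print(round(total, 6))  # 1.0
```

Brute-force enumeration of $Z(\theta)$ is only feasible for toy sizes; real RBM training approximates these quantities, e.g., with contrastive divergence.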
Definition 2 (Softmax function). The softmax function is a classifier function that generalizes logistic regression, a linear model, to multi-class problems. For a training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}$ with $y^{(i)} \in \{1, \ldots, K\}$, the probability that a sample $x$ belongs to class $k$ is
$$P(y = k \mid x) = \frac{e^{\theta_k^{\top} x}}{\sum_{j=1}^{K} e^{\theta_j^{\top} x}}.$$
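A minimal sketch of the softmax normalization in Definition 2 (the class scores below are illustrative):

```python
import math

def softmax(z):
    # softmax(z)_k = exp(z_k) / sum_j exp(z_j); subtracting max(z) first
    # is a standard trick for numerical stability.
    m = max(z)
    exps = [math.exp(zk - m) for zk in z]
    s = sum(exps)
    return [e / s for e in exps]

scores = [2.0, 1.0, 0.1]  # hypothetical class scores for one sample
probs = softmax(scores)
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(probs.index(max(probs)))       # predicted class index: 0
```

The outputs form a valid probability distribution, so the predicted class is simply the index of the largest probability.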
DBN can be regarded as a multi-layer RBM; the sample label is combined with the softmax classification function to supervise the training, and the backpropagation is used to form the DBN and tune the model.
Definition 3 (RDD execution time). Assuming that each RDD has $n$ partitions, an RDD is denoted as $\mathrm{RDD}_i$ and its partitions as $P_i = \{p_{i1}, p_{i2}, \ldots, p_{in}\}$. Since the partitions of an RDD are processed in parallel, the execution time of $\mathrm{RDD}_i$ is determined by its slowest partition:
$$T(\mathrm{RDD}_i) = \max_{1 \le j \le n} T(p_{ij}).$$
The computation time of a partition is composed of read cost and process cost, where the read cost is the time to fetch the parent partitions and the process cost is the processing time, which depends on the type, complexity, and closure of the operation and on the size of the parent partitions. Denoting the parent partitions of $p_{ij}$ as $\mathrm{parents}(p_{ij})$, the partition time is
$$T(p_{ij}) = T_{\mathrm{read}}(p_{ij}) + T_{\mathrm{process}}(p_{ij}).$$
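The cost model above, combined with the max rule of Definition 3, shows why skewed partitions dominate RDD time. A minimal sketch with hypothetical read/process costs:

```python
def partition_time(read_cost, process_cost):
    # T(p_j) = T_read(p_j) + T_process(p_j)
    return read_cost + process_cost

def rdd_execution_time(partitions):
    # The RDD finishes only when its slowest partition does, so the RDD
    # execution time is the maximum partition time (Definition 3).
    return max(partition_time(r, p) for r, p in partitions)

# Hypothetical (read_cost, process_cost) pairs in seconds for 4 partitions.
# Both layouts do the same total work (20 s), but skew changes the result.
balanced = [(1, 4), (1, 4), (1, 4), (1, 4)]
skewed = [(1, 2), (1, 2), (1, 2), (1, 10)]  # one oversized partition

print(rdd_execution_time(balanced))  # 5
print(rdd_execution_time(skewed))    # 11 -- the skewed partition dominates
```

Even though the total workload is identical, the skewed layout more than doubles the RDD execution time, which is exactly the problem the re-partition algorithm targets.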
Lemma 1 (Consistency principle of partition computation time). Partitions of $\mathrm{RDD}_i$ with the same size have approximately the same computation time, so the largest partition determines $T(\mathrm{RDD}_i)$.
Proof Assume $PT_{ij}$ denotes the computation time of partition $p_{ij}$. All partitions of the same RDD execute the same closure, so $PT_{ij}$ is determined by the partition size; partitions of equal size therefore take approximately equal time, and the largest partition dominates $T(\mathrm{RDD}_i)$ by Definition 3.
Lemma 2 (Task skewing principle). Efficient task allocation can effectively reduce the execution time of local tasks and accelerate the execution of the job.
Proof Assume the current set of workers is $\{w_1, w_2, \ldots, w_k\}$, and consider two workers $w_a$ and $w_b$ with different computing abilities.
Before the task is switched, the local pull task performed by worker $w_a$ is denoted as $\mathrm{Task}_a$, and the local pull task performed by worker $w_b$ is denoted as $\mathrm{Task}_b$.
From the point of view of task workload, $\mathrm{Task}_a$ and $\mathrm{Task}_b$ carry different amounts of data when the partitions are skewed.
Because the computing abilities of $w_a$ and $w_b$ and the workloads of their tasks differ, the local tasks finish at different times, and the job must wait for the slowest worker.
It means that the execution times of $\mathrm{Task}_a$ and $\mathrm{Task}_b$ can be reduced by switching tasks so that workloads match computing abilities, which shortens the overall execution time.
Lemma 3 (Principle of saving time). Assume the overall execution time of $\mathrm{RDD}_i$ is $r_i$ and its reuse frequency is $f_i$. Caching $\mathrm{RDD}_i$ saves the re-computation time $(f_i - 1)\, r_i$, so caching frequently reused RDDs reduces the total computation time of the job.
Proof Assume the execution times of the RDDs in a job are $R = \{r_1, r_2, \ldots, r_n\}$ and the corresponding reuse frequencies are $\{f_1, f_2, \ldots, f_n\}$. Without caching, $\mathrm{RDD}_i$ is recomputed on every reuse, contributing $f_i r_i$ to the job time; with caching, it is computed once, contributing $r_i$. The difference $(f_i - 1) r_i \ge 0$ accumulates over all cached RDDs, so the total computation time of the job decreases.
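The saving of Lemma 3 can be sketched with a toy cost model (the per-RDD times and reuse frequencies below are hypothetical, and cache read/write overhead is neglected):

```python
def job_time_without_cache(exec_times, reuse_counts):
    # Every reuse of RDD_i recomputes it, contributing f_i * r_i in total.
    return sum(f * r for r, f in zip(exec_times, reuse_counts))

def job_time_with_cache(exec_times, reuse_counts, cached):
    # A cached RDD is computed once; later reuses read it from memory
    # (cache read cost is neglected in this simplified model).
    return sum(
        r if i in cached else reuse_counts[i] * r
        for i, r in enumerate(exec_times)
    )

# Hypothetical per-RDD execution times (seconds) and reuse frequencies.
r = [4.0, 2.0, 6.0]
f = [1, 3, 2]

no_cache = job_time_without_cache(r, f)          # 4 + 6 + 12 = 22
with_cache = job_time_with_cache(r, f, {1, 2})   # 4 + 2 + 6 = 12
print(no_cache, with_cache)
# saving = sum over cached i of (f_i - 1) * r_i = 2*2 + 1*6 = 10
```

The saving grows with both the reuse frequency and the per-RDD execution time, which motivates weighting cache candidates by how often and how expensively they are recomputed.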
In this section, a parallel acceleration DBN strategy (PA_DBN) is proposed to improve the execution efficiency of the DBN algorithm in Spark, and the detailed process of the PA_DBN strategy is shown in Fig. 1.
Fig. 1 Flow chart of the PA_DBN strategy
The detailed process of the PA_DBN strategy is as follows:
Step 1. Initialize the read data path and the number of data partitions. Spark uses RDD's textFile operator to read the data from HDFS into the memory of the Spark cluster.
Step 2. Create an RBM training method, which contains backpropagation; the result of the backward calculation is used as the input data of the next RBM, and the weights of the DBN algorithm are updated to reduce the error.
Step 3. If data skew occurs, perform the re-partition (RP) algorithm, which re-partitions the RDD to avoid the situation where some partitions are much larger than others and cause higher computation time.
Step 4. If any RDD has a reuse frequency greater than 2, the RDD cache (RC) algorithm is performed, which caches frequently reused RDDs with higher weight on the basis of RDD frequency and RDD size. When memory space is insufficient, the RDDs with smaller weights are replaced first.
Step 5. The weight parameters are initialized, the weights of the first layer are calculated, the weights of the hidden layers are calculated with the DBN training function of Step 2, and then the weight values of each node are merged.
Step 6. Save the trained weight parameters to HDFS.
In Spark, RDDs are partitioned according to the hash partition algorithm, which can produce partitions of different sizes in DBN. Based on Definition 3, the maximum partition execution time determines the execution time of an RDD; therefore, unevenly sized partitions slow down the execution of DBN. The RP algorithm is proposed to solve the data skew caused by skewed partitioning of the data.
The details of the RP algorithm are:
Step 1. The sample data set is sampled on a small scale, and the sample set is judged to determine whether the data is skewed or not.
Step 2. If the data is skewed, repartition the data using a series of segmentation tags: if the data has n partitions based on the parallelism degree, then n - 1 segmentation tags (s₁, s₂, ..., s₍ₙ₋₁₎) are needed, chosen from the sampled keys so that the tags split the key range into roughly equal-sized parts.
Step 3. When the data is partitioned in Spark, records are distributed to partitions according to the tag set: keys smaller than s₁ go to the first partition, keys between s₁ and s₂ go to the second partition, and so on, so that each partition holds a similar amount of data.
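Steps 1-3 can be sketched in plain Python (in a real Spark job this logic would live in a custom Partitioner; the function names and the skewed key distribution below are illustrative assumptions):

```python
import bisect
import random

def segmentation_tags(sample_keys, n_partitions):
    # Step 2: derive n - 1 segmentation tags (s_1 <= ... <= s_{n-1}) from a
    # small sample so the tags split the keys into roughly equal ranges.
    ordered = sorted(sample_keys)
    return [
        ordered[len(ordered) * i // n_partitions]
        for i in range(1, n_partitions)
    ]

def partition_of(key, tags):
    # Step 3: keys below s_1 go to partition 0, keys between s_1 and s_2
    # to partition 1, and so on.
    return bisect.bisect_right(tags, key)

# Hypothetical skewed key distribution: most keys cluster near 0.
random.seed(42)
keys = [int(abs(random.gauss(0, 10))) for _ in range(10_000)]

tags = segmentation_tags(random.sample(keys, 500), 4)  # small-scale sample
counts = [0] * 4
for k in keys:
    counts[partition_of(k, tags)] += 1

print(tags)
print(counts)  # partition sizes are roughly balanced despite the skew
```

A plain hash partitioner would route this distribution by key hash, so duplicated hot keys can pile into one partition; the sampled tag set instead bounds every partition by a key range sized from the observed distribution.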
In the process of PA_DBN execution, minimizing the space occupied by the storage area and reserving memory for the execution area can effectively improve task execution efficiency. The memory area minimization algorithm is shown in Algorithm 1, and the specific steps are as follows:
Step 1. Information about the RDDs is obtained from the DAG graph of the DBN job. Through pruning analysis and depth-first traversal, the RDD with an action operation is set as the root node, RDDs with a reuse frequency of 0 are pruned, and RDDs with a frequency greater than 1 are reserved as candidates. When f = 0, the RDD is no longer used and is deleted; when f > 0, the higher the frequency, the greater the caching weight. The key-value pair set R⟨RDD, f⟩ is traversed to build the candidate list.
Step 2. During actual execution, once a generated RDD appears in the candidate cache list, it is compared with the other candidates. If there is more than one RDD in the candidate list, they are sorted by relative weight (reuse frequency relative to size) and placed in the cache in that order, subject to the restriction that the size of the RDD plus the currently used storage area cannot exceed the allocated storage memory.
Step 3. When the memory space is insufficient, the RDDs with the smallest weights are cleaned and replaced in turn, following the order of the weight list, until there is enough space for the new RDD.
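The candidate selection and weight-based replacement of Steps 1-3 can be sketched as follows; the exact weight formula (reuse frequency divided by size) and all names are illustrative assumptions, since the text specifies only that weight grows with frequency and that smaller-weight RDDs are evicted first:

```python
class RDDCache:
    """Weight-based RDD cache sketch: weight = reuse frequency / size, so
    small, frequently reused RDDs are preferred; the lowest-weight entries
    are evicted first when memory runs short."""

    def __init__(self, capacity):
        self.capacity = capacity   # allocated storage memory (arbitrary units)
        self.entries = {}          # rdd_id -> (size, reuse frequency)

    def used(self):
        return sum(size for size, _ in self.entries.values())

    def weight(self, rdd_id):
        size, freq = self.entries[rdd_id]
        return freq / size

    def put(self, rdd_id, size, freq):
        # Step 1: only RDDs reused at least twice are cache candidates.
        if freq < 2 or size > self.capacity:
            return False
        self.entries[rdd_id] = (size, freq)
        # Step 3: evict the lowest-weight RDDs until the new one fits.
        while self.used() > self.capacity:
            victim = min(self.entries, key=self.weight)
            del self.entries[victim]
            if victim == rdd_id:
                return False  # the new RDD itself had the lowest weight
        return True

cache = RDDCache(capacity=100)
cache.put("rdd_a", size=40, freq=5)  # weight 0.125
cache.put("rdd_b", size=50, freq=2)  # weight 0.04
cache.put("rdd_c", size=30, freq=6)  # weight 0.2 -> evicts rdd_b
print(sorted(cache.entries))  # ['rdd_a', 'rdd_c']
```

Dividing frequency by size biases the cache toward RDDs whose recomputation saving per unit of memory is largest, which matches the intent of Lemma 3; other monotone weight functions would also fit the description in the text.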
The experimental environment uses one master and four workers to establish a Spark cluster. Considering the large amount of computation and the abundant information of remote sensing images, remote sensing image data is used as the data source in this experiment; the data is derived from the Landsat 8 satellite in 2013. The study area is Manas County, Xinjiang, located at 85.7-86.7° E and 43.5-45.6° N (Fig. 2). The pre-processing of the remote sensing data is performed with the ENVI 5.2 software to improve the quality of the data. Then common feature indexes are extracted, such as the normalized difference vegetation index (NDVI), difference vegetation index (DVI), ratio vegetation index (RVI), and enhanced vegetation index (EVI). Based on these four characteristic index parameters, five different sample sets are extracted, as shown in Table 1. The other data in the sample sets include desert, river, and so on; the training and testing samples for grassland and others are 5000 and 3000 items, respectively, as shown in Table 2.
Fig. 2 Remote sensing image of Manas County, Xinjiang
In this experiment, sample set 1 is selected as the training set of the PA_DBN model. For data sizes of 5000, 10,000, and 20,000 items, the training times of PA_DBN and the traditional DBN are shown in Table 3 and Fig. 3, where 1, 2, and 3 in Fig. 3 represent 20,000, 10,000, and 5000 items, respectively.
Fig. 3 The comparison between the two algorithms
Table 3 and Fig. 3 show that, with 10,000 items, the execution speed of the PA_DBN algorithm is about 12.7% faster than that of the DBN algorithm under the same data volume; with 20,000 items, it is about 16.4% faster.
With 3 hidden layers in the DBN, different sample sets are used as input data, that is, different feature selections. The comparison of execution time between the original hash partition and the re-partition algorithm on the same sample sets is shown in Fig. 4, where the sample category number represents samples 1-5.
Fig. 4 The comparison when taking the re-partition algorithm
In Fig. 4, it can be seen that the execution time of PA_DBN differs across sample sets, and that its execution speed is improved by using the re-partition algorithm.
Different feature combinations, drawn from NDVI, RVI, DVI, and EVI, are used as DBN input data to test the accuracy. The number of hidden layers of the DBN is 3, the number of iterations is 1000, and the other parameters are fixed. The accuracy of DBN for grassland discrimination under different input feature combinations is shown in Table 4. For the model structure n-h-o, n represents the number of features, h the number of hidden layers, and o the number of outputs.
From Table 4, we can see that the accuracy of DBN differs with the input feature combination; when the input combination is NDVI + RVI + DVI + EVI, the accuracy is highest, at 96.19%.
In this experiment, the fifth sample set is selected as the training set of the PA_DBN algorithm. By adjusting the number of hidden layers to increase the number of cached RDDs, the execution time before and after using the RC algorithm is tested, as shown in Fig. 5.
Fig. 5 The comparison when taking the RDD cache algorithm
As shown in Fig. 5, with the RC algorithm, the data training time of the PA_DBN algorithm is shortened and its execution efficiency increases. Meanwhile, the execution times of both PA_DBN with RC and DBN grow with the number of hidden layers, so the improvement in execution efficiency becomes more and more significant. To further examine the accuracy of DBN, we test the accuracy of PA_DBN under different numbers of hidden layers. The feature combination of NDVI, RVI, DVI, and EVI is used as the input data, with 1000 iterations. The accuracy of PA_DBN is shown in Table 5.
In Table 5, the difference in topological structure corresponds to the number of hidden layers. As the number of hidden layers increases, the accuracy of PA_DBN for grassland discrimination shows an upward trend, reaching a highest accuracy of 97.41%; the accuracy decreases when there are more than four hidden layers. Therefore, the accuracy of DBN does not increase indefinitely with the number of hidden layers.
From Tables 2, 3, and 4 and Figs. 3, 4, and 5, there are three groups of experiments in this section. Experiment 1 shows that the training speed of the PA_DBN algorithm is better than that of the DBN algorithm at the same order of magnitude. Experiment 2 verifies that the RP algorithm solves the problem of data skew and improves the execution speed of PA_DBN. Experiment 3 verifies that the RC algorithm automatically caches highly reused RDDs, compensating for the lack of fine-grained cache replacement.
In this paper, we proposed a PA_DBN strategy under Spark to solve, on the basis of theoretical analysis, some problems existing in the implementation of the DBN algorithm, such as data skew and the lack of fine-grained cache replacement for highly reused RDDs. These problems lead to high complexity and long execution time of DBN. The re-partition algorithm relieves the skew of the training sample set, makes the amount of data contained in each RDD partition more uniform, and improves the speed of DBN training. The RC algorithm caches the RDDs with high reuse frequency in the DBN algorithm. Experiments are conducted to verify the effectiveness of the presented strategy.
Our future work will mainly concentrate on the following aspects: analyzing different types of remote sensing resources, designing optimization strategies adapted to the load and type of jobs, and taking advantage of other algorithms, such as convolutional neural networks, to improve execution efficiency.
The authors would like to thank the reviewers for their thorough reviews and helpful suggestions.
This paper was supported by the National Natural Science Foundation of China under Grant Nos. 61262088, 61462079, and 61562086.
All data are fully available without restriction.
CTY is the main writer of this paper. She proposed the main idea, completed the experiment, and analyzed the result. CYY and ZH gave some important suggestions for this paper. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
1. John WS. Big data: a revolution that will transform how we live, work, and think. International Journal of Advertising. 2014;33(1):181-183. doi:10.2501/IJA-33-1-181-183
2. Kambatla K, Kollias G, Kumar V. Trends in big data analytics. Journal of Parallel and Distributed Computing. 2014;74(7):2561-2573. doi:10.1016/j.jpdc.2014.01.003
3. Chen CLP, Zhang CY. Data-intensive applications, challenges, techniques and technologies: a survey on big data. Information Sciences. 2014;275:314-347. doi:10.1016/j.ins.2014.01.015
4. Zhang L, Zhang L, Du B. Deep learning for remote sensing data: a technical tutorial on the state of the art. IEEE Geoscience and Remote Sensing Magazine. 2016;4(2):22-40. doi:10.1109/MGRS.2016.2540798
5. Das M, Ghosh SK. Deep-STEP: a deep learning approach for spatiotemporal prediction of remote sensing data. IEEE Geoscience and Remote Sensing Letters. 2016;13(12):1984-1988. doi:10.1109/LGRS.2016.2619984
6. Hinton G, Deng L, Yu D. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Processing Magazine. 2012;29(6):82-97. doi:10.1109/MSP.2012.2205597
7. Badrinarayanan V, Kendall A, Cipolla R. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2017;39(12):2481-2495. doi:10.1109/TPAMI.2016.2644615
8. Mou L, Ghamisi P, Zhu X. Deep recurrent neural networks for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing. 2017;55(7):3639-3655. doi:10.1109/TGRS.2016.2636241
9. Dahl GE, Yu D, Deng L. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing. 2012;20(1):30-42. doi:10.1109/TASL.2011.2134090
10. Chen Y, Zhao X, Jia X. Spectral-spatial classification of hyperspectral data based on deep belief network. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing. 2015;8(6):2381-2392. doi:10.1109/JSTARS.2015.2388577
11. Fischer A, Igel C. Training restricted Boltzmann machines. Pattern Recognition. 2014;47(1):25-39. doi:10.1016/j.patcog.2013.05.025
12. Tran VT, Althobiani F, Ball A. An approach to fault diagnosis of reciprocating compressor valves using Teager-Kaiser energy operator and deep belief networks. Expert Systems with Applications. 2014;41(9):4113-4122. doi:10.1016/j.eswa.2013.12.026
CNN: Convolutional neural network
DAG: Directed acyclic graph
DBN: Deep belief network
RBM: Restricted Boltzmann machine
RDD: Resilient distributed dataset
RS: Remote sensing
By Changtian Ying; Zhen Huang and Changyan Ying