Tracking a micro aerial vehicle (MAV) is challenging because of its small size and swift motion. A new model was developed by combining a compact and adaptive search region (SR); the model tracks MAVs accurately and robustly with a fast computation speed. A compact SR, which is only slightly larger than the target MAV, is less likely to include a distracting background than a large SR; thus, it enables accurate tracking. Moreover, the compact SR reduces the computation time because tracking can be conducted with a relatively shallow network. An optimal ratio of the SR size to the MAV size was obtained in this study. However, this optimal compact SR causes frequent tracking failures in the presence of dynamic MAV motion. An adaptive SR is proposed to address this problem; it adaptively changes the location and size of the SR based on the size, location, and velocity of the MAV in the SR. The compact SR without the adaptive strategy tracks the MAV with an accuracy of 0.613 and a robustness of 0.086, whereas the compact and adaptive SR achieves an accuracy of 0.811 and a robustness of 1.0. Moreover, online tracking runs at approximately 400 frames per second, which is significantly faster than real time.
Keywords: visual object tracker; fully convolutional neural network; adaptive search region; truncation prevention; path prediction
Deep convolutional neural networks (CNNs) have significantly improved the accuracy of object recognition from images.
An object detector is used to estimate the location, size, and class of objects from a single image. It requires both regression and classification. Regression is used to estimate the location and size of objects in images, and classification determines the class of the detected object. The most recent CNN-based models utilize bounding boxes to determine the location and size of an object.
Visual object tracking refers to the task of predicting the location of a target object in subsequent images using current and past information.
In this study, a new object tracker was developed considering the advantages and limitations of previous models. The proposed model is a fully offline trained model for fast computation in online tracking, and an anchor box is excluded from the algorithm for robust tracking. A regression approach similar to R-CNN was adopted because it provides highly accurate estimates of location and size. However, R-CNN-based detectors are computationally slow because thousands of regions are investigated. Thus, only one search region (SR) is considered in this model; the target object is tracked within a single SR with a fully convolutional neural network (FCNN). An optimal SR for regression is selected because the tracking performance relies heavily on the SR. The scale and location of the SR are determined based on the size, position, and motion of the target object. The proposed algorithm is verified by a tracking test using a micro aerial vehicle (MAV).
The remainder of this paper is organized as follows. Section 2 describes the procedure for determining an adaptive SR, the structure of the FCNN for tracking, and the scale and coordinate conversions needed for the SR-based tracking approach. The data used in this study are introduced in Section 3. Section 4 presents the tracking results in terms of accuracy, robustness, and computational speed of the proposed algorithm. Finally, Section 5 summarizes the study and concludes the paper.
The proposed tracking algorithm comprises four steps, as shown in Figure 1. First, the location and size of the SR are obtained from the tracking results of previous frames. Second, the SR is rescaled to a constant size (100 × 100). Next, the location and size of the object in the SR are estimated using the FCNN. Finally, the tracking results in the SR are transformed back to the coordinates of the full image.
This SR-based tracker can accurately track objects with fast computation. First, the variations in the size and location of the target objects (in images) can increase the tracking error. The error is large if the object appears very small in the images. If the aspect ratio is extremely large or extremely small, accurate tracking becomes challenging. Second, tracking with full images requires a long computation time if the size of the image is large. Thus, an SR-based tracking model was developed and used instead of considering the full image.
The compact SR (CSR) tracker determines the current SR using only the last tracking result. Specifically, the center of the current SR is set to the center of the object in the previous frame, and the width and height of the current SR change in proportion to the object size in the previous frame. If the size of the target object (in the images) remains constant over time, tracking can be very accurate and robust. In this study, the ratio of the SR width to the object width was kept constant across frames. However, this approach alone does not guarantee accurate and robust tracking; thus, several strategies were developed to improve the quality of the SR.
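As a minimal sketch, the CSR of the current frame can be derived from the previous frame's result as follows; the (cx, cy, w, h) tuple layout, the function name, and the default scale ratio (set to the optimum of 1.6 found later in this study) are illustrative assumptions, not code from the paper.

```python
def compact_sr(prev_box, scale_ratio=1.6):
    """Compute the compact search region (SR) for the current frame.

    prev_box: (cx, cy, w, h) of the object in the previous frame.
    Returns (cx, cy, w, h) of the SR: centered on the previous object
    center, with width and height a constant multiple of the object size.
    """
    cx, cy, w, h = prev_box
    return (cx, cy, w * scale_ratio, h * scale_ratio)
```

Because the SR depends only on the last result, any error in the previous estimate propagates directly to the next SR, which motivates the corrective strategies below.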
To improve the tracking accuracy and robustness, some functions were included in the CSR model; these were shrink constraints (SC), size optimization (SO), and path prediction (PP).
Shrink constraints: If a portion of the target object is truncated by the SR boundary, the estimated size becomes smaller than the true size, as depicted in Figure 2, and this truncation can cause tracking failure. In the CSR model, the SR size is controlled to be linearly proportional to the object size. Thus, if the estimated size is smaller than the true size owing to truncation, as shown in Figure 2b, the size of the SR decreases in the following frame, as shown in Figure 2c. The SR continues to shrink in subsequent frames, which eventually results in tracking failure, as shown in Figure 2d.
The SC of the SR was established to address this problem. First, the size of the SR is constrained not to become smaller than a threshold value. This constraint prevents the SR from shrinking beyond a certain size even if the object is truncated in the SR. Second, the SR is not allowed to shrink if one or more edges of the bounding box lay on the edge of the SR in the previous frame. For a moving object, when the SR is cropped at the position from the previous frame, the object deviates from the center of the SR and approaches its edge; therefore, when the object is located near the edge of the SR, shrinkage of the SR is inhibited. These schemes contribute to the robustness of the tracking algorithm.
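The two constraints can be sketched as below; the minimum-size threshold, the boolean edge-contact flag, and the function interface are hypothetical, since the paper does not report its exact values.

```python
def apply_shrink_constraints(sr_size, prev_sr_size, box_touches_sr_edge,
                             min_size=20):
    """Constrain the SR size according to the two shrink rules.

    sr_size: proposed (w, h) of the new SR.
    prev_sr_size: (w, h) of the SR in the previous frame.
    box_touches_sr_edge: True if the previous bounding box lay on the
        SR edge (a sign the object was truncated, so the estimated
        size is unreliable).
    min_size: illustrative lower bound on SR width/height in pixels.
    """
    w, h = sr_size
    if box_touches_sr_edge:
        # Rule 2: forbid shrinkage; keep at least the previous SR size.
        w = max(w, prev_sr_size[0])
        h = max(h, prev_sr_size[1])
    # Rule 1: never let the SR become smaller than the threshold.
    return (max(w, min_size), max(h, min_size))
```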
Size optimization: As previously mentioned, SR size is determined such that the ratio of SR width to object width remains constant across all frames. The accuracy of the FCNN and the failure rate depend on this ratio. Thus, SO was conducted considering the performance of the tracking algorithm.
Path prediction: For tracking mobile objects, the location of the SR can be estimated more accurately when the dynamics of the objects are considered. Whereas CSR considers only the previous position of the object when determining the location of the current SR, the PP model determines the location of the SR from the previous velocity of the object as well as its previous position, advancing the SR center along the previous displacement of the object scaled by a correction coefficient obtained offline.
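Assuming a linear, velocity-based correction (the published form of the model is not reproduced here), the PP update can be sketched as:

```python
def predict_sr_center(prev_center, prev_prev_center, coeff=1.0):
    """Predict the current SR center from the object's previous motion.

    The center is advanced along the last observed displacement
    (a finite-difference velocity) scaled by a correction coefficient
    fitted offline. The exact form and coefficient are assumptions.
    """
    vx = prev_center[0] - prev_prev_center[0]
    vy = prev_center[1] - prev_prev_center[1]
    return (prev_center[0] + coeff * vx, prev_center[1] + coeff * vy)
```

With coeff = 0, this degenerates to the plain CSR behavior of centering the SR on the previous position.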
The proposed algorithm detects the object in the SR image, which is scaled to a 100 × 100 resolution. The FCNN is responsible for predicting a bounding box in the SR; the bounding box is characterized by four variables describing its location and size. The structure of the FCNN is presented in Table 1.
The images used for training the FCNN were prepared by cropping the SR from the full image because the FCNN tracks the object within the SR. The cropped regions were selected randomly to induce robustness to variations in the size and location of the object in the image. Specifically, the ratio of the cropped region size to the object size was randomly selected between 1.2 and 1.7, and the location of the cropped region was determined such that 90% or more of the target was contained in the cropped region. In addition, image rotation and color variation were applied for data augmentation.
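A possible implementation of this cropping scheme is sketched below. Beyond the two stated constraints (scale ratio drawn from [1.2, 1.7] and at least 90% of the target kept inside the crop), the square crop shape and the offset rule are assumptions.

```python
import random

def sample_training_crop(box, rng=random):
    """Sample a crop around a target box for training augmentation.

    box: (cx, cy, w, h) of the target in the full image.
    Returns (cx, cy, side) of a square crop whose side is a random
    multiple (1.2-1.7) of the larger object dimension.
    """
    cx, cy, w, h = box
    ratio = rng.uniform(1.2, 1.7)
    side = ratio * max(w, h)
    # Shift the crop center by at most 10% of the object extent; with
    # side >= 1.2 * max(w, h) this keeps the whole target inside the
    # crop, which satisfies the >=90% containment rule.
    dx = rng.uniform(-0.1 * w, 0.1 * w)
    dy = rng.uniform(-0.1 * h, 0.1 * h)
    return (cx + dx, cy + dy, side)
```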
The model was trained with a loss function that penalizes the difference between the predicted and ground-truth values of the four bounding-box variables.
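As a hedged sketch, a standard regression loss for the four bounding-box variables (center coordinates x, y and size w, h; an assumed form, not necessarily the one used in the paper) is the mean squared error

```latex
\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}
  \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2
       + (w_i - \hat{w}_i)^2 + (h_i - \hat{h}_i)^2 \right],
```

where N is the number of training samples and hatted symbols denote the FCNN predictions.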
Because the FCNN conducts tracking in the SR, the tracking results in the SR must be transformed to coordinates in the full image: the predicted bounding box is scaled by the ratio of the SR size to the 100 × 100 SR resolution and translated by the position of the SR in the full image.
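This back-transformation can be sketched as follows; the (x, y, w, h) box convention and the top-left SR origin are assumptions rather than details taken from the paper.

```python
def sr_to_image(box_sr, sr_origin, sr_size, sr_resolution=100):
    """Map a box predicted in the scaled SR image back to the full image.

    box_sr: (x, y, w, h) in SR pixel coordinates (100 x 100 by default).
    sr_origin: (x0, y0) top-left corner of the SR in the full image.
    sr_size: (sw, sh) SR width/height in full-image pixels.
    """
    sx = sr_size[0] / sr_resolution
    sy = sr_size[1] / sr_resolution
    x, y, w, h = box_sr
    return (sr_origin[0] + x * sx, sr_origin[1] + y * sy, w * sx, h * sy)
```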
In this study, a MAV (DJI Tello) was selected as the target object because of the difficulty in tracking a small object that moves in various background scenes. Images (with 1280 × 720 resolution) of the flying MAV were recorded for training and testing. Figure 3a shows examples of the original dataset, and the bounding boxes of the MAV are shown in Figure 3b. The detection of the MAV in these images is challenging because of its small size and indistinguishable color from the background.
For accurate and robust tracking, the MAV was captured under various circumstances. For example, in some images the background is complex and similar in appearance to the MAV; in others, the MAV is not fully visible because of backlight or is blurred by its rapid movement. The data are composed of three groups. Dataset1 contains 3585 images for training the FCNN. Dataset2 comprises two sequences (451 and 499 frames) used for SO; it is also used to obtain the coefficients of the PP model. Dataset3 comprises five sequences captured in various environments, with individual sequences of 301–360 frames; it is used to validate the performance of the tracking algorithm.
The tracking performance was evaluated in terms of accuracy, robustness, and the expected average overlap curve (EAOC).
The tracking accuracy for the kth starting point is calculated as the average intersection over union (IOU) between the predicted and ground-truth bounding boxes over the frames tracked from that starting point. The robustness reflects how reliably the tracker avoids failure; a robustness of 1.0 indicates that the target was tracked to the end of every sub-sequence without failure.
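The IOU underlying the accuracy metric can be computed as follows; this is the standard formulation for axis-aligned boxes in (x, y, w, h) form, not code from the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h),
    with (x, y) the top-left corner."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def accuracy(pred_boxes, true_boxes):
    """Mean IOU over the tracked frames of one sub-sequence."""
    vals = [iou(p, t) for p, t in zip(pred_boxes, true_boxes)]
    return sum(vals) / len(vals)
```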
The EAO curve represents both accuracy and robustness with extended sub-sequences. If tracking fails in the middle of a sub-sequence, the corresponding extended sub-sequence is constructed with the original images of the sub-sequence and dummy frames. Dummy frames are needed to calculate the EAOC; tracking cannot be performed in dummy frames, so the IOU is zero in them. The number of dummy frames is determined such that the length of the extended sub-sequence equals that of the original sequence. If tracking is completed in the final frame without failure, the corresponding extended sub-sequence does not contain any dummy frames. The EAOC value at each frame index is then calculated as the average, over all extended sub-sequences, of the mean overlap from the first frame up to that index.
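The dummy-frame construction can be sketched as below, assuming the EAOC at a frame index is the mean over sub-sequences of the average overlap up to that index (an assumption consistent with the description above).

```python
def eaoc(overlaps_per_subseq, seq_len):
    """Expected average overlap curve.

    overlaps_per_subseq: per-frame IOU lists; a list ends early where
    tracking failed. Each list is padded with zeros (dummy frames) to
    seq_len. Returns one EAOC value per frame index.
    """
    padded = [ov + [0.0] * (seq_len - len(ov)) for ov in overlaps_per_subseq]
    curve = []
    for i in range(1, seq_len + 1):
        # Average overlap of each extended sub-sequence up to frame i,
        # then the expectation over sub-sequences.
        avg = [sum(p[:i]) / i for p in padded]
        curve.append(sum(avg) / len(avg))
    return curve
```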
In the following section, the performances of four trackers are compared. The first model is the CSR tracker; hereinafter, this model is referred to as CSR. The second model applies the SC to the CSR model and is referred to as CSR + SC. The third model adds SO to the second model and is referred to as CSR + SC + SO. The last model includes PP in addition to the third model and is referred to as CSR + SC + SO + PP.
The shrinkage of the SR was observed in the test using CSR. Figure 5a shows the SR when the MAV begins to be truncated at the right edge of the SR. Then, owing to the MAV motion, only half of the MAV is included in the SR, as shown in Figure 5b. Consequently, the size of the SR decreases further, and the tracker almost loses the MAV, as shown in Figure 5c. Tracking failure occurs in a few subsequent frames.
The SC prevents tracking failures caused by undesirable SR shrinkage. Figure 6 shows the SR and predicted bounding box in the absence and presence of SC in the same frame. Although the over-shrink preventive SR considerably increased the IOU value, a small truncation still remained. This limitation was resolved by SO and PP.
The size of the SR must be optimized to improve the tracking performance of the proposed method. If the SR is extremely large compared with the target object, distracting objects can also be contained in the SR, and the tracker may drift to them. If the SR is extremely small, the tracker is likely to lose moving objects. A constant value is selected for the scale ratio of the SR to the object, and the size of the SR is determined by multiplying this scale ratio by the object size of the previous frame. The object size of the previous frame must be used because, in online tracking, the object size in the current frame can only be estimated after the current SR is determined.
The effect of the scale ratio (of the SR to the target object) on the accuracy was investigated to determine the optimal size, as shown in Figure 7. These results were measured with the truncation-preventive SR (SC) applied. The tracking accuracy was at a maximum when the scale ratio was 1.4, 1.5, or 1.6. When the scale ratio was larger than 1.7, the accuracy decreased considerably because other objects and large areas of the background were included in the SR. Although SC is applied, it cannot prevent all truncations when the scale ratio is small; a rapidly moving object can still be truncated in the SR when a small scale ratio (e.g., 1.2) is used, which lowers the accuracy. In addition, inaccurate estimation of the target size results in an inappropriate SR size because the SR size is determined by the target size in the previous frame. Considering the accuracy, the optimal scale ratio was determined to be 1.6; this value can vary with the target object because it depends on the motion and shape of the target.
The PP model was generated by minimizing the location error of the SR in the sequences. To obtain the error, the distance (in pixels) between the center point of the SR and the true center point of the MAV was calculated; the location error was then obtained by dividing this distance by the size of the SR. The value of the correction coefficient was determined by minimizing this location error over the sequences of Dataset2.
The effects of PP were verified using the test sequences. PP significantly reduced the SR localization error in most frames, as shown in Figure 8. Furthermore, large errors were frequently observed in the absence of PP; such large errors can cause tracking failure. PP markedly reduced this risk of failure: it prevents the SR from losing the target object, thereby improving robustness.
The tracking performance of the models was compared using the IOU values of the test sequences, as shown in Figure 9. CSR failed within 100 frames in every sequence; specifically, it failed within 50 frames in all sequences except sequence 1. CSR + SC tracked the MAV for longer than CSR did, but it also failed at 100–150 frames, indicating the limitation of SC alone. Although CSR + SC + SO provided robust tracking, its IOU value (i.e., accuracy) fluctuated considerably across frames. PP resolved this unstable tracking: except for some sections, the IOU of CSR + SC + SO + PP was maintained at a high level (i.e., 0.8–1).
PP can prevent tracking failure caused by the fast motions of the MAV. While the truncation from the SR can be prevented by SC for a slowly flying MAV, SC cannot prevent tracking failure if the MAV flies at a high speed. Specifically, when an MAV is truncated from the SR because of its fast motion, the CSR + SC model can prevent over-shrinkage. However, the error in the bounding box caused by truncation remains, as shown in Figure 5. The CSR + SC model determines the center of the SR in the following frame as the center of the current bounding box. Thus, if the target is stationary or moves slowly, this truncation can be removed. However, if the target moves fast, it is continuously truncated (or vanishes) from the SR, which can lead to tracking failure, as shown in Figure 9d,e. In contrast, when PP is applied, it moves the SR such that its center is close to the midpoint of the moving target. Thus, truncation and subsequent failure are prevented by PP. For example, in Figure 9d, failure did not occur even though the accuracy remained low for some frames.
PP also provides opportunities for recovering high IOU values that decrease because of misinterpretations of the background in previous frames. For example, as shown in Figure 9d, the IOU started to decrease significantly for both the CSR + SC + SO and CSR + SC + SO + PP models at the 114th frame, where the background is very similar to the MAV. While the CSR + SC + SO model suddenly fails after this frame, PP prevents the tracker from failing. Although the IOU value remains low for a long interval owing to the distracting background, PP enables the SR to follow the MAV. Afterward, when the background becomes less similar to the MAV, the IOU value returns to a high level (i.e., 0.8–1). A series of images and tracking results illustrating this accuracy recovery are shown in Figure 10. While the CSR + SC + SO model failed to track from the 112th frame onwards, PP maintained tracking and returned to a high-accuracy state after 73 frames. A Supplementary movie file shows the tracking performance of the CSR + SC + SO + PP model.
The performance of the models was quantitatively compared in terms of accuracy, robustness, and EAOC of the test sequences. The results of the accuracy and robustness are listed in Table 2. CSR exhibited an accuracy of 0.613 and a robustness of 0.086. The accuracy of CSR is satisfactory. However, this accuracy is valid only under easily trackable conditions; in extreme conditions, CSR fails rapidly, and this failure is not considered in the calculations of the accuracy. In contrast, the robustness is low because of frequent truncations and subsequent failures. In the absence of SC and PP, SR can lose the MAV even if it moves slowly.
CSR + SC exhibits a significantly improved robustness of 0.583. In contrast, the increase in accuracy from SC is negligible because SC cannot correct the truncation of the target in the SR. The optimization of the SR size improves both accuracy and robustness. Although PP slightly reduces the accuracy, it raises the robustness to 1.0, suggesting that PP is necessary for the reliable tracking of dynamically moving targets.
To verify the performance of the proposed model, tracking was also conducted with another recent tracker (Ocean). As listed in Table 2, Ocean achieved an accuracy of 0.756 and a robustness of 1.0, whereas the proposed CSR + SC + SO + PP model achieved a higher accuracy of 0.811 with the same robustness.
The effects of SC, SO, and PP can also be evaluated using the EAOC. A high EAOC value for a frame indicates that tracking is accurate and that failure does not occur in most sub-sequences up to that frame. Owing to early failure, the EAOC of CSR decreases noticeably across the frames, as shown in Figure 11. Although SC and SO increase the EAOC value, it remains low after 300 frames. In contrast, the EAOC value remains high when PP is applied, which suggests that PP is necessary for long tracking tasks.
Among the four models used in this study, CSR + SC + SO + PP is the most accurate and reliable model for tracking. Thus, the inference time of this model was measured. With a GeForce RTX 3090 GPU and a Xeon Silver 4215R 3.20 GHz CPU, the computation time was approximately 2.4 ms per frame (i.e., 410 FPS). This computation speed confirms that the model can be used for real-time tracking.
In this study, an SR-based MAV tracker was developed by integrating several methods to improve tracking accuracy, robustness, and computation speed. First, SC reduces the probability of losing the target from the SR. Second, SO optimizes the SR size by considering the effect of the SR size on the tracking accuracy. Finally, PP estimates a more accurate location of the SR by considering the motion of the target object. Although SC, SO, and PP are built from simple, existing operations, their combination led to significant improvements in tracking (i.e., accuracy, robustness, and EAOC). Moreover, the proposed tracker can be applied to real-time tracking because of its high computation speed (410 FPS). Furthermore, on the test dataset used in this study, the proposed tracker shows higher accuracy than, and robustness equal to, Ocean, a recent high-end tracker.
Although the proposed model exhibits satisfactory tracking performance, additional work is required to address its limitations. First, the proposed tracker was verified using only an MAV. For use in other applications, its performance must be tested with other objects. Second, the proposed model must be updated to manage occlusion periods. When the target size is reduced by occlusion, the SR size can also decrease, which can lead to inaccurate tracking. Thus, an occlusion-detectable model should be developed. In the worst case, the target can be fully covered by other objects. Object detection algorithms can be modified to address this problem. In addition, the tracker should be updated using multiple SRs for multiple-object tracking.
Graph: Figure 1 Procedure of tracking with search region (SR).
Graph: Figure 2 SR shrinkage caused by truncation. The blue box represents the SR, and the yellow box represents the tracking result calculated with a fully convolutional neural network (FCNN). (a–d) show sequential frames.
Graph: Figure 3 Dataset samples of flying MAV in various backgrounds. (a) original images, and (b) same images with annotated bounding boxes of the MAV.
Graph: Figure 4 Starting points and tracking direction for evaluation.
Graph: Figure 5 Shrinkage of SR in the CSR tracker. (a–c) are the SR images in the sequential frames. The white box is the bounding box estimated by the FCNN.
Graph: Figure 6 Effects of SC on the truncation in SRs. (a1–f1) represent the SR of the CSR tracker; (a2–f2) show the SR when truncation preventive SR is applied.
Graph: Figure 7 Effects of scale ratio on accuracy.
Graph: Figure 8 Localization error of SR across frames in the test sequences. (a–e) represent the error of test sequence 1–5, respectively.
Graph: Figure 9 IOU values of the test sequences. Numbers in parenthesis represent the number of frames in the sequences. (a–e) represent the IOU values of test sequence 1–5, respectively.
Graph: Figure 10 Recovery example of PP.
Graph: Figure 11 EAOC of tracking models.
Table 1 Structure of FCNN.
Type                | Number of CNN Filters | Size/Stride | Output Size
Convolutional layer | 16  | 3 × 3/1 | 100 × 100
Convolutional layer | 32  | 3 × 3/1 | 100 × 100
Convolutional layer | 16  | 3 × 3/1 | 100 × 100
Convolutional layer | 32  | 3 × 3/1 | 100 × 100
Max pooling layer   | -   | 2 × 2/2 | 50 × 50
Convolutional layer | 16  | 3 × 3/1 | 50 × 50
Convolutional layer | 8   | 3 × 3/1 | 50 × 50
Convolutional layer | 16  | 3 × 3/1 | 50 × 50
Max pooling layer   | -   | 2 × 2/2 | 25 × 25
Convolutional layer | 32  | 3 × 3/1 | 25 × 25
Convolutional layer | 64  | 3 × 3/1 | 25 × 25
Max pooling layer   | -   | 2 × 2/2 | 12 × 12
Convolutional layer | 128 | 3 × 3/1 | 12 × 12
Convolutional layer | 64  | 3 × 3/1 | 12 × 12
Max pooling layer   | -   | 2 × 2/2 | 6 × 6
Convolutional layer | 128 | 3 × 3/1 | 6 × 6
Convolutional layer | 256 | 3 × 3/1 | 6 × 6
Convolutional layer | 128 | 3 × 3/1 | 6 × 6
Max pooling layer   | -   | 2 × 2/2 | 3 × 3
Convolutional layer | 4   | 3 × 3/3 | 1 × 1
Table 2 Tracking performance comparison.
Method             | Accuracy | Robustness
CSR                | 0.613    | 0.086
CSR + SC           | 0.632    | 0.583
CSR + SC + SO      | 0.846    | 0.685
CSR + SC + SO + PP | 0.811    | 1.0
Ocean              | 0.756    | 1.0
Conceptualization, W.N.; methodology, W.P.; software, W.P.; validation, W.P.; formal analysis, D.L.; investigation, W.P.; resources, W.P.; data curation, J.Y.; writing—original draft preparation, W.P.; writing—review and editing, W.N.; visualization, W.P.; supervision, W.N.; project administration, W.N.; funding acquisition, W.N. All authors have read and agreed to the published version of the manuscript.
This research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2021R1F1A1063463) and the Chung-Ang University Research Scholarship Grants in 2019.
The authors declare no conflict of interest.
The following are available online at https://
By Wooryong Park; Donghee Lee; Junhak Yi and Woochul Nam