Underwater video object detection is a challenging task due to the poor quality of underwater videos, including blurriness and low contrast. In recent years, Yolo series models have been widely applied to underwater video object detection. However, these models perform poorly for blurry and low-contrast underwater videos. Additionally, they fail to account for the contextual relationships between the frame-level results. To address these challenges, we propose a video object detection model named UWV-Yolox. First, the Contrast Limited Adaptive Histogram Equalization method is used to augment the underwater videos. Then, a new CSP_CA module is proposed by adding Coordinate Attention to the backbone of the model to augment the representations of objects of interest. Next, a new loss function is proposed, including regression and jitter loss. Finally, a frame-level optimization module is proposed to optimize the detection results by utilizing the relationship between neighboring frames in videos, improving the video detection performance. To evaluate the performance of our model, We construct experiments on the UVODD dataset built in the paper, and select mAP@0.5 as the evaluation metric. The mAP@0.5 of the UWV-Yolox model reaches 89.0%, which is 3.2% better than the original Yolox model. Furthermore, compared with other object detection models, the UWV-Yolox model has more stable predictions for objects, and our improvements can be flexibly applied to other models.
Keywords: underwater video; object detection; coordinate attention; loss function; frame-level optimization
Underwater video object detection refers to recognizing objects in videos. With the rapid development of deep learning, more and more researchers are applying deep learning to solve the problem of object detection, effectively improving the accuracy and robustness [[
In underwater environments, images are often blurry and the contrast is reduced due to the absorption, reflection, and refraction of light, which ultimately reduces the accuracy of detection. To address these challenges, one approach is to indirectly improve detection performance by enhancing the quality of underwater images [[
Compared to image-level object detection, video-level object detection contains richer temporal and spatial information. Common video object detection methods can be divided into four categories [[
Most existing research on object detection focuses on non-underwater scenes or underwater image object detection. However, existing object detection models have limited detection capabilities for underwater videos with blurry images and low contrast. For objects with blur and low contrast, it is difficult for the model to obtain their detailed features, so they are often undetected or detected as other categories, which reduces the accuracy and recall of detection. In addition, they cannot also utilize the relationships between adjacent frames in the videos. The detection results of the same object between adjacent frames often have large differences in confidence and experience a serious jitter phenomenon, which affects the overall effect of video detection. As a result, their detection performance is unsatisfactory for underwater videos. Therefore, we propose an improved Yolox video object detection model named UWV-Yolox.
Considering the low contrast of the underwater video, we believe that enhancing the contrast of the input is conducive to recovering the feature of the object, so we first use the Contrast Limited Adaptive Histogram Equalization method to enhance the contrast of the input. Second, our intuition is that augmenting the representations of the objects of interest in the model is beneficial for learning more feature information about the object. Therefore, we integrate the coordinate attention module [[
Our main contributions are as follows:
- We use the Contrast Limited Adaptive Histogram Equalization method to enhance the contrast of underwater videos, making the model more suitable for underwater detection. We also propose a new CSP_CA module by integrating Coordinate Attention to augment the representations of the objects of interest;
- We propose a new loss function, which includes classification loss, regression loss, confidence loss, and jitter loss, to improve the model's performance in video detection;
- We propose a frame-level optimization module, which uses tubelet linking, re-scoring, and re-coordinating to optimize the model's results in video object detection.
The rest of this paper is organized as follows. Section 2 shows the research related to underwater video object detection. Section 3 introduces the methods of the proposed UWV-Yolox model. Section 4 presents the experiments conducted on the UVODD dataset built in this paper to validate the effectiveness of the proposed underwater object detection model. Section 5 provides the conclusion of this paper.
One way to improve detection performance is to enhance the quality of the underwater images. Karel et al. proposed Adaptive Histogram Equalization (AHE) and Contrast Limited Adaptive Histogram Equalization (CLAHE) for contrast enhancement, respectively [[
The above are traditional image augmentation methods. In recent years, deep learning has been applied to image augmentation. Li et al. used Convolution Neural Network (CNN) to construct an image dehazing model in 2017, which can generate augmented images end-to-end [[
Another way to optimize the detection performance is to improve the object detection model. In 2021, Zhang et al. proposed a Yolov4-based lightweight underwater object detection method and multi-scale attentional feature fusion module to trade off accuracy and speed [[
These image-level object detection methods have improved detection accuracy, but they only consider the information from single images. For video object detection, there is still rich spatiotemporal information that has not been fully utilized.
In 2016, Kang et al. proposed a complete framework for the VID task based on image-based object detection and object tracking. They utilized a temporal convolution network to optimize the detection results [[
These video object detection methods have achieved great results in conventional detection tasks, but in underwater scenes, they still cannot reach a reasonable detection result.
The structure of the UWV-Yolox model is shown in Figure 1. It can be divided into five modules: input, backbone, neck, prediction, and a frame-level optimization module. In the input module, we add Contrast Limited Adaptive Histogram Equalization to enhance the input videos. We propose the CSP_CA module to replace the CSP module in the backbone module by integrating a Coordinate Attention module. In the prediction module, we propose jitter loss and regression loss. Finally, we add a frame-level optimization module to optimize the detection results.
In underwater environments, the contrast of the underwater videos is usually low, which causes interference with the feature information of the objects. To solve this problem, we recover more feature information in the underwater video by enhancing the contrast to improve the detection effect. Contrast Limited Adaptive Histogram Equalization (CLAHE) is an adaptive histogram equalization method, which divides the input into small blocks and then performs limited histogram equalization. Blocking and limiting contrast before equalization can avoid overexposed areas and suppress noise. Therefore, we use the CLAHE method to enhance the contrast of the underwater video.
Contrast Limited Adaptive Histogram Equalization (CLAHE) improves Histogram Equalization (HE). For a grayscale image
(
The total number of gray values is denoted by L (usually 256), and n represents the total number of pixels in the image. Therefore, the cumulative distribution function of the gray level can be expressed as Equation (
(
Then, through the transformation of
(
The relationship between the cumulative distribution function of the original gray image and the enhanced image is shown in Equation (
(
The function
(
In this way, the grayscale image
CLAHE first divides the input image into several non-overlapping blocks of equal size. It then calculates the histogram of each sub-block and computes the clipping threshold. After pixel redistribution, it performs Histogram Equalization. Finally, it reconstructs the grayscale values of the pixels. The clipping threshold prevents noise amplification in regions that are almost constant—but with noise. It limits the slope of the transformation function by clipping the histogram to a predefined value before computing
While applying CLAHE to a video, the video is first extracted into several frames, then each frame is separated into three color channels: Red, Green, and Blue (RGB). Next, CLAHE is applied separately to each color channel to enhance the contrast. Finally, we convert the augmented data back into a video with enhanced contrast. The entire process is shown in Figure 2.
Blurring is inevitable in underwater videos, which makes it harder for the model to accurately detect objects. In this paper, we propose an improved CSP module named CSP_CA by integrating a Coordinate Attention (CA) module into the CSP module. It improves the detection performance on blurry underwater videos.
As shown in Figure 3, the CA module can be divided into two steps: coordinate information embedding and coordinate attention generation. Coordinate information embedding is performed first. It does not use two-dimensional global pooling to compress the feature tensor into a single feature tensor, which would cause the loss of positional information. The CA module decomposes both horizontal and vertical global pooling into one-dimensional feature encoding operations. So, it allows the model to capture long-term dependencies between channels along one direction and preserves object positional information along the other direction to help the network more accurately locate the interesting objects. Then, the CA module performs coordinate attention generation. The two cascaded feature maps are transformed using a shared
In the UWV-Yolox model, we propose the CSP_CA module by integrating the CA module into the CSP module. The structure of the CSP module is shown in Figure 4a. It divides the input feature map into the trunk branch and the residual branch. The residual branch directly performs convolution to extract features, while the trunk branch convolves first, and then goes through multiple Res Unit structures. Due to the different structures of the Res Unit, the CSP module includes CSP1 and CSP2 structures. The Res Unit structure in the CSP1 module adopts a residual structure. It extracts features through the residual structure of two consecutive convolutional layers. The Res Unit structure in the CSP2 module is directly composed of two consecutive convolutional layers. The two branches are then merged to output the feature map through concat operation.
There are two structures of the CSP_CA module, namely CSP_CA1 and CSP_CA2, as shown in Figure 4b,c. For the Res Unit in the original CSP that contains the residual structure, it is replaced by the CSP_CA1 module. A CA module is added between the CBS module and the Res Unit module in its trunk branch, which enhances the ability to extract blurry features. The Res Unit module in the original CSP module that does not contain the residual structure is replaced by the CSP_CA2 module. The Res Unit module is replaced by the CA module in its trunk branch, reducing the number of parameters of the model [[
The loss function of the UWV-Yolox model includes classification loss, regression loss, confidence loss, and jitter loss, which is defined as follows:
(
Both the classification loss and the confidence loss use Binary CrossEntropy (BCE) loss, as shown in Equation (
(
In object detection tasks, bounding boxes are used to locate the objects. To evaluate the quality of the predicted boxes, the closeness between the predicted box and the ground truth box is calculated, which is also known as the regression loss. The IOU loss is commonly used as regression loss. The IOU between two detection boxes A and B is defined as the intersection over the union of their areas, as shown in Equation (
(
(
However, IOU only considers the relationship between the area of the predicted bounding box and the ground truth bounding box, ignoring the information about their position and shape. Therefore, only using IOU to calculate the regression loss is not accurate enough. In order to more comprehensively utilize the relationship between the predicted bounding box and the ground truth bounding box, a new regression loss function
(
(
The size differences of the detected bounding boxes are relatively large when taking into account the different sizes of underwater objects. It makes a significant difference in the value of the IOU. In order to make the IOU loss more stable in this situation, the logarithm of the area is taken before calculating IOU. At the same time, in order to avoid the situation where the value of the area is too small so the logarithm result becomes negative and the size difference becomes large,
(
(
The larger the value
(
In video object detection, different frames in a video are individually detected. However, the detector's accuracy varies for each frame, so the predicted bounding boxes of adjacent frames may vary significantly, which is not consistent with the fact that the bounding boxes of adjacent frames change in a certain pattern. This phenomenon is called jitter. It not only reduces the accuracy of video object detection but also affects the visualization of the detection results. To reduce the jitters, the UWV-Yolox model adds a new jitter loss function. The physical concepts of velocity and acceleration are incorporated to define the velocity and acceleration of the bounding boxes. Velocity is defined as the difference between the bounding box coordinates of adjacent frames, as shown in Equation (
(
(
The jitters of the bounding boxes are mainly reflected in the irregular velocities of the bounding box coordinates between adjacent frames, which means that the acceleration of the bounding boxes is different. When detecting the same object in adjacent frames, the object originally changes at a certain acceleration. However, due to the predicted deviation of the object detection in one frame, the acceleration of the object changes. By suppressing the change of acceleration to reduce thhe detection errors, the jitters can be suppressed Therefore, the jitter loss mainly consists of acceleration loss, which calculates the difference between the predicted box acceleration and the ground truth box acceleration between adjacent frames, as shown in Equation (
(
T refers to the batch size, and
In video object detection, there may be significant differences in the detection results for the same object between adjacent frames, resulting in sudden drops in confidence and bounding box jitters. Therefore, the UWV-Yolox model adds a frame-level optimization module, including tablet linking, re-scoring, and re-coordinating. The effect of the frame-level optimization module is shown in Figure 5.
The tubelet linking aims to link the detected bounding boxes of the same object across adjacent frames. The good detection results can therefore optimize the poor detection results in the same tubelet. To determine whether two detection results belong to the same object, a similarity function between the two objects needs to be defined. In this frame-level optimization module, the Intersection over Union (IOU) between two bounding boxes, the distance between the center of the bounding boxes
(
After obtaining all paired detection objects between two frames, the tubelet linking process begins. The tubelet is built from the first frame, and if the corresponding pair of objects still exist in the following frames, the tubelet is extended. The tubelet can be extended until there is no paired detection object with the tubelet object in the next frame. The new tubelet can be initialized in any frame. For each pair of frames, if the paired objects do not belong to any existing tubelet, a new tubelet can be created.
The detection results are then optimized based on the linking tubelets. First, the bounding boxes are re-scored in the same tubelets, using good detection results to optimize the poor detection results. For the same object appearing in the adjacent frames, the predicted results should have the same class confidence no matter how large the shape change is. If the predicted object has low class confidence and belongs to a tubelet in which one other object has high class confidence, we could believe that the predicted object should have higher class confidence. So when re-scoring, the average classification confidence score of all detection objects in each tubelet is calculated and assigned to all objects in the tubelet, as shown in Equation (
(
Then, the coordinates of the bounding boxes are re-coordinated in the tubelets. Usually the movement of the object has a specific pattern and does not suddenly jitter, so the predicted bounding boxes in the adjacent frames usually change smoothly instead of changing drastically. We utilize this point to optimize the detected bounding boxes. Bounding box re-coordinating treats the four coordinates of the detection bounding boxes in a tubelet as time series with noise. A one-dimensional Gaussian filter is constructed along each time series, and the filter convolves a Gaussian function with each coordinate of the bounding boxes in the same tubelet. The sum of the convolution operation is used as the new coordinates of the detection objects in the tubelets, as shown in Equation (
(
(
Re-coordinating applies a Gaussian filter to re-coordinate the coordinates of the same object in adjacent frames, and then use the new coordinates sequence as the bounding box coordinates of the detected objects in the tubelets. It reduces the bounding box jitters in video object detection. This effect can be seen in Figure 6. After re-coordinating, sudden changes of the abnormal coordinate values have been reduced and the overall variation of the bounding box coordinates of the same object becomes smoother as it suppresses the jitters.
The experimental platform used in this paper is a GPU server with the Ubuntu operating system, with the following main hardware configuration: Intel(R) Xeon(R) Platinum 8338C CPU and RTX 3090 (24GB) GPU. The UWV-Yolox model is implemented using the deep learning framework Pytorch 1.10.1 and Python 3.8. In the experiments, the training input image size is set to 640 × 640, and the test input size is set to 576 × 576. The batch size is 16, and SGD [[
The existing publicly available datasets are either image datasets or just a few numbers of videos, which are not suitable for training and testing the UWV-Yolox models in the paper. Considering that there is currently no universal publicly available underwater video dataset, we select suitable videos from the collected underwater video datasets to build a universal underwater video object detection dataset (UVODD).
By collecting and screening publicly available underwater datasets, three datasets with high data quality are selected, namely Brackish [[
In terms of class consistency, considering the classes in publicly available underwater datasets, five common classes are selected as the UVODD dataset classes: echinus, holothurian, starfish, fish, and jellyfish. In terms of annotation format consistency, the ImageNet VID dataset [[
In order to objectively evaluate the performance of the model in underwater video object detection, we use the mAP@0.5 and FPS to validate. FPS refers to frames per second, which reflects the detection speed. mAP@0.5 refers to the mean average precision at an IOU threshold of 0.5, which is defined as follows:
(
Here, n refers to the number of classes and AP refers to the average precision for a single class. Each class can calculate its precision and recall for an IOU threshold. Thus, the P-R curve can be obtained, and the AP value is the area under the curve, as shown in Equation (
(
(
(
TP represents the number of correctly detected positive samples, FP represents the number of incorrectly detected negative samples, and FN represents the number of incorrectly detected positive samples.
Before object detection, contrast enhancement processing with CLAHE is applied to the input videos. To verify the effect of the CLAHE, we conduct experiments to compare CLAHE with several data augmentation methods. Considering that there are no paired original images and augmented images in the dataset, methods with supervised learning are not possible. Therefore, we select traditional data augmentation methods and GAN-based data augmentation methods for experiments. Figure 7 shows the augmented videos of different data augmentation methods. The fusion augmentation method makes the videos clearer but causes significant color deviation for the objects in the videos. The Cycle-SNSPGAN method does not have an obvious dehazing effect on the videos. The RGSH method corrects the contrast and color of the videos but some parts are excessively enhanced, causing an exposure phenomenon. Overall, the CLAHE method is the best augmentation method for underwater videos.
Table 2 shows the detection results of the detection model when different augmentation methods are applied to it on the UVODD dataset. It shows that the CLAHE method improves the mAP@0.5 of underwater video object detection by 1.2% compared with the result without augmentation, while other data augmentation methods do not significantly improve or even weaken the detection performance. It indicates the effectiveness of using the CLAHE contrast enhancement method to augment underwater videos.
This paper proposes the UWV-Yolox model with five improvements. To verify the meaningfulness of each improvement and its ability to improve the performance of underwater video object detection, we conduct the experiments. Each improvement is individually added to the original UWV-Yolox model, and we then train and test the model on the UVODD dataset. The results are shown in Table 3.
It can be seen that each improvement improves the performance of the UWV-Yolox model, indicating their effectiveness in improving the model's ability to detect objects in underwater videos. Among all, the jitter loss has a particularly significant effect on detection. Jitter is a common phenomenon in video object detection, and weakening jitters through jitter loss is beneficial for improving video detection performance. The CLAHE method applied to enhance contrast in input videos and the frame-level optimization module applied to optimize detection results increase the detection mAP@0.5 by at least 1.2%. The addition of a CA mechanism to the model's backbone also improves the model's ability to detect objects, increasing the detection mAP@0.5 by 0.7%. The replacement of regression loss is not significantly improving the performance, as it is only 0.2% higher than the baseline. It indicates that the overlapping area plays a major role in evaluating the proximity between predicted and ground truth boxes.
We conduct ablation experiments to explore the effectiveness of augmenting videos using CLAHE, adding a CA mechanism, adding jitter loss, replacing regression loss, and adding a frame-level optimization module in the UWV-Yolox model. These improvements are tested separately in six models trained and tested on the same UVODD dataset, as shown in Table 4.
From Table 4, it can be seen that each improvement in the UWV-Yolox model improves the detection results. Considering that the input videos interfered with by the underwater environment are blurry and of low contrast, the underwater videos are first processed with CLAHE to enhance the contrast. It makes mAP@0.5 improved by 1.2%, indicating that enhancing the contrast improves the clarity of the object features in the videos. Next, the CA module is added to the feature extraction backbone network to augment the representations of the objects of interest, and the mAP@0.5 is 0.5% higher than when only augmenting the videos. Additionally, since jitters are inevitable during the detection of bounding boxes in videos, the jitter loss is added to the loss function, improving by 0.5% for detection. It indicates that jitters are a common occurrence in videos and an important optimization point for video detection. Subsequently, the IOU regression loss is replaced with
We compare the experimental results of adding each improvement individually and the ablation experiments. When comparing the result of directly adding jitter loss with adding jitter loss after data augmentation and adding the CA module, we found that the improvement of mAP@0.5 decreased from 2.4% to 0.5%. This indicates that data augmentation and the CA module lead to improvements in underwater video object detection and reduce the effectiveness of the jitter phenomenon as well. The same as the frame-level optimization. However, replacing the regression loss improved the detection mAP@0.5 from 0.2% in the separation experiment to 0.8% in the ablation experiment. It suggests that, after reducing the effectiveness of jitter, the overlapping area is not sufficient to evaluate the proximity between the predicted and ground truth boxes, so the influence of position and shape is also significant.
To verify the effectiveness of the proposed UWV-Yolox model in underwater video object detection, we conduct some comparative experiments to compare the UWV-Yolox model with other detection models on the UVODD dataset. The results are shown in Table 5.
From the table, it can be seen that the mAP@0.5 of the UWV-Yolox model reaches 89.0%, the highest among all models, which indicates that the model has the best detection performance for underwater videos. Compared to the TransVOD_Lite video detection model, the UWV-Yolox model has higher mAP@0.5 and faster speed. This indicates that the UWV-Yolox model performs better in detecting underwater videos with low contrast and blurriness. As for the Boosting R-CNN, an underwater detection model, the UWV-Yolox model increases the mAP@0.5 by 11.6% and fastens the detection speed. It shows that the UWV-Yolox model performs better for underwater scenes and makes full use of the video information. The UWV Yolox model improves the mAP@0.5 by 6.7% compared to the Yolov5 model but slows down the detection speed. The process of augmenting underwater videos and the optimization of detection results using frame relationships in the UWV-Yolox model benefit for detecting objects in the underwater video but also consume more detection time, which is not conducive to real-time detection. The YOLOV model learns the spatiotemporal relationships in videos during model training, but its generalization ability for underwater scenes is weak. The UWV-Yolox model not only utilizes video information but also improves the detection of underwater scenes, so it increases the mAP@0.5 by 4%, requiring more detection time. Compared to the Yolox model, the UWV-Yolox model improves the mAP@0.5 by 3.2% with close detection speed. While ensuring detection speed, It improves the detection performance for underwater videos without slowing down the detection speed.
Figure 8 shows the detection results of various detection models for underwater video object detection. The red ellipses highlight the superior performance of the UWV-Yolox model compared with the detection results with the black ellipses. It can be found that most of the confidence of objects detected by the UWV-Yolox model is higher than the confidence detected by other models. A few objects show a slight decrease in confidence due to frame-level optimization, which uses high-confidence objects to optimize the same object with low confidence in the adjacent frames. Moreover, the UWV-Yolox model can detect objects that cannot be detected by other models, such as partially occluded objects, incompletely shaped objects, and blurry objects. This indicates that augmenting the underwater video and integrating the CA module to utilize the position relationship of blurry objects in the UWV-Yolox model is beneficial for improving performance in underwater videos. Compared to other models, the UWV-Yolox model also reduces wrong object detection, such as detecting one object as multiple objects. In addition, the predicted bounding boxes of the UWV-Yolox model more accurately locate the objects, and the changes of bounding boxes between adjacent frames are smoother, suppressing the jitters.
Overall, the UWV-Yolox model improves the performance of object detection in underwater videos. It improves the classification confidence of most objects. It also enhances the detection ability for occluded, incomplete, and blurry objects. Furthermore, the UWV-Yolox model makes the predicted bounding boxes closer to the ground truth bounding boxes and smoothens the changes of the bounding boxes between adjacent frames, suppressing the jitters. Although the detection speed of the UWV-Yolox model is not fast, it has reached the level of real-time detection. However, the frame-level optimization module reduces the confidence of some high-confidence objects by averaging the confidence of the same tubelets to optimize the low-confidence objects. Additionally, the augmentation of underwater video and the frame-level optimization module slow down the detection speed.
We propose the UWV-Yolox model to address the issue of underwater video object detection. In the model, we use the Contrast Limited Adaptive Histogram Equalization method to enhance video contrast. In the backbone for feature extraction, we propose a new CSP_CA module by adding a Coordinate Attention module to the CSP module. As for the loss function, we add jitter loss to optimize the prediction of object detection boxes in the design of the loss function. Frame-level optimization module is also added to optimize the detection results. By conducting experiments, we validate the effectiveness of the proposed UWV-Yolox model. On the UVODD dataset built in the paper, the mAP@0.5 of the UWV-Yolox model reaches 89.0%, which is 3.2% better than the original Yolox model.
UWV-Yolox provides a new perspective for solving the task of underwater video object detection, and the improvements of the UWV-Yolox can be flexibly applied to other models to improve the detection results. The UWV-Yolox model will soon also be applied to an eco-environment research program for the Environmental Protection Agency to explore underwater biodiversity and environmental stability in rivers.
In the future, we plan to continue cooperating with the Environmental Protection Agency to obtain more underwater video data to increase the dataset required for training. Furthermore, future research will make more in-depth lightweight improvements on UWV-Yolox, thus fastening the detection speed and reducing the demand for hardware configuration to expand the application scenarios of the model.
Graph: Figure 1 UWV-Yolox structure.
Graph: Figure 2 Process of the CLAHE method.
Graph: Figure 3 Coordinate attention module.
Graph: Figure 4 CSP_CA module and CSP module.
Graph: Figure 5 Effect of the frame-level optimization module.
Graph: Figure 6 Results of re-coordinating. The red circle refers to the coordinates with significant differences before and after re-coordinating.
Graph: Figure 7 Results of video augmentation.
Graph: Figure 8 The detection results of different underwater video object detection models.
Table 1 UVODD dataset.
The Number of Videos The Number of Images The Number of Objects Train 59 6886 15,141 Test 15 1773 5930 Sum 74 8659 21,071
Table 2 Results with different augmentation methods.
Model Augmentation Methods mAP@0.5(%) Yolox None 85.8 Improved Yolox Fusion [ 81.4 (−4.4) Improved Yolox Cycle-SNSPGAN [ 82.9 (−2.9) Improved Yolox RGSH 86.0 (+0.2)
Table 3 Results of the original UWV-Yolox model with individual improvement.
Methods mAP@0.5(%) Parameters baseline 85.8 99.00 M + Regression loss 86.0 (+0.2) 99.00 M + Coordinate Attention 86.5 (+0.7) 82.66 M + CLAHE 87.0 (+1.2) 99.00 M + Frame-level optimization 87.3 (+1.5) 99.00 M + Jitter loss 88.2 (+2.4) 99.00 M
Table 4 Results of the ablation experiments.
CLAHE CA Module Jitter Loss Regression Loss Frame-Level Optimization mAP@0.5(%) 85.8 ✓ 87.0 (+1.2) ✓ ✓ 87.5 (+0.5) ✓ ✓ ✓ 88.0 (+0.5) ✓ ✓ ✓ ✓ 88.8 (+0.8) ✓ ✓ ✓ ✓ ✓ 89.0 (+0.2)
Table 5 Results of different detection models.
Model Backbone Input Size Batch Size mAP@0.5(%) FPS TransVOD_Lite [ ResNet-101 600 × 600 4 69.0 14.9 Boosting R-CNN [ ResNet-50 1333 × 800 4 77.4 25.4 Yolov5 CSPDarknet 640 × 640 32 82.3 156.2 YOLOV [ CSPDarknet 640 × 640 32 85.0 104.1 Yolox CSPDarknet 640 × 640 16 85.8 75.6
Conceptualization, J.L.; methodology, H.P., J.L., H.W., Y.L., M.M. and D.Z.; software, J.L.; validation, J.L.; formal analysis, J.L.; investigation, J.L.; resources, H.P. and J.L.; data curation, J.L.; writing—original draft preparation, J.L.; writing—review and editing, H.P., H.W., Y.L., M.Z. and X.Z.; visualization, J.L.; supervision, H.P.; project administration, H.P. All authors have read and agreed to the published version of the manuscript.
Not applicable.
Not applicable.
Not applicable.
The authors declare no conflict of interest.
By Haixia Pan; Jiahua Lan; Hongqiang Wang; Yanan Li; Meng Zhang; Mojie Ma; Dongdong Zhang and Xiaoran Zhao
Reported by Author; Author; Author; Author; Author; Author; Author; Author