Traffic sign recognition is a key module of autonomous cars and driver assistance systems. Detection accuracy and inference time are its two most important parameters. Some current methods for traffic sign recognition are very accurate but do not meet the requirement for real-time detection, while others are fast enough for real-time detection but fall short in accuracy. This paper proposes an accuracy improvement to the YOLOv3 network, which is a very fast detection framework. The proposed method improves the detection of traffic signs that are small relative to the image size and helps to reduce false positives and miss rates. In addition, we propose an anchor box selection algorithm that helps in finding the optimal size and scale of the anchor boxes. The proposed method therefore supports the real-time detection of small traffic signs and helps to achieve a balance between accuracy and inference time. The proposed network is evaluated on two publicly available datasets, namely the German Traffic Sign Detection Benchmark (GTSDB) and the Swedish Traffic Sign dataset (STS), and the results show that the proposed approach achieves a decent balance between mAP and inference time.
Keywords: YOLOv3; traffic sign detection; small objects detection; anchor box selection
Recent advancements in technology have moved our society towards intelligent transportation systems. Such systems allow drivers to assess the road conditions ahead of time, resulting in fewer human errors and accidents. Today's modern cars incorporate Advanced Driver Assistance Systems (ADAS) such as collision warning, human detection, de-raining systems, and de-hazing systems, which improve the quality of daily life. Future commercial driverless cars or intelligent vehicles are likely to be equipped with self-localization, scene understanding, path planning, and collision avoidance capabilities. A car that can adjust its speed according to the speed limit signs on the road and navigate to its destination safely is in demand. The prime requirement for such cars is an accurate and real-time traffic sign recognition system.
Typically, the traffic sign recognition system is divided into two steps; (
Classical feature extraction methods based on gradients, color, and texture use contour, color, and texture features to locate traffic signs. Nonetheless, these features do not meet current demands: they change with illumination conditions, and the presence of objects with similar shapes or colors in the background can cause false detections. Hence, researchers now focus on convolutional neural network (CNN) based features. CNN-based detectors include Region Proposals networks such as Fast R-CNN [[
For an autonomous car, inference time is as crucial as accuracy: faster inference gives more reaction time to make timely and appropriate decisions and prevent fatal accidents. A 100% accurate method is useless if it cannot detect traffic signs in time. Similarly, a real-time detector with limited accuracy is of little use. Thus, researchers are working to achieve optimal values for both accuracy and inference time, making traffic sign detection faster and more precise. In this paper, we propose an improved single-pass CNN-based YOLOv3 network framework for real-time traffic sign recognition, which is more accurate and faster at detecting traffic signs than the state-of-the-art methods.
The main contributions of this paper focus on the framework of the YOLOv3 algorithm and are summarized as follows:
- Tweaked YOLOv3 model for smaller object detection: The YOLOv3 model uses multiple DBL layers for object detection. For objects that are large relative to the image size, these DBL layers are sufficient; however, for a smaller object such as a traffic sign, useless features are learned. We propose pruning a few of those layers and validate the rationale on GTSDB. This improvement helps in extracting and preserving the fine details of traffic signs. In addition, a new strategy for training and testing is proposed: instead of using the whole image at once, the image is broken into patches, and those patches are used for training and testing. We report an accuracy increase of 14% over the default YOLOv3, along with fewer false detections and a lower log-average miss rate. We also evaluated the proposed network framework on two publicly available datasets, namely GTSDB (German Traffic Sign Detection Benchmark) and the STS (Swedish Traffic Sign) dataset [[9]], giving a 16% and 5% rise in mAP, respectively.
- Regressive anchor box selection: While analyzing the traffic sign size distribution in the German and Swedish traffic sign training sets, we noticed that the majority of the traffic signs are small, with sizes concentrated in the range from 20 pixels to 40 pixels. The base technique in YOLOv3 uses k-means clustering to select the anchors. We propose to make this selection adaptive using a regression model. We designed a cost function that adds more weight (by assigning a higher number of clusters) to the part of the bounding box size distribution where the majority of the traffic signs are concentrated. This lets us select most of the anchors from the pixel range that contains most of the traffic sign sizes and fewer anchors from the less concentrated regions. The cost function helps the regression model adapt to the traffic sign sizes of any dataset. As a result, the detection accuracy on GTSDB increased by a further 2% on top of the accuracy gained with the tweaked YOLOv3 detector. We noted that this 2% increase was due to the more accurate placement of anchors on test samples. The proposed model is adaptive and can be used with any object.
- Focal loss: In this research, we also investigated the effect of incorporating focal loss [[10]] for the objectness score. We note that the hyper-parameters of focal loss depend on object shape as well as object size. The optimal values of alpha and gamma for traffic signs have also been determined. Our proposed method achieved higher mean Average Precision (mAP) and equal Log Average Miss Rate (LAMR) as compared with the Focal loss implemented YOLOv3 detector.
The rest of the paper is organized as follows: Section 2 discusses some recent related works. Section 3 and Section 4 discuss the proposed methodology and experimental results, respectively. Finally, Section 5 concludes the paper.
Traffic sign recognition has been a hot research field for the last decade. It is an essential module for autonomous cars and advanced driver assistance systems (ADAS). Various approaches have been proposed for accurate and real-time traffic sign detection. Since traffic signs follow standard colors and shapes, early approaches were based on color and shape detection. Researchers exploited these features to detect and recognize traffic signs. Gupta and Choudhary [[
The authors in [[
Some research works addressed the aforementioned issues using the standard shape and color of a traffic sign as its unique identity. The authors in [[
Recent developments in deep learning have shown that CNNs are more effective for traffic sign detection than handcrafted features. CNN-based detectors fall into two categories: two-stage and single-stage detectors. The former are more robust and detect traffic signs with higher accuracy than the latter. Jia Li et al. [[
Single-stage detectors determine bounding box coordinates and classification scores simultaneously in a single run, hence improving inference speed. By design, their architectures avoid region proposals and sliding windows. Lee and Kim [[
Overall, most of the proposed detectors are either accurate or fast, and a good trade-off between accuracy and processing time is still needed. Since these are the crucial parameters of a detection system, further research is required to find an optimal way to detect traffic signs precisely in minimum time. Furthermore, accurate detection of small traffic signs is also important. Traffic signs have standard sizes, but their apparent sizes vary with the distance between the camera and the sign: the farther a sign, the smaller it appears; the closer a sign, the bigger it appears. Therefore, for the accurate detection of faraway traffic signs, there is a dire need for appropriate anchor boxes.
YOLOv3 is a state-of-the-art single-pass fast object detection network. It processes a complete image at once by implicitly using contextual information about shape, size, and structure. YOLOv3 network is an amalgam of residual and feature pyramid networks which significantly benefits detection and classification-related tasks. Nonetheless, it still lags in terms of detection accuracy of smaller objects with reference to image size. For COCO dataset [[
Furthermore, the random selection of initial K-means centroids in YOLOv3 sometimes results in good accuracy and sometimes in poor accuracy. Hence, multiple executions are required to determine the optimal values. The size and scale of the anchor boxes have a major effect on Average Precision. Unlike Faster RCNN, YOLOv3 is not trained on regions of interest (ROIs); rather, the network fits the chosen anchor boxes over the objects to be detected in different segments of the image. This fitting takes a long time when the anchor boxes have improper sizes and scales. Besides, the network sometimes misses true detections and outputs many false detections, leading to lower Average Precision values.
To detect smaller objects with higher accuracy and fewer false detections caused by improper bounding box placement, we propose an improved YOLOv3 network along with an anchor box selection method. The block diagram of the proposed method is shown in Figure 1. There are two improvements to the YOLOv3 network: pruning of the network layers, and replacement of the default anchor box algorithm with the regression-based anchor box selection. In addition, the input image is divided into patches of 400 × 400 pixels, which are passed through the improved network for traffic sign detection. The output patches are consolidated into the original input image size, and redundant detections are removed by a non-maximum suppression block.
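The final non-maximum suppression step can be illustrated with a minimal sketch. It assumes detections are already mapped back to full-image coordinates as `[x1, y1, x2, y2]` boxes with confidence scores; the greedy IoU-based procedure, the function names, and the 0.5 threshold are illustrative assumptions, not the exact block used in the proposed pipeline.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it, repeat.

    iou_threshold=0.5 is an assumed value, not one reported in the paper.
    Returns the indices of the kept boxes in descending score order.
    """
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        ious = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[ious < iou_threshold]
    return keep
```

Two heavily overlapping detections of the same sign, for instance one from each of two neighbouring patches, collapse to the higher-scoring one, while a detection elsewhere in the image is left untouched.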
As detection accuracy depends on the correct localization of an object inside an image, anchor boxes of optimal size and shape are necessary. The designer of YOLOv3 proposed using K-means clustering for anchor box selection. This method takes the dimensions of the ground truth boxes into account very well but fails to capture the density of the ground truth distribution. Note that the distribution of ground truth bounding box dimensions differs from dataset to dataset: in some, the majority of bounding box dimensions are smaller than 20 pixels, while in others they are greater than 60 pixels. A reasonable approach is to assign the majority of anchor boxes to the range where bounding box dimensions are most concentrated, and fewer to the sparse ranges, to achieve better localization.
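For reference, the K-means baseline can be sketched as follows. This is a minimal illustration under stated assumptions: boxes are clustered on (width, height) only, similarity is the IoU of boxes sharing a corner (the distance popularized by the YOLO family), and cluster centres are updated with the mean. The function names, the seed handling, and the iteration cap are hypothetical and do not reproduce any particular implementation.

```python
import numpy as np

def wh_iou(boxes, centroids):
    """IoU between (w, h) pairs, assuming all boxes share the same top-left corner."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0])
             * np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    area_b = (boxes[:, 0] * boxes[:, 1])[:, None]
    area_c = (centroids[:, 0] * centroids[:, 1])[None, :]
    return inter / (area_b + area_c - inter)

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster ground-truth (w, h) pairs into k anchors, sorted by area."""
    boxes = np.asarray(boxes, dtype=float)
    rng = np.random.default_rng(seed)
    # Random initial centroids: different seeds can give different anchors.
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(wh_iou(boxes, centroids), axis=1)
        new = centroids.copy()
        for j in range(k):
            members = boxes[assign == j]
            if len(members):  # an empty cluster keeps its previous centre
                new[j] = members.mean(axis=0)
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids[np.argsort(centroids[:, 0] * centroids[:, 1])]
```

Because every box contributes equally to the assignment, nothing in this procedure forces more anchors into the size range where most ground truth boxes lie, which is the limitation discussed in the text.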
To achieve this, we analyzed the bounding box dimension distribution of the GTSDB training set, as shown in Figure 2. The figure shows the bounding box dimensions on the x-axis and the probability on the y-axis. The probability highlights the areas where most bounding boxes are concentrated. Furthermore, the figure also shows the cost function fitted to the bounding box distribution. The cost function can be explained with the following Equations (
(
(
(
where W is the probability-weighted number of clusters, S represents a peak value for a maximum number of clusters,
The cost function fitted to the bounding box distribution represents the amount of focus our proposed algorithm places on each region when finding the anchor boxes. Higher concentration areas of the bounding box distribution receive more focus than lower concentration areas. In the proposed method, we translate this focus into the number of probability-weighted clusters W that exist, while S defines the maximum number of clusters that may form. Hence, the nearer we are to a high concentration area, the more clusters are formed, and vice versa. Once probability-weighted clusters from Equation (
By conducting experiments, we found that changing the number of elements per cluster also influences the subsequent anchor boxes. We denote this term as Omega and empirically found that when Omega is four, the algorithm yields the highest traffic sign detection accuracy. Finally, we predict the anchors from the linear model by taking random samples in the range from the lowest bounding box dimension to the highest bounding box dimension. Seventy percent of random samples were taken from the higher concentration regions and the remaining thirty percent from the lower concentration region. The split 70–30% was acquired by adding up the probabilities of higher concentration region (from pixel 18 to 50) and lower concentration region (from pixel 51 onwards). In Figure 3, it is observed that for all the values of Omega, Median cluster values in the higher concentration regions are more in number than lower concentration regions.
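The way the 70–30% split follows from the distribution can be sketched as below. The sketch assumes that the split is simply the share of bounding box dimensions falling in the high concentration range (18 to 50 pixels) versus the range above it, and that sample positions are drawn uniformly within each range; the function names and the uniform sampling are assumptions, and the linear model that turns these samples into anchors is not reproduced here.

```python
import numpy as np

def region_split(dims, high_lo=18, high_hi=50):
    """Share of box dimensions in the high (18-50 px) and low (>50 px) regions."""
    dims = np.asarray(dims, dtype=float)
    n_high = np.sum((dims >= high_lo) & (dims <= high_hi))
    n_low = np.sum(dims > high_hi)
    total = n_high + n_low
    return n_high / total, n_low / total

def draw_samples(dims, n_samples=100, high_lo=18, high_hi=50, seed=0):
    """Draw sample positions so each region gets samples in proportion to its share."""
    dims = np.asarray(dims, dtype=float)
    p_high, _ = region_split(dims, high_lo, high_hi)
    n_high = int(round(p_high * n_samples))
    rng = np.random.default_rng(seed)
    upper = max(dims.max(), high_hi)  # guard for datasets with no large boxes
    high = rng.uniform(high_lo, high_hi, size=n_high)
    low = rng.uniform(high_hi, upper, size=n_samples - n_high)
    return high, low
```

With a training set in which about 70% of the dimensions lie between 18 and 50 pixels, `draw_samples` places about 70% of the query points in that range, mirroring the split described in the text.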
The nine anchors obtained from the proposed regressive anchor selection scheme give us the detection accuracy of 93.09% on the GTSDB. We note that the proposed method learns the sizes of bounding box dimensions (i.e., minimum and maximum values) from the training set. In addition, it also learns the object dimension i.e., horizontal, vertical, or square object, which enables us to find the most pertinent anchors necessary for object detection and localization.
YOLOv3 is a convolutional neural network that finds objects at three different scales of the input image, similar to a feature pyramid network. YOLOv3 uses the darknet-53 network as a feature extractor and seven additional convolutional layers at each stage for detection. The output feature map of the deepest level is upsampled by a factor of two and concatenated with a shallower feature map for detecting smaller objects in an image. This upsampling takes place twice in the network to locate objects of different sizes. The YOLOv3 network is shown in Figure 4a, where each convolutional layer is followed by batch normalization and a Leaky ReLU activation function, represented by the DBL block.
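The three detection scales can be made concrete with a short calculation. The strides of 32, 16, and 8 and the three anchors per scale are the standard YOLOv3 configuration and are assumed here rather than taken from this paper; the function name is hypothetical.

```python
def yolo_output_grids(input_size=416, strides=(32, 16, 8), anchors_per_scale=3):
    """Grid size at each detection scale and total number of predicted boxes."""
    grids = [input_size // s for s in strides]
    total_boxes = sum(g * g * anchors_per_scale for g in grids)
    return grids, total_boxes

grids, total_boxes = yolo_output_grids()
print(grids, total_boxes)  # [13, 26, 52] 10647
```

Each upsampling by a factor of two doubles the grid resolution (13 to 26 to 52 for a 416 × 416 input), and the finest grid is the one responsible for the smallest objects.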
Generally, traffic signs are smaller objects compared with other objects in an image. In the
We reduced the stack of five DBL layers at each detection level to two DBL layers to make the network shallower, as shown in Figure 4b. This reduction of the DBL block stack at each detection stage yielded fewer false positives and a lower log-average miss rate (LAMR), thereby improving the mean Average Precision and the overall performance of the YOLOv3 network. We therefore conclude that, for small objects like traffic signs in road scenes, five DBL layers are redundant and the output feature maps of deeper networks are not required.
During training and testing, the network input size was 416 × 416 pixels: each input image was rescaled to 416 × 416 pixels before being forwarded to the network. Rescaling a large image (e.g., a 1360 × 800 pixel image in GTSDB) to 416 × 416 pixels makes small objects such as traffic signs very tiny, as shown in Figure 5. This results in the loss of discriminative features; hence, it becomes difficult for the trained network to detect such minute objects.
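A quick calculation shows how severe this shrinkage is. The sketch below assumes either a plain resize (each axis scaled independently) or an aspect-preserving resize with padding; which of the two the implementation uses is not stated here, so both are shown, and the function name is hypothetical.

```python
def resized_size(obj_w, obj_h, img_w, img_h, target=416, keep_aspect=False):
    """Apparent object size after resizing the image to target x target pixels."""
    sx, sy = target / img_w, target / img_h
    if keep_aspect:
        # Aspect-preserving resize: one common scale, the rest is padding.
        sx = sy = min(sx, sy)
    return obj_w * sx, obj_h * sy

# Smallest GTSDB sign (16 x 16 px) in a 1360 x 800 px image:
print(resized_size(16, 16, 1360, 800))                    # about (4.89, 8.32)
print(resized_size(16, 16, 1360, 800, keep_aspect=True))  # about (4.89, 4.89)
```

In both cases a 16 × 16 pixel sign is reduced to fewer than 9 pixels along each side, which leaves very little detail for the network to learn from.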
To cope with this, we propose to break the input image into patches. Splitting the image in patches was also proposed in [[
The network was trained with patches of training images of size varying from
As an example, consider the 1360 × 800 pixel image shown in Figure 6a. It has one traffic sign with bounding box annotations [
Similarly, the test images were forwarded into the network in the form of patches of size
The proposed patch-wise training helped in retaining the fine features of traffic signs, which were lost because of the re-sizing of an input image to a lesser number of pixels. The proposed method improved recall percentage by 20% and subsequent detection accuracy by 13% as compared with the default rescaling.
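The patch-wise strategy can be sketched as follows. The sketch assumes fixed 400 × 400 pixel patches laid out on a regular grid, with the last patch in each row and column shifted back so that it stays inside the image; this tiling rule and all function names are assumptions, since the exact layout is not spelled out here.

```python
def patch_origins(img_w, img_h, patch=400):
    """Top-left corners of patches that together cover the whole image."""
    def starts(length):
        if length <= patch:
            return [0]
        s = list(range(0, length - patch + 1, patch))
        if s[-1] + patch < length:
            s.append(length - patch)  # shifted last patch (overlaps its neighbour)
        return s
    return [(x, y) for y in starts(img_h) for x in starts(img_w)]

def box_to_patch(box, origin, patch=400):
    """Clip an image-space [x1, y1, x2, y2] box to a patch; None if outside."""
    ox, oy = origin
    x1, y1 = max(box[0] - ox, 0), max(box[1] - oy, 0)
    x2, y2 = min(box[2] - ox, patch), min(box[3] - oy, patch)
    if x2 <= x1 or y2 <= y1:
        return None
    return [x1, y1, x2, y2]

def box_to_image(box, origin):
    """Map a patch-space detection back to full-image coordinates."""
    ox, oy = origin
    return [box[0] + ox, box[1] + oy, box[2] + ox, box[3] + oy]
```

Under this tiling, a 1360 × 800 pixel image yields eight patches. A sign that falls in the overlap between two patches is detected twice, which is why the detections mapped back with `box_to_image` still need a duplicate-removal step.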
YOLOv3 computes four loss functions, namely: object confidence loss, classification loss, bounding box centroid loss, and width-height loss. The bounding box width-height loss is computed as the mean of squared errors between predictions and labels. The other loss functions are computed using Binary Cross-Entropy (BCE) loss. In the proposed network, the loss function for confidence score loss is computed using the focal loss function [[
Focal loss function in Equation (
(
(
where
The proposed approach is implemented using Keras with Tensorflow backend using GitHub repository [[
The network training process can be divided into two steps; first, the Darknet53 network was fixed and the rest of the network was trained for ten epochs with a batch size of thirty-two keeping the learning rate to
The proposed network was trained and tested on two different datasets, namely the German Traffic Sign Detection Benchmark (GTSDB) and the Swedish Traffic Sign (STS) dataset. GTSDB is a widely used dataset containing 600 training images and 300 testing images of size 1360 × 800 pixels. The size of traffic signs in the images ranges from 16 × 16 pixels to 128 × 128 pixels, while the number of traffic signs in an image varies from 0 to 6. The dataset includes traffic signs of three superclasses: danger, mandatory, and prohibitory.
STS is a larger and more complex dataset than GTSDB, with 20,000 images of size 1280 × 960 pixels, of which 20% are annotated. The size of the traffic signs in an image varies from 12 × 12 pixels to 156 × 156 pixels. The dataset was gathered on Swedish highways and city roads with a 1.6-megapixel camera. The dataset can be divided into three superclasses: danger, mandatory, and prohibitory. In the experiments, Part0 of Set1 is used as the training set and Part0 of Set2 is used as the test set, considering only visible traffic signs.
Experiments were performed on the default YOLOv3 network and the proposed tweaked YOLOv3 network with two sets of anchor boxes. The first set was obtained from the default YOLOv3 anchor box selection algorithm, and the second set was obtained from the proposed regressive anchor box selection algorithm for the GTSDB dataset. Both sets of anchor boxes were tested on the default and the proposed YOLOv3 networks. Here, the default YOLOv3 network refers to the default method of image resizing, and the proposed YOLOv3 network refers to the pruned network with the patch-wise technique. The comparative results are given in Table 1. They show that the proposed regressive algorithm improves both the recall percentage and the AUC, and that the proposed tweaked network model reaches a recall of nearly 100% (98.13% with the regressive anchors).
Experiments were performed on the YOLOv3 network using the GTSDB dataset to find the optimal number of layers for traffic sign detection. The effect of removing redundant DBL layers from the network is illustrated in Figure 8, in terms of mean Average Precision (mAP) and Average Precision for each category. The mAP follows a bell-shaped trend: it improves as DBL layers are removed, reaches a maximum value of 93.09%, and then drops as further layers are removed. The updated network was also evaluated on the STS dataset, resulting in a 5% rise in mAP over the original model, as shown in Figure 8. Figure 9 and Figure 10 illustrate the precision-recall curves of the original and proposed YOLOv3 algorithms for all three classes of the GTSDB and STS datasets, respectively. Here, the original YOLOv3 algorithm refers to the default method of image resizing, and the proposed algorithm refers to the pruned network with the patch-wise technique. For this experiment, the regression anchor boxes are used.
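As a reminder of how these metrics relate, the sketch below computes Average Precision as the area under a precision-recall curve and mAP as the mean of the per-class values. The all-point interpolation used here is one common convention and is an assumption; the interpolation scheme actually used in the evaluation is not specified here, and the function name is hypothetical.

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the precision-recall curve with all-point interpolation.

    `recall` must be sorted in increasing order, with matching `precision` values.
    """
    r = np.concatenate(([0.0], np.asarray(recall, dtype=float), [1.0]))
    p = np.concatenate(([0.0], np.asarray(precision, dtype=float), [0.0]))
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP is the mean of the per-class Average Precision values, e.g. for the
# proposed network on GTSDB (danger, mandatory, prohibitory):
per_class_ap = [94.49, 90.06, 94.72]
print(round(float(np.mean(per_class_ap)), 2))  # 93.09
```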
Training the network with image patches helped to achieve better mAP and inference time. Resizing a
Table 2 compares the mAP and inference time of different state-of-the-art methods on the GTSDB dataset. Our method has the lowest inference time (0.04 s) and the second-highest mAP (93.09%); only Mask RCNN reports a higher mAP (96.16%), at an inference time of 0.32 s. Our proposed network model can detect even blurred signs in addition to the clearly visible ones. It also successfully detected small traffic signs in the test images. Figure 12 and Figure 13 show some qualitative detection results from the STS and GTSDB datasets, respectively.
Focal loss is a modified version of Cross-Entropy (CE) loss. It is widely used in object classification problems. CE loss can be used for n number of object classes. For
Experiments prove that the optimal values of alpha and gamma suggested in [[
With alpha held constant, increasing the value of the hyper-parameter gamma above 1 lowers the mAP, as shown in Table 3; hence, for the traffic sign dataset, 0.75 and 1 are the optimal values of the hyper-parameters alpha and gamma, respectively. Even with these optimal values, the mAP (92.41%) was 0.68% lower than that of the proposed network without focal loss (93.09%). Therefore, we conclude that the proposed tweaked network with the regressive anchor box selection technique outperforms the focal loss adjusted network in small traffic sign detection.
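The role of the two hyper-parameters can be seen in a minimal NumPy sketch of binary focal loss. It follows the standard alpha-balanced form from the original focal loss work, which is assumed here and may differ in detail from the equations used in this paper; the function name and the clipping constant `eps` are illustrative.

```python
import numpy as np

def focal_loss(y_true, p_pred, alpha=0.75, gamma=1.0, eps=1e-7):
    """Element-wise binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t).

    alpha=0.75 and gamma=1.0 are the values reported as optimal for traffic signs.
    """
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # class-balancing weight
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)
```

Setting gamma to 0 removes the modulating factor and leaves an alpha-weighted cross-entropy, while a larger gamma shrinks the loss of well-classified examples more strongly, so training concentrates on hard examples.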
This paper addresses the problem of accurate and real-time traffic sign recognition. We propose a regressive anchor box selection algorithm that suggests the best-fit anchor set, drawn mainly from the regions of the dataset where traffic sign sizes are most concentrated. The obtained anchors improve precision and recall and thus the mean Average Precision of the network. Furthermore, we propose a modified YOLOv3 network, which is faster and more accurate than the state-of-the-art methods. Pruning the higher-level feature maps reduces false positives and lowers the log-average miss rate. The proposed network and the approach for anchor box determination aid in obtaining a decent balance between accuracy and inference time.
Although focal loss adjustments help in detecting smaller objects in an image, they do not improve our modified network. Our proposed network is robust enough to detect small traffic signs without focal loss adjustment. We also found that the optimal values of alpha and gamma for the traffic sign dataset are 0.75 and 1, respectively. The proposed method is evaluated on GTSDB [[
Figure 1 Proposed block diagram for small objects (traffic signs) in an image. The dotted rectangle is the proposed technique and the shaded blue blocks are proposed modifications.
Figure 2 Distribution of traffic sign Bounding Boxes dimension with fitted cost function on the training set of GTSDB (a) Distribution of X Bounding Box dimension with fitted cost function (b) Distribution of Y Bounding Box dimension with fitted cost function.
Figure 3 Regression models for finding the anchor boxes. (a) regressive model predictions for anchors when omega is 4 (b) regressive model predictions for anchors when omega is 7 (c) regressive model predictions for anchors when omega is 9 (d) regressive model predictions for anchors when omega is 11 (Omega represents number of elements per cluster).
Figure 4 Network models (a) original YOLOv3 network (b) tweaked YOLOv3 network – modifications are highlighted with blue dotted squares.
Figure 5 Effect of larger image resizing on traffic sign features.
Figure 6 Patch-wise detection strategy (a) train image patch (b) test image patches.
Figure 7 Effect of varying alpha and gamma on mAP (a) when gamma is set to 1 and alpha is varied (b) when gamma is varied keeping optimal value of alpha equal 0.75 constant.
Figure 8 (a): Effect of network layers on Average Precision for GTSDB dataset. (b): Effect of layers pruning on Average Precision and mean Average Precision for STS dataset.
Figure 9 Precision-recall curves of STS dataset for Original and proposed YOLOv3 network (a) Danger class (b) Mandatory class (c) Prohibitory class.
Figure 10 Precision-recall curves of GTSDB dataset for Original and proposed YOLOv3 network (a) Danger class (b) Mandatory class (c) Prohibitory class.
Figure 11 Effect of training technique on mAP and inference time for GTSDB dataset.
Figure 12 Detection results on STS Dataset (a) traffic sign recognition in different illumination conditions, (b) Small size traffic sign recognition, (c) A partial view traffic sign recognition, (d) Variable size Traffic Signs in an image.
Figure 13 Detection results on GTSDB dataset; (a–c) Small size traffic sign recognition, (d) Blurred traffic sign recognition.
Table 1 Comparison of anchor box algorithms and YOLOv3 network model.
| Anchor Box Selection Algorithm | Evaluation Metric | Default YOLOv3 Network | Proposed YOLOv3 Network |
|---|---|---|---|
| Default-Kmeans | Recall percentage | 77.23% | 96.83% |
| Default-Kmeans | AUC | 75.57% | 91.82% |
| Ours-Regressive | Recall percentage | 85.32% | 98.13% |
| Ours-Regressive | AUC | 78.34% | 93.09% |
Table 2 Comparison with state-of-the-art methods for GTSDB dataset.
| Methods | mAP | Inference Time |
|---|---|---|
| SSD + FPN + ITA [ | 80.30% | - |
| Faster RCNN-Mobilenets [ | 84.50% | 0.13 s |
| Mask RCNN [ | 96.16% | 0.32 s |
| Ours | 93.09% | 0.04 s |
Table 3 mAP results for focal loss implementation on GTSDB dataset. Top result in each class are highlighted in bold.
| Gamma | Alpha | Danger | Mandatory | Prohibitory | mAP | Mean LAMR |
|---|---|---|---|---|---|---|
| 0 | 0.25, 0.50, 0.75 | - | - | - | - | - |
| 1.0 | 0.25 | 97.92% | 57.92% | 95.20% | 83.68% | 0.16 |
| 1.0 | 0.50 | 97.61% | 97.61% | 93.62% | 91.70% | 0.04 |
| 1.0 | 0.75 | 98.30% | 85.01% | 93.92% | 92.41% | 0.01 |
| 1.0 | 0.80 | 91.17% | 82.55% | 95.77% | 89.83% | 0.07 |
| 1.0 | 0.85 | 97.82% | 73.21% | 95.50% | 88.84% | 0.16 |
| 1.0 | 0.99 | 90.00% | 74.22% | 94.70% | 86.31% | 0.09 |
| 1.2 | 0.75 | 96.75% | 70.95% | 91.45% | 86.38% | 0.05 |
| 1.5 | 0.75 | 93.63% | 85.52% | 94.96% | 91.37% | 0.04 |
| Ours | | 94.49% | 90.06% | 94.72% | 93.09% | 0.11 |
Conceptualization, Y.R., H.A., D.M.S.B. and W.T.T.; Data curation, Y.R., H.A., D.M.S.B. and W.T.T.; Formal analysis, Y.R., H.A., D.M.S.B., W.T.T. and M.M.; Funding acquisition, M.A. and M.M.; Investigation, Y.R., H.A., D.M.S.B. and W.T.T.; Methodology, Y.R., H.A., D.M.S.B. and W.T.T.; Project administration, M.M.; Supervision, M.A.; Validation, Y.R., H.A., D.M.S.B. and M.A.; Visualization, Y.R., H.A., D.M.S.B., M.A. and M.M.; Writing—original draft, Y.R., H.A., D.M.S.B., W.T.T. and M.M.; Writing—review & editing, W.T.T., M.A. and M.M. All authors have read and agreed to the published version of the manuscript.
This research has been financially supported by The Analytical Center for the Government of the Russian Federation (Agreement No. 70-2021-00143 dd. 01.11.2021, IGK 000000D730321P5Q0002). This work was also supported by the Higher Education Commission of Pakistan under the National Research Program for Universities grant number 8348.
Not applicable.
Not applicable.
This research has been conducted on publicly available datasets, namely German Traffic Sign dataset and Swedish Traffic Sign dataset, which can be accessed using following links:
The authors declare no conflict of interest.
The following abbreviations are used in this manuscript:
BCE: Binary cross entropy
LAMR: log-average miss rate
mAP: mean Average Precision
FPPI: False Positive Per Image
By Yawar Rehman; Hafsa Amanullah; Dost Muhammad Saqib Bhatti; Waqas Tariq Toor; Muhammad Ahmad and Manuel Mazzara