In the domain of remote sensing research, the extraction of roads from high-resolution imagery remains a formidable challenge. In this paper, we introduce an advanced architecture called PCCAU-Net, which integrates Pyramid Pathway Input, CoordConv convolution, and Dual-Input Cross Attention (DCA) modules for optimized performance. First, the Pyramid Pathway Input equips the model to identify features at multiple scales, markedly enhancing its ability to discriminate between roads and other background elements. Second, by adopting CoordConv convolutional layers, the model achieves heightened accuracy in road recognition and extraction against complex backdrops. Third, the DCA module serves dual purposes: at the encoder stage, it efficiently consolidates feature maps across scales, fortifying the model's road detection capabilities while mitigating false positives; in the skip connection stages, it further refines the continuity and accuracy of the features. Extensive empirical evaluation substantiates that PCCAU-Net significantly outperforms existing state-of-the-art techniques on multiple benchmarks in terms of precision, recall, and Intersection-over-Union (IoU). Consequently, PCCAU-Net not only represents a considerable advancement in road extraction research, but also demonstrates vast potential for broader applications, such as urban planning and traffic analytics.
Keywords: road extraction; pyramid pathway input; CoordConv; Dual-Input Cross Attention (DCA) Modules
In the swiftly advancing field of remote sensing technology, we now have unprecedented access to high-resolution images of the Earth's surface, offering invaluable insights into geography.
Hybrid networks like ResU-Net have further combined the advantages of U-Net and ResNet, stabilizing gradients while facilitating information exchange through shortcut connections.
To surmount the persistent challenges associated with road extraction in high-resolution remote sensing imagery, researchers have devised a plethora of deep learning methodologies tailored specifically to remote sensing applications. Building on this line of work, we propose the structurally refined PCCAU-Net, which incorporates Pyramid Pathway Input, CoordConv convolution, and Dual-Input Cross Attention (DCA) modules. This amalgamation facilitates the more efficacious capture and integration of multi-scale features, thereby augmenting both the accuracy and robustness of road extraction.
Distinctive contributions and attributes of our model are as follows:
- Multi-scale feature and spatial context fusion: PCCAU-Net employs Pyramid Pathway Input and CoordConv convolution, achieving a seamless assimilation of features across multiple scales down to fine-grained levels, along with early spatial context recognition. This holistic design strategy assures precise road localization and extraction, even amidst complex settings.
- At the critical encoder stage, the model incorporates a Dual-Input Cross Attention (DCA) module. This not only augments the nimble integration of feature maps across various scales but also sharpens the model's focus on road center detection, effectively attenuating false positives induced by obstructions, entanglements, and noise.
- Across key performance metrics such as precision, recall, and Intersection-over-Union (IoU), PCCAU-Net demonstrates a clear supremacy over existing mainstream approaches, vindicating its potential applicability in the realm of remote sensing data processing.
This manuscript elucidates the design principles, empirical validations, and potential applications of this model, aiming to offer substantive insights for future endeavors in remote sensing data processing. The article is structured as follows: The introduction delineates the backdrop and challenges of road extraction in high-resolution remote sensing imagery and elucidates why traditional deep learning models may encounter difficulties. Subsequently, we present our solution, an innovative deep learning model named PCCAU-Net, endowed with Pyramid Pathway Input and Dual-Input Cross Attention (DCA) modules. In the Methods section, we provide a comprehensive description of the PCCAU-Net architecture and its operational mechanics, including its cornerstone components like CoordConv and DCA modules. In the Results section, we introduce the two road datasets selected for the experiment, the Massachusetts Roads Dataset and the DeepGlobe Road Dataset, which provide rich image data for road-related computer vision tasks. Then, through comparative analysis and ablation experiments, various performance indicators are compared to demonstrate the performance of our model on these two primary remote sensing datasets, confirming its effectiveness and superiority. The Discussion section provides a detailed review of the experimental results and analyzes the advantages and potential limitations of PCCAU-Net. Finally, the conclusion summarizes the key contributions and discoveries of this study, pointing to directions for future research and potential improvements.
In the realm of high-resolution remote sensing, the task of road extraction persistently presents a challenging conundrum. To mitigate this, we elaborate upon a novel and architecturally refined deep learning model, designated as PCCAU-Net. This model amalgamates a variety of cutting-edge techniques with the intent of capturing the intricate spatial contextual relationships inherent in high-resolution remote sensing imagery, thereby augmenting the accuracy of road extraction.
Designed explicitly to address the intrinsic complexities of high-resolution remote sensing imagery, PCCAU-Net is a deep learning model that features a unique multi-scale, multi-tiered strategy. Traditional deep learning approaches often encounter difficulties when confronted with diverse terrains, such as urban and rural landscapes.
In summary, PCCAU-Net not only surmounts the limitations inherent to traditional methods but also heralds a new era in the processing of high-resolution remote sensing imagery, offering an innovative and exceptionally effective solution. The PCCAU-Net network structure is shown in Figure 1.
While the traditional U-Net architecture is undeniably robust, it often compromises spatial granularity through its encoding stages, which stack serial convolutions and pooling layers to extract abstract semantic features.
CoordConv is an innovative convolutional variant that integrates spatial coordinates as additional channels, facilitating the direct interpretation of feature locations within a network.
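As a concrete illustration, the coordinate channels that CoordConv appends can be sketched as follows. This is a minimal NumPy sketch; the [-1, 1] normalization follows the original CoordConv formulation, and the exact convention used inside PCCAU-Net is an assumption.

```python
import numpy as np

def add_coord_channels(feature_map):
    """Append normalized x- and y-coordinate channels to a feature map.

    feature_map: array of shape (C, H, W). Returns (C + 2, H, W).
    Coordinates are scaled to [-1, 1] (an assumed normalization).
    """
    c, h, w = feature_map.shape
    ys = np.linspace(-1.0, 1.0, h).reshape(h, 1).repeat(w, axis=1)  # row index
    xs = np.linspace(-1.0, 1.0, w).reshape(1, w).repeat(h, axis=0)  # column index
    return np.concatenate([feature_map, xs[None], ys[None]], axis=0)
```

A subsequent ordinary convolution over the augmented tensor can then condition its filters on absolute position, which is what lets the network reason about where a feature lies in the image.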
In the intricate realm of remote sensing image analysis, developing ways of deftly and efficiently amalgamating features from disparate sources has become a cardinal research challenge. At the heart of PCCAU-Net is the Dual-Input Cross Attention (DCA) module, which answers this conundrum by adroitly harmonizing the strengths of deep learning and attention mechanisms. During the encoding phase, this module focuses on extracting and integrating features from two heterogeneous pyramid pathways, ensuring a cohesive amalgamation of deep features. Even more critically, at the decoding phase, the DCA module serves as a skip connection, elegantly bridging deep and shallow features, and thereby achieving spatial and semantic coherence between them. The architecture of the DCA module is depicted in Figure 3.
The intellectual underpinnings of the DCA module can be summarized as follows:
- Feature encoding and interaction: the DCA module initially performs specialized convolution operations on both input features, generating query (Q), key (K), and value (V) representations. Ingeniously concatenated, these features complement and interact with each other in the subsequent attention stage.
- Cross-attention fusion: the module dynamically determines the importance of each feature by computing weights based on the juxtaposed Q and K. This process ensures that DCA can pinpoint and integrate salient information from varied origins.
- Feature harmonization and dynamic fusion: the resultant features are first optimized via residual connections and subsequently subjected to layer normalization to ensure the continuity and stability of the information flow. Ultimately, the fusion attention mechanism calculates fusion weights and meticulously integrates Out_Input1 and Out_Input2, guaranteeing seamless spatial and semantic alignment.
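The cross-attention step in the second bullet can be sketched in NumPy as below. This is a single-head sketch under stated assumptions: the hypothetical projection matrices wq, wk, and wv stand in for the module's learned convolutions, and the layer normalization and fusion-attention stages are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x1, x2, wq, wk, wv):
    """Single-head cross-attention between two flattened feature maps.

    x1: (N1, d) tokens from one pyramid pathway (supplies the queries).
    x2: (N2, d) tokens from the other pathway (supplies keys and values).
    wq, wk, wv: (d, d) projection matrices (learned in the real model).
    Returns x1 attended over x2, with a residual connection.
    """
    q, k, v = x1 @ wq, x2 @ wk, x2 @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # rows sum to 1
    return x1 + attn @ v  # residual connection, as described for DCA
```

Swapping which input supplies Q versus K/V yields the two outputs (Out_Input1 and Out_Input2) that the later fusion stage combines.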
In summary, the DCA module endows PCCAU-Net with an unprecedented capability for the effective interaction and integration of features across multiple scales and origins within a unified framework. This design ethos not only augments the model's feature representation capabilities but also brings about hitherto unmatched precision and efficiency in complex remote sensing image analysis tasks.
In the manifold applications of deep learning, particularly within the Dual-Input Cross Attention (DCA) module of the PCCAU-Net architecture, the nuanced and efficient fusion of multi-source information emerges as a critical determinant of model performance. To this end, we introduce a fusion attention mechanism within the DCA module, which not only orchestrates an organic integration of information but also assures the model's timely responsiveness to diverse information sources. This combination is then converted into a two-dimensional space by a 1 × 1 convolutional layer to encode the importance of each source feature. Normalization via a Softmax operation ensures that the sum of feature weights at each position equals one. Armed with these dynamic weights, the module purposefully amalgamates the two input features, highlighting pivotal information in the final output. In summary, fusion attention serves as the DCA module's "arbiter", enabling the precise extraction and utilization of key information from multiple sources. This, in turn, elevates the sensitivity and accuracy in remote sensing image analysis, underscoring the considerable potential of deep learning in complex tasks. The schematic structure of the fusion attention module is illustrated in Figure 4.
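The per-pixel weighting described above can be sketched as follows. This is a minimal sketch in which the 1 × 1 convolution is represented by a plain weight matrix (w_score, a name introduced here for illustration) applied to the concatenated channels; the softmax over the two score channels guarantees that the fusion weights at each position sum to one.

```python
import numpy as np

def fusion_attention(f1, f2, w_score):
    """Fuse two feature maps with per-pixel softmax weights.

    f1, f2: (C, H, W) feature maps to be fused.
    w_score: (2, 2C) matrix standing in for the 1x1 convolution that maps
    the concatenated features to two score channels.
    """
    c, h, w = f1.shape
    cat = np.concatenate([f1, f2], axis=0).reshape(2 * c, -1)  # (2C, H*W)
    scores = w_score @ cat                                     # (2, H*W)
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    a = e / e.sum(axis=0, keepdims=True)                       # weights sum to 1
    fused = a[0] * f1.reshape(c, -1) + a[1] * f2.reshape(c, -1)
    return fused.reshape(c, h, w)
```

With zero scores the two sources are averaged; learned scores let the module lean on whichever source is more informative at each pixel.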
The loss function measures the error between a deep learning model's predictions and the true values; a smaller loss indicates a smaller discrepancy between the model's output and the target values. During training, changes in the loss function provide the basis for backpropagation, guiding parameter optimization and helping the model's output approximate the target values more closely. Dice loss is particularly suitable for addressing class imbalance. The Dice coefficient measures the consistency between the predicted segmentation area and the actual segmentation area, and the spatial relationship between positive and negative samples captured by Dice loss effectively handles sample imbalance.
The Dice coefficient and the Dice loss are defined as follows:

Dice(X, Y) = 2|X ∩ Y| / (|X| + |Y|) (1)

L_Dice = 1 − Dice(X, Y) (2)

In Equation (1), X denotes the set of pixels predicted as road, Y denotes the set of ground-truth road pixels, and |X ∩ Y| is the number of pixels on which the two agree.
The model in this study addresses a binary classification problem, performing pixel-wise classification to determine whether each pixel is road or background. Roads constituted approximately 10% of the total area in the remote sensing image. To balance the advantages and drawbacks of different loss functions, this paper combines Dice loss with Binary Cross-Entropy (BCE) to propose a new Combined Loss function. In the early stages of prediction, when the prediction results are significantly different from the true labels, the BCE loss can provide a considerable gradient to help the model converge quickly. As the prediction approaches the actual label, the gradient of the BCE loss decreases. In contrast, Dice loss can provide consistent gradients at this stage, leading to better optimization of the model. This feature helps to optimize the model globally and locally, improves the handling of class imbalance problems, and enhances the generalization ability and robustness of the model. The Combined Loss function is formulated as follows:
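The combination described above can be sketched as follows. This is a minimal sketch: the smoothing constant eps and the equal weighting of the two terms are assumptions for illustration, not the paper's exact hyperparameters.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-8):
    """Soft Dice loss: 1 - 2|X ∩ Y| / (|X| + |Y|), with smoothing eps."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def bce_loss(pred, target, eps=1e-8):
    """Binary cross-entropy averaged over pixels; predictions in (0, 1)."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def combined_loss(pred, target):
    """Sum of BCE and Dice losses (equal weighting assumed)."""
    return bce_loss(pred, target) + dice_loss(pred, target)
```

Early in training, a poor prediction is penalized mostly through the steep BCE term; near convergence, the Dice term keeps supplying useful gradient, matching the behavior described above.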
L_Combined = L_BCE + L_Dice (3)
In this section, we offer an in-depth analysis of the PCCAU-Net architecture, focusing particularly on its key constituents: the Pyramid Pathway Input, DCA (Dual-Input Cross Attention), and fusion attention modules. Initially, this section delineates the characteristics of the datasets employed and the experimental settings, thereby facilitating scientific reproducibility. Subsequently, we embark on a multifaceted quantitative and qualitative evaluation, pitting PCCAU-Net against various baseline models to obtain a comprehensive comparison. Furthermore, ablation studies elucidate the performance impact of each core component, substantiating both their efficacy and indispensability. The aim of this section is to furnish the reader with a holistic understanding of the capabilities and limitations of the proposed model.
In the diverse landscape of deep learning and computer vision, the fidelity and precision of model performance are predominantly dictated by the quality of datasets and meticulously engineered preprocessing pipelines.
The Massachusetts Roads Dataset comprises high-resolution (1m per pixel) remote sensing imagery, with a total of 1171 images, each of 1500 × 1500 pixels, accompanied by meticulously annotated ground truth labels. The dataset is partitioned into 1108 training images, 14 validation images, and 49 test images, each subset replete with precise label information. To align with the experimental framework of this study, the original 1500 × 1500-pixel images underwent preprocessing to yield 512 × 512-pixel image blocks, culminating in a corpus of 10,098 images. These images were further distributed into training, validation, and test sets at a ratio of 7:2:1.
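The crop from 1500 × 1500 images to 512 × 512 blocks can be sketched as below. The exact cropping strategy (stride and overlap) is not restated here, so this sketch assumes non-overlapping tiles whose last row and column are anchored to the image border, which allows a small overlap near the edges.

```python
def tile_coords(size, tile):
    """Top-left offsets for tiles covering a dimension of length `size`.

    Non-overlapping stride, except that the final tile is shifted back so it
    ends exactly at the border (assumed handling of the remainder).
    """
    coords = list(range(0, size - tile + 1, tile))
    if coords[-1] + tile < size:
        coords.append(size - tile)
    return coords

def tile_image(h, w, tile=512):
    """All (top, left) crop positions for an h x w image."""
    return [(y, x) for y in tile_coords(h, tile) for x in tile_coords(w, tile)]
```

Under this assumption, each 1500 × 1500 image yields a 3 × 3 grid of 512 × 512 blocks; the resulting patch corpus is then split 7:2:1 into training, validation, and test sets.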
Originating from the CVPR DeepGlobe 2018 Road Extraction Challenge, the DeepGlobe Road Dataset consists of 8570 high-resolution (0.5 m per pixel) images, each measuring 1024 × 1024 pixels. This collection is segmented into 6626 training images, 1243 validation images, and 1101 test images. In particular, only the binary labels for the training set were publicly accessible subsequent to the conclusion of the challenge; therefore, this study specifically opted for the 6626 training images and their corresponding labels for experimental implementation purposes. These images were also partitioned randomly in a 7:2:1 ratio.
To ensure the reliability and reproducibility of this study, this section offers a comprehensive description of crucial aspects of the experimental design. The experimental setup is bifurcated into two main components: the first part, "Computational Environment and Hyperparameters", provides an exhaustive outline of the hardware specifications, operating system, development environment, and pivotal training hyperparameters. The second part, "Validation Metrics", elaborates on the multiple criteria employed for assessing model performance. Together, these components form the scaffolding of the experimental design and confer robust assurance to the accuracy and consistency of subsequent results. By delineating these key parameters and evaluation criteria explicitly, we aim to furnish a thorough and meticulous experimental design that facilitates accurate replication by other researchers.
All experiments were conducted on a high-performance computer outfitted with an Intel
To systematically quantify and evaluate the performance of our road extraction model in classification tasks, we employed a confusion matrix as the principal evaluative metric. This matrix delineates four key parameters: True Positives (TP), representing the accurate classification of pixels as road; True Negatives (TN), indicating correctly identified non-road pixels; False Positives (FP), specifying erroneously classified road pixels; and False Negatives (FN), marking road pixels incorrectly categorized as non-road. Guided by these fundamental metrics, we computed three principal evaluative indices:
Precision (P): this index gauges the model's accuracy in labeling pixels as roads (i.e., true positives). The formula employed is as follows:
P = TP / (TP + FP) (4)
Recall (R): this is employed to assess the model's capacity to correctly identify pixels that are, in fact, roads (i.e., true positives). The computational formula is as follows:
R = TP / (TP + FN) (5)
Intersection over union (IoU): this metric assesses the overlap between the model's predicted region and the actual ground truth, offering a robust measure of spatial accuracy. The computation is as follows:
IoU = TP / (TP + FP + FN) (6)
Collectively, these indices provide a comprehensive and nuanced evaluation of model performance, focusing not only on identification accuracy but also on the model's ability to handle different classifications (i.e., true and false positives and negatives). Through this integrated evaluation framework, we aim to illuminate the model's strengths and weaknesses in road extraction tasks, thereby furnishing a holistic and in-depth understanding of its capabilities.
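The three indices can be computed directly from pixel-wise predictions, as in this short sketch of the confusion-matrix bookkeeping described above:

```python
def segmentation_metrics(pred, truth):
    """Precision, recall and IoU from pixel-wise binary labels.

    pred, truth: flat sequences of 0/1 labels (1 = road, 0 = background).
    """
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, iou
```

Note that true negatives do not enter any of the three formulas, which is precisely why these metrics remain informative when background pixels vastly outnumber road pixels.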
This chapter is devoted to an exhaustive inquiry into the practical utility of the PCCAU-Net model in the realm of road detection, achieved through systematic evaluation and analysis. To rigorously and comprehensively assess the performance of our model in the task of road extraction, this study engages a suite of benchmark models commonly employed in image segmentation, namely U-Net, ResU-Net, DeepLab V3+, and SDUNet.
Within this section, we undertake a comprehensive evaluation of the PCCAU-Net model using the Massachusetts Roads Dataset. The assessment is bifurcated into qualitative and quantitative dimensions. Firstly, the qualitative analysis employs visual comparisons to elucidate how PCCAU-Net fares against other advanced models in road extraction tasks. Secondly, the quantitative facet deploys a set of standard metrics to numerically gauge the model's performance. The synthesis of these evaluations aims to delineate both the advantages and potential of PCCAU-Net in high-resolution remote sensing road detection tasks. Qualitative results are illustrated in Figure 5, while quantitative outcomes are tabulated in Table 2.
Figure 5 showcases the performance of five models—U-Net, Residual U-Net (abbreviated as ResU-Net), DeepLab V3+, SDUNet, and our proposed PCCAU-Net—across five different scene captures from the Massachusetts Roads Dataset. These scenes represent a variety of intricate scenarios commonly encountered in remote sensing imagery. Scene 1 poses the particular challenge of severe spectral similarity between houses and roads, a hallmark issue in high-resolution remote sensing. While most models are susceptible to misclassifications in such contexts, PCCAU-Net demonstrates enhanced robustness. Scene 2 predominantly features a roundabout, which is traditionally a complicating factor in remote sensing analyses. PCCAU-Net manages to differentiate it accurately, despite its complex morphology. Scene 3 illustrates the issue of delineating dual carriageways, a task at which conventional models often fail by mistaking them for a single road; PCCAU-Net identifies them distinctly. Scene 4 examines the model's capability to recognize detailed roads, and PCCAU-Net excels in capturing these intricacies. Finally, Scene 5 focuses on a major urban road with shadow occlusions, where PCCAU-Net exhibits remarkable resilience. Collectively, Figure 5, through its multi-angular and multi-layered exposition, authentically reflects the versatility and adaptability of PCCAU-Net in dealing with a multitude of complex remote sensing scenarios. This observation not only validates the significant advantages of the PCCAU-Net model in practical applications but also accentuates its broad applicability in high-resolution remote sensing road detection tasks.
Table 2 offers a comprehensive quantitative evaluation of the performance of five advanced architectures—namely, U-Net, ResU-Net, DeepLab V3+, SDUNet, and our proposed PCCAU-Net—on the Massachusetts Roads Dataset. This tabulation employs three pivotal metrics—precision, recall, and intersection-over-union (IoU)—to furnish a robust framework for assessing the relative efficacy of these models in the task of road extraction.
- Precision: PCCAU-Net outperforms its peers with a precision rate of 83.395%, thereby indicating fewer false positives. The closest competitor, SDUNet, registers a precision of 80.094%, underscoring the substantial advantage of PCCAU-Net in minimizing identification errors.
- Recall: While SDUNet achieves the highest recall rate at 84.856%, indicating its prowess in detecting the majority of true road segments, PCCAU-Net trails closely with a recall of 84.076%. This demonstrates a well-calibrated balance between precision and recall, offering a more nuanced view of road attributes.
- IoU: PCCAU-Net leads with an IoU of 78.783%, narrowly eclipsing SDUNet's 78.625%. The superior IoU score for PCCAU-Net suggests its proficiency in delivering more accurate road segmentation, a critical aspect for applications demanding high precision.
In summary, PCCAU-Net manifests consistent superiority or near-top-tier performance across three key metrics, thereby substantiating its multifunctionality and robustness in high-resolution satellite imagery for road extraction. This exemplary performance validates the effectiveness of the attention mechanisms integrated within the DCA module, establishing PCCAU-Net as the preferred model for sophisticated road extraction tasks.
In this section, we delve into the experimental outcomes derived from deploying the PCCAU-Net model, along with several benchmark models, on the DeepGlobe Road Dataset. Similar to the approach adopted for the Massachusetts Roads Dataset, our evaluation incorporates both qualitative and quantitative analyses, illustrated in Figure 6 and Table 3, respectively.
Figure 6 presents road extraction results from five images, each embodying a unique challenge in diverse geographical settings: rural roads interspersed with natural scenery (Image 1), urban arterials suffering from significant occlusions (Image 2), suburban roads (Image 3), scenes with spectrally similar non-road objects (Image 4), and highly intricate and dense urban road networks (Image 5).
Our analysis reveals that PCCAU-Net exhibits exceptional capabilities in road identification and extraction across these varying contexts, notably excelling in complex settings featuring occlusions and spectral similarities between roads and other elements. This superior performance can be attributed to the fusion attention mechanism embedded within the model's Dual-Input Cross Attention (DCA) module, which dynamically integrates information from diverse sources, thereby enhancing both the accuracy and robustness of the model for road extraction tasks across a spectrum of intricate and varied environments.
The results presented in Table 3 unequivocally indicate that the PCCAU-Net model outperforms a range of benchmark models—including U-Net, ResU-Net, DeepLab V3+, and SDUNet—across key performance metrics on the DeepGlobe Road Dataset. Of particular salience is the marked enhancement PCCAU-Net achieves in terms of precision, recall, and Intersection-over-Union (IoU). Specifically, the model attains a precision of 87.424%, a recall of 88.293%, and an IoU as high as 80.113%, thereby eclipsing the performance of other models under comparable conditions. For instance, when juxtaposed with U-Net, PCCAU-Net surpasses the latter's IoU by nearly five percentage points, thereby revealing its exceptional robustness and accuracy in confronting a multitude of geographical challenges, such as road occlusion and intricate road networks. Although the DeepLab V3+ model also performs commendably in the realms of precision and recall, its IoU score falls short of surpassing that of PCCAU-Net, further accentuating the latter's superior performance in image segmentation tasks. This quantitative evidence underscores PCCAU-Net's prowess in tackling complex road extraction tasks, largely attributable to its meticulously designed model, particularly the synergistic integration of Pyramid Pathway Input and DCA modules. Such innovative design elements not only bolster the model's discriminative accuracy in intricate settings but also enhance its adaptability and resilience across varied geographical environments.
To rigorously evaluate both the efficacy of our proposed PCCAU-Net model and the contributions of its individual components, a series of ablation experiments were conducted. It is worth noting that while CoordConv convolution was employed in the first layer of the U-Net network, we forwent isolated ablation tests for this design, as its effectiveness had already been substantiated in the existing literature. The ablation tests were structured into five experimental setups as follows:
Setup a (U-Net): Serving as the baseline, this setup employs the canonical U-Net architecture. It provides a performance reference and forms the basis for comparison with other configurations.
Setup b (U-Net + Pyramid Pathway Input): Building upon the baseline U-Net, this setup incorporates Pyramid Pathway Input, designed to enhance the model's multi-scale feature processing for more precise road extraction.
Setup c (U-Net + DCA without fusion attention): In this configuration, a stripped-down DCA module, devoid of fusion attention, is integrated into the U-Net model. This is aimed at evaluating the fundamental contribution of the DCA module to the task of road extraction.
Setup d (U-Net + Complete DCA): This setup supplements the U-Net with a fully fledged DCA module, inclusive of fusion attention. The design is intended to appraise the comprehensive performance of the DCA module, and it serves as a comparative reference against Setup c to gauge the specific contributions of fusion attention.
Setup e (PCCAU-Net): This is the full configuration of our proposed model, which amalgamates both the Pyramid Pathway Input and the complete DCA module. The setup aims to demonstrate the optimal performance attainable when all design elements are integrated.
Through these five configurations, we not only methodically assess the effectiveness of both the Pyramid Pathway Input and the DCA module, but also gain comprehensive insights into their synergistic impact on the overall task of road extraction. The qualitative outcomes for these setups across two datasets are illustrated in Figure 7, and the quantitative findings are delineated in Table 4.
The letters "a, b, c, d, and e" in Figure 7 stand for experimental setups a, b, c, d, and e, respectively. In Figure 7, extraction outcomes from multiple approaches are depicted across four images, each presenting distinct challenges. The U-Net-based Scheme a offered a rudimentary performance across all images, particularly faltering in instances of intricate road interweaving and heterogeneity within the same spectral class. Scheme b, augmented with Pyramid Pathway Input, displayed nuanced improvements in detail extraction, but remained inadequate for handling same spectrum heterogeneity. Schemes c and d, both incorporating DCA modules, markedly ameliorated the extraction of complexly intertwined roads, with Scheme d showing a discernible edge in detail fidelity. Notably, Scheme e (PCCAU-Net) consistently outperformed others across all metrics, excelling particularly in scenarios involving complex road configurations and spectral variability.
In a quantitative analysis presented in Table 4, all five ablation study schemes were rigorously assessed on two distinct datasets—the Massachusetts Roads Dataset and the DeepGlobe Road Dataset. The letters a, b, c, d, and e shown in Table 4 have the same meanings as those in Figure 7. A detailed breakdown reveals that Scheme a (U-Net) registered the lowest aggregate performance across both datasets, especially in terms of Intersection-over-Union (IoU), displaying a conspicuous performance gap compared to other models. Scheme b (U-Net + Pyramid Pathway Input) achieved subtle yet consistent elevations across all evaluation metrics, underscoring the ancillary benefits of Pyramid Pathway Input. Scheme c (U-Net + DCA without fusion attention) further optimized performance in metrics like IoU, suggesting that even without fusion attention, DCA can bolster model efficacy. Scheme d (U-Net + full DCA) exhibited additional improvements across all metrics over Scheme c, highlighting the pivotal role of fusion attention within the DCA module. Scheme e (PCCAU-Net) led the charts in all metrics across both datasets, showcasing the conspicuous advantages of the proposed Pyramid Pathway Input and a comprehensive DCA module.
Collectively, the incremental improvements from Scheme a to e not only manifest in a gradual performance escalation across various metrics but also unveil the critical contributions of Pyramid Pathway Input and DCA modules—particularly including fusion attention—in optimizing model performance. These elements individually offer distinct advantages and, when synergistically integrated into the model, yield superlative performance, as conclusively validated in Scheme e. This comprehensive analysis robustly substantiates the theoretical and experimental underpinnings of this study.
In the field of deep learning, the number of parameters and the number of floating-point operations (FLOPs) serve as crucial measures for evaluating the complexity and computational demands of a model. A higher number of parameters might indicate a model's ability to learn more intricate patterns, but it could also lead to overfitting and extended training time. A comparison of the computational efficiency of different methods is shown in Table 5. Our model, the PCCAU-Net, demonstrated balanced attributes with respect to both parameter quantity and FLOPs. Therefore, while significantly improving segmentation performance, the PCCAU-Net did not introduce an excessive number of parameters or noticeably increase the training time. This demonstrates that our model not only prioritizes performance but also focuses on computational efficiency and model generalizability.
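For reference, the parameter and FLOP counts of a single convolutional layer follow standard formulas, sketched below. This is illustrative only and does not reproduce the figures in Table 5.

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """Learnable parameters in a k x k convolution: (k*k*c_in + bias) * c_out."""
    return (k * k * c_in + (1 if bias else 0)) * c_out

def conv2d_flops(c_in, c_out, k, h, w, bias=True):
    """Multiply-accumulate count for one forward pass at an h x w output."""
    per_pixel = k * k * c_in * c_out + (c_out if bias else 0)
    return per_pixel * h * w
```

Summing these quantities over every layer gives the model-level totals that Table 5 compares; note that FLOPs grow with the spatial resolution at which each layer operates, which is why early high-resolution layers often dominate the compute budget.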
In the current study, we successfully introduce PCCAU-Net, an advanced road extraction model, and subject it to comprehensive performance evaluation using two distinct road datasets from Massachusetts and DeepGlobe. When juxtaposed with leading algorithms like U-Net, ResU-Net, DeepLab V3+, and SDUNet, PCCAU-Net consistently outperforms them in key performance metrics such as precision, recall, and Intersection-over-Union (IoU). The efficacy of individual model components is investigated deeply through a series of ablation experiments. Notably, the Pyramid Pathway Input and DCA modules emerge as decisive factors in performance enhancement. Intriguingly, even a DCA module devoid of fusion attention contributes to model improvement, albeit less so than the fully equipped DCA, underscoring the critical role of fusion attention within the DCA framework. In real-world scenarios marked by spectral variations, shadow occlusions, and intricate road structures, PCCAU-Net demonstrates robust generalization capabilities. These claims are substantiated by qualitative results where the model reliably extracts roads, even in settings marred by complex details or significant occlusions.
In summary, this paper not only advances an efficient and robust algorithm for road extraction, but also corroborates its superiority and practicality through comparative and ablation studies. It thus offers a new, reliable toolset for automated road extraction and potentially for a broader range of remote sensing image analyses. However, the computational complexity of the model affects real-time applications, and in future research we will focus on optimizing the computational requirements while maintaining performance.
In this study, we present PCCAU-Net, an advanced deep learning architecture engineered for the efficient and accurate extraction of roads. Extensive evaluations on the Massachusetts Roads Dataset and the DeepGlobe Roads Dataset reveal that PCCAU-Net consistently outperforms extant cutting-edge methodologies, including U-Net, ResU-Net, DeepLab V3+, and SDUNet, across multiple evaluation metrics. The efficacy of PCCAU-Net is attributed to its innovative Pyramid Pathway Input and DCA modules, explicitly tailored to handle the complex and diverse scenarios inherent in remote sensing imagery. Ablation studies further validate the positive impact of these modules on the model's performance. However, it is imperative to acknowledge the model's computational complexity as a limitation, rendering it potentially unsuitable for real-time or high-throughput applications. Future work will be directed towards optimizing the model to meet these operational constraints. Overall, the findings not only advance the field of automated road extraction, but also offer valuable insights for additional applications in remote sensing image analysis.
Figure 1. PCCAU-Net network structure diagram. Image denotes the original-size image input; P1, P2, P3, and P4 denote the pyramid pathway inputs at four different scales; the paired blue and light-pink feature blocks denote the Concat connection operation.
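The pyramid pathway inputs P1 through P4 in Figure 1 are multi-scale versions of the input image. The paper's exact downsampling operator is not restated here, so the following is a hypothetical sketch using 2×2 average pooling on a single-channel image (list-of-lists layout, side lengths assumed even at each level):

```python
def image_pyramid(image, levels=4):
    """Build a list [P1, ..., P_levels] of progressively downsampled images.

    image: H x W grayscale image as a list of lists of floats.
    Each level halves both dimensions via 2x2 average pooling.
    """
    pyramid = [image]
    for _ in range(levels - 1):
        prev = pyramid[-1]
        h, w = len(prev) // 2, len(prev[0]) // 2
        pooled = [
            [
                (prev[2 * i][2 * j] + prev[2 * i][2 * j + 1]
                 + prev[2 * i + 1][2 * j] + prev[2 * i + 1][2 * j + 1]) / 4.0
                for j in range(w)
            ]
            for i in range(h)
        ]
        pyramid.append(pooled)
    return pyramid
```

An 8×8 input thus yields levels of size 8×8, 4×4, 2×2, and 1×1, mirroring the four-scale input pathway shown in the figure.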
Figure 2. Schematic diagram of the CoordConv operation, where Input denotes the input image and x and y denote the additional coordinate channels.
Figure 3. Schematic diagram of the DCA module structure, where Input 1 and Input 2 denote the two different inputs received by the module.
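Figure 3 shows the DCA module only schematically. As a rough illustration of the underlying idea of attending from one input's features to another's, a generic scaled dot-product cross-attention between two feature sets (a hypothetical sketch, not the authors' exact formulation, and omitting the learned projections and fusion attention of the full module) can be written as:

```python
import math

def cross_attention(q_feats, kv_feats):
    """Generic scaled dot-product cross-attention.

    q_feats: N x d query vectors (e.g. from Input 1).
    kv_feats: M x d key/value vectors (e.g. from Input 2).
    Returns N x d attended outputs.
    """
    d = len(q_feats[0])
    out = []
    for q in q_feats:
        # Scaled dot-product scores against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in kv_feats]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        out.append([sum(w * kv[j] for w, kv in zip(weights, kv_feats)) for j in range(d)])
    return out
```

When the two inputs are feature maps at different scales, as in the encoder stage described above, each spatial location of one map can in this way aggregate context from the whole of the other.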
Figure 4. Schematic diagram of the fusion attention module structure.
Figure 5. Extraction results of the five models on the Massachusetts Roads Dataset.
Figure 6. Extraction results of the five models on the DeepGlobe Road Dataset.
Figure 7. Test results of the ablation experiment schemes on the two public road datasets, where images 1 and 2 are from the Massachusetts Roads Dataset and images 3 and 4 are from the DeepGlobe Road Dataset.
Table 1. Hyperparameter settings in experimental training.
Hyperparameter          Value
Epoch                   100
Batch Size              4
Initial Learning Rate   1 × 10−4
ϵ                       1 × 10−8
Weight Decay            0
Beta 1                  0.9
Beta 2                  0.999
Table 2. Quantitative evaluation of the five models on the Massachusetts Roads Dataset.
Model         Precision (%)   Recall (%)   IoU (%)
U-Net         78.823          83.764       77.672
ResUNet       79.297          83.253       77.902
DeepLab V3+   79.567          83.909       77.457
SDUNet        80.094          84.856       78.625
PCCAU-Net     83.395          84.076       78.783
Table 3. Quantitative evaluation of the five models on the DeepGlobe Road Dataset.
Model         Precision (%)   Recall (%)   IoU (%)
U-Net         83.436          84.454       75.365
ResUNet       86.098          86.085       77.819
DeepLab V3+   85.489          85.653       78.158
SDUNet        86.061          85.209       77.238
PCCAU-Net     87.424          88.293       80.113
Table 4. Quantitative results of the five ablation experiment schemes on the two datasets.
        Massachusetts Roads Dataset              DeepGlobe Road Dataset
Scheme  Precision (%)  Recall (%)  IoU (%)       Precision (%)  Recall (%)  IoU (%)
a       78.823         83.764      77.672        83.436         84.454      75.365
b       79.182         84.035      77.768        84.273         85.145      76.092
c       80.357         83.913      77.937        84.224         86.047      77.642
d       81.136         83.849      78.217        85.423         86.864      78.652
e       83.395         84.076      78.783        87.424         88.293      80.113
Table 5. Comparison of the computational efficiency of the different methods.
Model         Parameters (M)   FLOPs (GFLOPs)
U-Net         29.97            5.63
ResUNet       25.57            5.42
DeepLab V3+   5.88             6.60
SDUNet        25.90            62.30
PCCAU-Net     24.35            60.17
Conceptualization—X.X.; Methodology—C.R. and Y.Z.; Software—C.R., A.Y. and C.D.; Validation—A.Y. and J.L.; Formal analysis—X.X., Y.Z. and Y.L.; Resources—X.X. and C.R.; Data curation—X.X.; Writing—original draft, X.X. All authors have read and agreed to the published version of the manuscript.
The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.
The authors declare no conflict of interest.
By Xiaoqin Xue; Chao Ren; Anchao Yin; Ying Zhou; Yuanyuan Liu; Cong Ding and Jiakai Lu