3.2 LiDAR point clouds as input


LiDAR-based approaches have dominated outdoor semantic scene completion due to the inherent 3D nature and geometric accuracy of LiDAR sensors. Recent point-cloud-based methods demonstrate the benefits of leveraging the sparse 3D point representations provided by LiDAR. These methods apply 3D convolutional or graph neural networks directly on point clouds to aggregate context and produce dense completed voxel grids or mesh representations. Some techniques incorporate projected image features or depth maps to provide complementary information: the sparse 3D points provide precise geometric cues, while RGB or depth data contributes to semantic understanding.


LiDAR's wide horizontal field of view provides complete scene observation that aids completion, in contrast to limited camera frustums. The parallax from LiDAR's perspective also helps disambiguate objects positioned at different depths. Such geometric reasoning uniquely suits LiDAR to complex 3D reasoning tasks like SSC. We now briefly introduce five methodologies that use point clouds directly as input: three are widely known, and the other two are recent state-of-the-art models in SSC.
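To make the input representation concrete, the short sketch below voxelizes a raw LiDAR sweep into the binary occupancy grid that these networks typically consume. The voxel size and scene extent are illustrative values loosely following the SemanticKITTI SSC setup (a 256 x 256 x 32 grid of 0.2 m voxels); the function name and defaults are ours, not taken from any of the papers discussed below.

    import numpy as np

    def voxelize(points, voxel_size=0.2,
                 grid_range=((0.0, 51.2), (-25.6, 25.6), (-2.0, 4.4))):
        """Discretise an (N, 3) LiDAR point cloud into a binary occupancy
        grid. Defaults give a 256 x 256 x 32 volume of 0.2 m voxels."""
        mins = np.array([r[0] for r in grid_range])
        maxs = np.array([r[1] for r in grid_range])
        dims = np.round((maxs - mins) / voxel_size).astype(int)

        # Keep only points inside the completion volume.
        mask = np.all((points >= mins) & (points < maxs), axis=1)
        idx = ((points[mask] - mins) / voxel_size).astype(int)

        grid = np.zeros(dims, dtype=np.uint8)
        grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1   # mark occupied voxels
        return grid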

LMSCNet [Roldao et al., 2020]. LMSCNet is a deep-learning method for multiscale 3D semantic scene completion from sparse LiDAR scans proposed by Roldao et al. [2020]. It employs a mix of 2D and 3D convolutions in a UNet-style architecture to generate dense voxel predictions for both occupancy and semantic labels. A key advantage of LMSCNet is its lightweight design, which enables fast inference and makes it suitable for applications like mobile robotics. Specifically, using 2D convolutions reduces computational complexity, while multiple outputs at different scales allow coarse scene analysis at over 300 FPS at the 1:8 scale ratio. At the time of release, the network performed excellently on semantic completion metrics while outperforming previous methods on occupancy metrics on the SemanticKITTI dataset. A relative disadvantage is that some 3D spatial connectivity may be lost with the 2D convolutions, and the predictions on small objects are highly unreliable.

Overall, LMSCNet provides an excellent trade-off between accuracy and efficiency for 3D semantic scene completion. Its multiscale capability, in particular, makes it well-suited for applications requiring real-time coarse understanding of spatial layout and semantics.
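As a rough illustration of why LMSCNet is so light, the toy module below (ours, not the released architecture) folds the vertical axis of the voxel grid into the channel dimension so the encoder needs only 2D convolutions, and it attaches a classification head at two scales in the spirit of the paper's multiscale outputs.

    import torch
    import torch.nn as nn

    class TinyLMSCNet(nn.Module):
        """Toy illustration of LMSCNet's core trick (not the released
        model): the height axis is treated as the channel dimension, so
        the encoder runs cheap 2D convolutions, with one head per scale."""

        def __init__(self, height=32, n_classes=20):
            super().__init__()
            self.n_classes = n_classes
            self.enc1 = nn.Sequential(nn.Conv2d(height, 64, 3, padding=1),
                                      nn.ReLU())
            self.pool = nn.MaxPool2d(2)   # 1:2 scale
            self.enc2 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1),
                                      nn.ReLU())
            # Each head predicts n_classes logits per voxel of its scale.
            self.head_full = nn.Conv2d(64, n_classes * height, 1)
            self.head_half = nn.Conv2d(128, n_classes * (height // 2), 1)

        def forward(self, occ):                       # occ: (B, Z, X, Y)
            f1 = self.enc1(occ)                       # full resolution
            f2 = self.enc2(self.pool(f1))             # 1:2 resolution
            b, _, x, y = f1.shape
            full = self.head_full(f1).view(b, self.n_classes, -1, x, y)
            xh, yh = f2.shape[-2:]
            half = self.head_half(f2).view(b, self.n_classes, -1, xh, yh)
            return full, half                         # multiscale logits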

JS3C-Net [Yan et al., 2021]. JS3C-Net is a framework proposed for enhanced single-sweep LiDAR point cloud semantic segmentation by exploiting learned contextual shape priors from semantic scene completion. The key advantage of JS3C-Net is its ability to overcome the performance limitations posed by the sparse and noisy nature of single-sweep LiDAR data by incorporating richer shape information from adjacent LiDAR frames. Specifically, it consists of three main components: a semantic segmentation module, a semantic scene completion module, and an interaction module. By merging dozens of consecutive LiDAR frames, the semantic scene completion module is trained to generate complete semantic voxel representations, capturing compelling shape priors. The interaction module then allows for implicit knowledge transfer between the incomplete point cloud segmentation and the complete voxel completion, enabling mutual performance improvements. A highlight is that the scene completion components can be discarded after training without impacting segmentation inference speed. Quantitative results on SemanticKITTI demonstrate superior segmentation and completion performance over other methods. A limitation is the need for sequential LiDAR data spanning multiple frames during training, which may not always be readily available. In general, JS3C-Net pushes the boundaries of sparse LiDAR processing by exploiting sequence-based shape priors within a multi-task learning paradigm.
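The dense training targets come from registering consecutive sweeps into a common frame. A minimal sketch of this accumulation step is given below, assuming KITTI-style 4x4 ego poses; JS3C-Net's actual merging procedure may differ in details such as the handling of moving objects.

    import numpy as np

    def accumulate_frames(frames, poses):
        """Merge consecutive LiDAR sweeps into one dense cloud expressed
        in the coordinate frame of the first sweep. `frames` is a list of
        (N_i, 3) arrays and `poses` a list of 4x4 ego poses."""
        ref_inv = np.linalg.inv(poses[0])
        merged = []
        for pts, pose in zip(frames, poses):
            homo = np.hstack([pts, np.ones((len(pts), 1))])    # homogeneous
            merged.append((homo @ (ref_inv @ pose).T)[:, :3])  # to frame 0
        return np.vstack(merged)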

SSA-SC [Yang et al., 2021]. SSA-SC is an end-to-end semantic-segmentation-assisted scene completion network for LiDAR point clouds proposed by Yang et al. [2021]. It consists of a 2D completion branch and a 3D semantic segmentation branch. The key idea is to leverage the complementary information between the BEV map and 3D voxels from the two branches to produce reasonable completion results for outdoor 3D scenes. A key advantage of SSA-SC is that, by adopting a BEV representation and 3D sparse convolution, it benefits from lower computational costs while maintaining effective feature representation. The network hierarchically merges features from the segmentation branch into the completion branch to provide semantic information and constraints that aid the completion task. Extensive experiments on the SemanticKITTI dataset demonstrate that SSA-SC performs well on semantic scene completion metrics with low latency. A limitation is that the performance on small object categories is not as strong as that of methods with specialized geometric losses.

Overall, SSA-SC explores an effective combination of 2D and 3D networks for efficient semantic scene completion, achieving strong completion metrics while retaining real-time inference speeds.
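A minimal sketch of the 2D/3D fusion idea, written with dense tensors for readability (SSA-SC itself uses sparse convolution in the 3D branch), is shown below: features from the 3D segmentation branch are pooled along height into a BEV map and mixed with the 2D completion branch's features. The function and channel widths are our own illustration.

    import torch
    import torch.nn as nn

    def fuse_bev(voxel_feats, bev_feats, proj):
        """Collapse (B, C, X, Y, Z) 3D features along height into a BEV
        map, concatenate with the 2D branch's (B, C, X, Y) features, and
        mix them with a 1x1 convolution."""
        bev_from_3d = voxel_feats.max(dim=-1).values   # height pooling
        return proj(torch.cat([bev_from_3d, bev_feats], dim=1))

    # Hypothetical usage with matching channel widths:
    proj = nn.Conv2d(64 + 64, 64, kernel_size=1)
    fused = fuse_bev(torch.randn(1, 64, 128, 128, 16),
                     torch.randn(1, 64, 128, 128), proj)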

S3CNet [Cheng et al., 2021]. S3CNet is a sparse semantic scene completion network proposed by Cheng et al. [2021] for reconstructing large outdoor driving scenes from single LiDAR scans. It utilizes sparse convolutional neural networks to efficiently process sparse 3D point clouds and jointly solve the coupled tasks of scene completion and semantic segmentation. Key advantages of S3CNet are its ability to handle large outdoor scenes characterized by sparsity and occlusion through specifically designed sparse tensor losses and post-processing refinement. It also outperforms prior dense convolutional approaches that rely on indoor RGB-D data. On the SemanticKITTI benchmark, S3CNet achieves strong performance with a mean IoU of 29.5%. A 2D variant is also introduced to complement the 3D network predictions. However, a drawback is that S3CNet still struggles with distant small objects: its multi-view fusion strategy cannot completely offset the information loss from exponential sparsity growth. Real-time performance is also currently unfeasible due to computational demands. Overall, S3CNet represents an important step towards robust and efficient 3D scene understanding essential for autonomous navigation. Extensions to improve speed, fusion techniques, and spatial encoding show promise in overcoming these limitations.
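The efficiency of S3CNet rests on sparse convolution, which computes features only at occupied voxels. The example below uses MinkowskiEngine as one such library (the paper's exact software stack may differ) to run a single sparse 3D convolution over a random occupancy pattern.

    import torch
    import MinkowskiEngine as ME

    # 1000 random occupied voxels inside a 256^3 grid; the batch index is
    # prepended to the coordinates as MinkowskiEngine expects.
    coords = torch.randint(0, 256, (1000, 3)).int()
    coords = torch.cat([torch.zeros(1000, 1, dtype=torch.int32), coords], dim=1)
    feats = torch.ones(1000, 1)   # a single occupancy feature per voxel

    x = ME.SparseTensor(features=feats, coordinates=coords)
    conv = ME.MinkowskiConvolution(in_channels=1, out_channels=32,
                                   kernel_size=3, stride=1, dimension=3)
    y = conv(x)   # output is still sparse: defined only at active sites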

SSC-RS [Mei et al., 2023]. SSC-RS is a neural network proposed for semantic scene completion of LiDAR point clouds in autonomous driving systems. It takes a novel perspective of representation separation and BEV fusion to address this task. Specifically, SSC-RS uses two separate branches with deep supervision to explicitly disentangle the learning of semantic context and geometric structure representations. It also leverages a lightweight BEV fusion network to efficiently aggregate these representations captured at multiple scales. A key component is the Adaptive Representation Fusion module, designed to selectively fuse informative cues from the two representations. Experiments show that SSC-RS achieves state-of-the-art performance on the SemanticKITTI benchmark while running in real time (nearly 17 FPS). The explicit disentanglement of representations is demonstrated to facilitate optimization and improve accuracy. The BEV fusion paradigm also allows efficient computation and memory usage compared to dense 3D networks. One limitation is that SSC-RS does not explicitly model local geometry, leading to lower accuracy on small objects. Future work can focus on incorporating local shape priors to further boost performance. Overall, SSC-RS explores a novel and effective approach to semantic scene completion, with advantages in representation learning and efficiency.
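To illustrate what adaptive fusion can look like, the sketch below implements a simple learned gate that blends semantic and geometric BEV features per location. This is our own simplification, not the exact Adaptive Representation Fusion module from the paper.

    import torch
    import torch.nn as nn

    class AdaptiveFusion(nn.Module):
        """Gated fusion in the spirit of SSC-RS: a learned sigmoid gate
        decides, per BEV location, how much to take from the semantic
        branch versus the geometric branch (illustrative design)."""

        def __init__(self, channels):
            super().__init__()
            self.gate = nn.Sequential(
                nn.Conv2d(2 * channels, channels, kernel_size=1),
                nn.Sigmoid())

        def forward(self, sem_feats, geo_feats):       # both (B, C, X, Y)
            g = self.gate(torch.cat([sem_feats, geo_feats], dim=1))
            return g * sem_feats + (1.0 - g) * geo_feats   # convex blend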


SCPNet [Xia et al., 2023]. SCPNet is a semantic scene completion network for LiDAR point clouds proposed by Xia et al. [2023]. It focuses on addressing key challenges in outdoor semantic scene completion, including sparse and incomplete inputs, numerous objects across varying scales, and label noise from dynamic objects. SCPNet introduces three main solutions: first, redesigning the completion subnetwork to aggregate multi-scale features without lossy downsampling, maintaining sparsity for efficiency while maximizing information retention from raw point clouds; second, distilling dense semantic knowledge from a multi-frame teacher model to a single-frame student model using a novel pairwise semantic similarity loss termed DSKD; third, rectifying completion labels by removing long noisy traces left by dynamic objects using off-the-shelf panoptic labels. Experiments on the SemanticKITTI and SemanticPOSS datasets show that SCPNet achieves state-of-the-art performance, outperforming a top method, S3CNet, by 7.2 mIoU on SemanticKITTI. The learned features also transfer well to segmentation tasks. A limitation is the computational overhead introduced by the redesigned completion subnetwork. Overall, SCPNet sets a new state of the art for semantic scene completion through architectural innovations and objectives tailor-made for effectively learning from sparse outdoor LiDAR scans.
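The distillation idea can be illustrated with a short sketch: rather than matching teacher and student features directly, the student is trained so that the pairwise similarities among its voxel features match the teacher's. The implementation below is a simplified stand-in for the DSKD loss described in the paper, not its exact formulation.

    import torch
    import torch.nn.functional as F

    def pairwise_similarity_distill(student_feats, teacher_feats):
        """Simplified pairwise-similarity distillation. Both inputs are
        (N, C) features taken from the same N voxel positions; N should
        stay small (e.g. a random subsample), as the matrices are N x N."""
        s = F.normalize(student_feats, dim=1)
        t = F.normalize(teacher_feats, dim=1)
        sim_s = s @ s.t()                 # (N, N) student similarities
        sim_t = t @ t.t()                 # (N, N) teacher similarities
        return F.mse_loss(sim_s, sim_t)   # align relational knowledge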
