The main target of the SSC task is to reconstruct a complete 3D scene occupancy with semantic labels assigned to each voxel from just a sparse and incomplete point cloud input.
Motivation
Self-driving technology is a revolutionary change in the automotive industry. It enables vehicles to drive and operate independently, without human input, using advanced sensors, artificial intelligence, and sophisticated algorithms. The main benefits of self-driving technology are improving road safety, reducing traffic congestion, and making transportation more accessible. However, one of the key challenges in achieving full autonomy is how the vehicle can perceive and understand its environment correctly. Sensors like cameras and LiDAR can capture some aspects of the surroundings but cannot see everything in a 360-degree view. 3D semantic scene completion (SSC) solves this problem by creating a complete environmental model from partial sensory data.
3D semantic scene completion integrates 3D scene completion and 3D semantic segmentation. 3D scene completion is an important problem in computer vision that tries to infer the complete 3D structure of an environment from incomplete observations. 3D semantic segmentation goes further and assigns semantic labels like "building", "road", "vegetation", etc., to different 3D elements. This ability is crucial for self-driving cars, allowing them to detect and understand objects beyond their direct vision. Some advantages of 3D semantic scene completion are:
• Filling the gaps. Sensors like cameras or LiDAR might not capture everything due to occlusions from other vehicles, buildings, or even the self-driving car itself. SSC uses AI to analyze the captured data and predict what's likely in the unseen areas, providing a more complete picture of the scene.
• Understanding the scene. SSC goes beyond just seeing objects. It can classify those objects, recognizing a car, pedestrian, or traffic sign. This semantic understanding helps the car make informed decisions about navigation and potential hazards.
• Reduces sensor load. By inferring hidden parts of a scene, scene completion reduces reliance on dense sensor coverage of the environment. This enables more efficient sensor suites.
• Improved safety. By predicting what's around corners or beyond obstacles, SSC allows autonomous vehicles to react more proactively to potential dangers. This can significantly improve safety on the road.
• Simplifies planning and control. Operating on semantically meaningful scene representations enables intuitive formulations of driving policies based on human logic. This makes the development and verification of autonomous vehicles easier.
Semantic scene completion represents a significant stride toward replicating human-like scene understanding, contributing to the enhancement of safety, adaptability, and efficiency in autonomous driving systems. While this technology holds great promise, the realm of outdoor semantic scene completion is still in its early stages and presents unique challenges. The inherent complexity and vast scale of outdoor environments introduce the demand for innovative solutions.
In the pursuit of perfecting outdoor semantic scene completion, researchers have a wide array of opportunities to explore and contribute to this evolving field. The dynamic nature of outdoor scenes, with factors such as variable lighting conditions, diverse terrains, and the unpredictability of traffic scenarios, requires comprehensive investigations. By delving into these challenges, we aim to refine existing methodologies and develop novel approaches to address the intricacies posed by the outdoor environment.
Goals
In this Capstone Project, we first examine recent methodologies employed in addressing SSC problems for autonomous driving, aiming to provide a comprehensive and critical overview of the ongoing progress in SSC development. Furthermore, we also delve into recent state-of-the-art methods, conducting an in-depth analysis to identify their strengths and weaknesses. The ultimate objective of this undertaking is to gain a thorough understanding of the current success in SSC research, as well as the persisting challenges. Additionally, we suggest a novel method that resolves some of the remaining limitations of the prior research.
Scopes
For this capstone project, we define our scopes as follows:
• Perform a literature review of recent papers on 3D semantic scene completion tailored to autonomous driving scenarios.
• Analyze these works' core technical approaches, innovations, and results. Look for common themes and trends.
• Analyze previous approaches for SSC tasks. Evaluate and compare performance on relevant datasets.
• Propose our methods to address the varying point density in point cloud data while ensuring that the model remains robust and efficient.
Report Structure
Voxel grids
Voxels have emerged as a critical volumetric representation for capturing 3D spatial data. A voxel represents a value on a regular grid in three-dimensional space, analogous to a pixel, which represents 2D image data.
By stacking voxels, complex volumetric objects and scenes can be constructed. Each voxel encodes one or more property values, such as color, transparency, density, or semantic classification. Volume rendering techniques then allow projection and visualization of internal structures.
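To make the idea concrete, the sketch below voxelizes a small point cloud into a binary occupancy grid; the resolution and bounds are illustrative choices, not values used in this report.

```python
import numpy as np

def voxelize(points: np.ndarray, voxel_size: float,
             bounds_min: np.ndarray, grid_shape: tuple) -> np.ndarray:
    """Convert an (N, 3) point cloud into a binary occupancy grid."""
    # Map each point to an integer voxel index.
    idx = np.floor((points - bounds_min) / voxel_size).astype(int)
    # Keep only points that fall inside the grid.
    valid = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    grid = np.zeros(grid_shape, dtype=bool)
    grid[tuple(idx[valid].T)] = True
    return grid

# Toy example: 1000 random points in a 10 m cube, 0.2 m voxels.
pts = np.random.rand(1000, 3) * 10.0
occupancy = voxelize(pts, voxel_size=0.2, bounds_min=np.zeros(3), grid_shape=(50, 50, 50))
print(occupancy.sum(), "occupied voxels out of", occupancy.size)
```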
Advantages. Voxel representations offer several significant advantages over meshes or point clouds in analyzing 3D environments. The regular structure provides an efficient and intuitive format for the storage, processing, and visualization of data. Neighborhood information is implicitly encoded, allowing the propagation of signals across space. Voxels also uniformly sample environments, providing spatial occupancy data even without well-defined surfaces. Software and hardware acceleration via established texture mapping techniques further enable real-time rendering.
Drawbacks. Voxels also come with some major limitations. The need for discretization and sampling at fixed resolutions inherently limits precision, which lowers the ability to represent fine details compared to point clouds. Voxel storage costs scale cubically with the resolution, presenting memory and processing bottlenecks for representing large scenes. However, recent works have developed compressed formats and octree representations [Li, Sima, Dai, Wang, Lu, Wang, Zeng, Li, Yang, Deng et al., 2023] to allow greater scalability while retaining key voxelization advantages.
Overall, the strengths of contextual representation and volumetric occupancy modeling make voxels well-suited for semantic analysis and completion tasks where label prediction is vital across full 3D spaces.
Signed distance function (SDF)
Signed distance functions (SDFs) have emerged as a compact implicit surface representation for 3D shape modeling and reconstruction. Unlike meshes or voxels, which define space occupancy explicitly, SDFs represent surfaces implicitly as the boundary of a continuous scalar field. Precisely, the SDF measures the shortest signed Euclidean distance from any spatial point to the closest surface. Points outside shapes have positive distance values, while negative values indicate interior points. The surface itself lies at the zero isocontour.
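As a minimal illustration, the sphere is one of the few shapes with a closed-form SDF; the sketch below evaluates it at a few query points (the radius and queries are arbitrary examples, not values from this report).

```python
import numpy as np

def sphere_sdf(query: np.ndarray, center: np.ndarray, radius: float) -> np.ndarray:
    """Signed distance from query points (N, 3) to a sphere surface.
    Positive outside, negative inside, zero exactly on the surface."""
    return np.linalg.norm(query - center, axis=1) - radius

queries = np.array([[0.0, 0.0, 0.0],   # center of the sphere -> -radius (inside)
                    [1.0, 0.0, 0.0],   # on the surface -> 0
                    [2.0, 0.0, 0.0]])  # outside -> +1
print(sphere_sdf(queries, center=np.zeros(3), radius=1.0))  # [-1.  0.  1.]
```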
Advantages. SDFs offer several major advantages as a 3D shape representation. They provide a continuous representation, allowing high precision and resolution independence, in contrast to discretized voxels. SDFs also intrinsically encode global interior/exterior context for robust surface positioning. Compact storage as a scalar field enables memory efficiency and fast computations using field operations.
Drawbacks. SDFs, however, have traditionally struggled to represent complex topological structures. Recent works have aimed to address this through composition mechanisms and hierarchical modeling.
Graph-based representation
Graph-based representations have emerged as an intriguing direction for modeling 3D scenes. Rather than explicit geometrical representations such as meshes or voxels, graph networks encode environments as nodes and relations in an abstract space. Scene components such as objects, surfaces, or semantic labels can correspond to nodes, connected by edges representing positional, semantic, or other relationships. Built on graph theory principles, modeling in network space allows inference of global properties through localized node messaging updates.
Advantages. Several advantages arise from graph modeling of 3D data. It offers a lightweight and flexible scene encoding without requiring dense voxels or meshes constrained to uniform resolution or topological structures. Graph networks also provide an intuitive format for the propagation of information and relationships across larger spaces and semantic contexts. By directly operating in the spectral domain, graph networks enable efficient data completion and smoothing.
Drawbacks. Graph-based representations have proven to be a powerful tool. However, challenges still need to be addressed when it comes to consistently reconstructing geometrical information from abstract graph spaces. This is an active area of research, as scientists and researchers are working towards finding solutions to these challenges.
Bird’s-Eye-View (BEV) perception
Bird's-eye-view (BEV) representations provide an intriguing alternative to conventional 3D scene representations like meshes or voxels. Rather than modeling detailed geometry, a bird's-eye-view encodes scenes as 2D grids from a fixed overhead perspective. Each grid cell stores attributes like occupancy, height, semantics, or color to represent structures and spaces in aggregate. This compact encoding captures key aspects of scenes relevant to navigation and planning tasks.
Advantages. Bird's-eye-views offer several major advantages over 3D representations that specifically suit outdoor completion challenges. Firstly, BEV perception alleviates occlusion and scale problems, making it easier to recognize and localize objects. Secondly, the 2D overhead view provides straightforward spatial reasoning and path planning unhindered by perspective complexities. Thirdly, grid-based encoding allows efficient convolution and signal propagation by exploiting regular structures. Additionally, BEV perception is highly compatible with LiDAR-based methods, showcasing superior performance and industry-friendly deployment. Furthermore, advancements have been made in incorporating multi-camera inputs, allowing for easy feature fusion from different modalities.

Drawbacks. However, there are still challenges and limitations in BEV perception that require further innovation. A primary weakness is the loss of vertical structure detail during 3D-to-2D projection. With explicit height discarded, BEV completion networks cannot reliably reconstruct elevated structures like poles, trees, and walls, which provide vital contextual cues. Occlusion ambiguities also arise frequently from above, with occupied spaces hidden under foliage or distant structures. The limited field of view also introduces perspective distortion, causing spatial misalignments, especially for regions far from the projection center point. Furthermore, compressing rich 3D sensor inputs like LiDAR depth scans or RGB images into 2D grids removes measurement cues critical for geometry and semantics reconstruction against observations. Finally, planarity assumptions in BEVs fail for sloped terrain, resulting in errors when aggregating occupancy and semantics for inclined surfaces.

Fusion strategies

A key challenge in autonomous driving tasks such as SSC is effectively fusing information from different modalities, for instance, RGB images, depth data, surface normals, etc., to produce accurate and coherent scene completions. Different fusion strategies have been explored in recent literature to integrate multi-modal sensory cues:
Early fusion. In early or direct fusion techniques, features from multiple modalities are concatenated into a joint representation before feeding into completion modules. For example, Song et al. [2017] extract geometric features from depth and surface normal maps, which are concatenated channel-wise with RGB features from an encoder-decoder network to predict semantic voxel completions of indoor scenes. While conceptually simple, these techniques do not model interactions between modalities.
Late fusion. Late or decision-level fusion combines predictions from separate networks, each processing one modality. For instance, Dai et al. [2018] train separate RGB and depth completion networks and fuse their outputs using a weighted average to produce the final indoor completed semantics. However, these approaches do not facilitate cross-modal interactions at the feature level.
Middle fusion. Middle fusion strikes a balance by fusing information continuously throughout the completion network, either in a single stage or multiple stages (as depicted in Figure 2.2(c)). For instance, Mei et al. [2023] propose a joint fusion method called SSC-RS that separates input representations while enabling cross-modal fusion across stages. They first extract distinct geometric and semantic embeddings from LiDAR inputs using separate specialized encoders. The embeddings are fused with the BEV features via an adaptive representation fusion module to enrich the final representation with both modal cues. Experiments demonstrate that middle fusion outperforms single-stage early and late fusion alternatives and is highly efficient for SSC tasks.
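The sketch below contrasts the three strategies at the tensor level; the two-modality encoders, heads, and shapes are hypothetical placeholders for illustration, not the architectures of the cited works.

```python
import torch
import torch.nn as nn

# Hypothetical single-modality encoders and prediction heads.
rgb_enc, depth_enc = nn.Conv2d(3, 16, 3, padding=1), nn.Conv2d(1, 16, 3, padding=1)
head_early = nn.Conv2d(4, 8, 1)          # consumes the concatenated raw modalities (3+1 ch.)
head_rgb, head_depth = nn.Conv2d(16, 8, 1), nn.Conv2d(16, 8, 1)
fuse_mid = nn.Conv2d(32, 8, 1)           # fuses intermediate features (16+16 ch.)

rgb, depth = torch.rand(1, 3, 64, 64), torch.rand(1, 1, 64, 64)

# Early fusion: concatenate inputs before any network sees them.
early = head_early(torch.cat([rgb, depth], dim=1))

# Late fusion: run separate networks and average their predictions.
late = 0.5 * head_rgb(rgb_enc(rgb)) + 0.5 * head_depth(depth_enc(depth))

# Middle fusion: exchange features partway through the network.
middle = fuse_mid(torch.cat([rgb_enc(rgb), depth_enc(depth)], dim=1))
print(early.shape, late.shape, middle.shape)
```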
In summary, joint fusion techniques that allow representations and predictions from different sensors to interact through cross-modal connections have shown promising performance by enabling synergistic integration of multi-modal data for high-quality semantic scene completion. Continued research on robust and efficient fusion strategies is an active area of investigation in this domain.

Sparse convolutional neural network
Semantic scene completion often involves processing 3D data, which can be memory- and computationally intensive for standard convolutional networks since most of the voxels of the input are empty. Recent works have utilized 3D sparse convolutional networks to improve efficiency.
Standard 3D convolutions densely compute features for all voxels in the input volume. In contrast, sparse convolutions only activate computations on occupied voxels indicated by an accompanying sparse tensor. This reduces redundant calculations on empty space, saving substantial memory and FLOPs.
A common sparse data structure is the hash table, which uses a hash function to encode voxel coordinates for efficient lookup and feature retrieval. Graham et al. [2018] design the sparse convolutional network Submanifold Sparse ConvNet (SSCN), which employs hash tables to determine activated voxels for computation. On scene completion benchmarks, SSCN reduces memory usage by up to 100x and speeds up inference by 2-4x over regular 3D CNNs with minimal accuracy drop.
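To illustrate the hash-table idea (a toy sketch, not the SSCN implementation), the function below stores occupied voxels in a Python dict keyed by coordinates and computes outputs only at occupied sites, as a submanifold convolution does.

```python
import numpy as np

def toy_submanifold_conv3d(features: dict, weight: np.ndarray) -> dict:
    """features: {(x, y, z): feature vector} for occupied voxels only.
    weight: (3, 3, 3, C_in, C_out) kernel. Outputs are computed only at
    already-occupied sites, so the sparsity pattern is preserved."""
    out = {}
    for (x, y, z), _ in features.items():
        acc = np.zeros(weight.shape[-1])
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    neighbor = features.get((x + dx, y + dy, z + dz))  # hash lookup
                    if neighbor is not None:  # empty space contributes nothing
                        acc += neighbor @ weight[dx + 1, dy + 1, dz + 1]
        out[(x, y, z)] = acc
    return out

# Two occupied voxels with 4-dim features, 8 output channels.
feats = {(0, 0, 0): np.ones(4), (1, 0, 0): np.ones(4)}
w = np.random.rand(3, 3, 3, 4, 8)
print(len(toy_submanifold_conv3d(feats, w)))  # 2 outputs; empty voxels are skipped
```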
Another line of work utilizes octrees, a tree structure that hierarchically partitions space into octants. OctNet [Riegler et al., 2017] and O-CNN [Wang et al., 2017] build octrees to store occupied voxel locations, enabling more efficient sparse kernels. The hierarchical partitioning also allows computation to be adapted based on voxel density in different sub-regions.
In summary, 3D sparse convolutional networks leverage data structures like hash tables and octrees to skip empty areas during convolution. This allows dramatic savings in computation and memory with little loss in output quality. Efficiency improvements by sparse networks make training and deploying high-resolution 3D semantic scene completion more practical.
Graph neural network (GNN)
Point clouds provide a flexible geometric representation for 3D scenes. However, they lack grid structure, making adoption into regular convolutional networks difficult. Recent approaches have explored modeling point clouds as graphs instead, where points become nodes and edges capture spatial relations. Graph neural networks (GNNs) are then applied for feature learning and semantic prediction.
GNNs are neural network models that operate directly on graph-structured data, accounting for both the feature information of nodes and the topology of the graph connectivity. GNNs can refine node embeddings in a context-aware manner by propagating information along the edges of the graph. Various GNN architectures have been proposed, using propagation schemes based on spectral graph theory or localized message passing.
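A minimal message-passing layer is sketched below; it averages neighbor features and mixes them with each node's own embedding through a learned linear map. This is a generic illustration, not a specific published architecture.

```python
import torch
import torch.nn as nn

class SimpleMessagePassing(nn.Module):
    """One round of mean-aggregation message passing over a graph."""
    def __init__(self, dim: int):
        super().__init__()
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) node features; edge_index: (2, E) source/target node ids.
        src, dst = edge_index
        agg = torch.zeros_like(x)
        agg.index_add_(0, dst, x[src])                       # sum incoming messages per node
        deg = torch.bincount(dst, minlength=x.size(0)).clamp(min=1).unsqueeze(1)
        agg = agg / deg                                      # mean over neighbors
        return torch.relu(self.update(torch.cat([x, agg], dim=1)))

# Toy graph: 4 nodes (e.g. points), 3 undirected edges given in both directions.
nodes = torch.rand(4, 16)
edges = torch.tensor([[0, 1, 1, 2, 2, 3], [1, 0, 2, 1, 3, 2]])
print(SimpleMessagePassing(16)(nodes, edges).shape)  # torch.Size([4, 16])
```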
For semantic scene completion, graph-based representations allow the exploitation of relational structure in 3D scenes, while GNN models provide suitable machinery for processing such representations. By propagating information along geometric and semantic relations, GNNs can refine understanding of both modeled and unobserved parts of a scene. After refinement, node features can be decoded to produce complete semantic 3D representations. GNNs offer several advantages for SSC tasks:
Learning from local relationships. GNNs can effectively capture the spatial and feature-based relationships between points, leading to richer and more accurate semantic representations.
Flexibility in graph construction. Different graph construction methods can be tailored to specific tasks or datasets, allowing for flexibility in capturing relevant information.
Data efficiency. GNNs can operate directly on the sparse point cloud representation, reducing the need for extensive data pre-processing and memory requirements.

None of the recent works have developed end-to-end GNN models for semantic scene completion, although GNNs have demonstrated strong performance on the point cloud segmentation task. Numerous challenges remain regarding modeling complex scenes, generalization across environments, interpretability, and efficiency. Ongoing research is focused on developing more powerful and flexible GNN architectures tailored to semantic 3D tasks.

Overall, graph neural networks show promise for point cloud-based completion by integrating both semantic and geometric relationships. Key advantages include flexible neighborhood modeling and avoidance of quantization errors. Ongoing challenges include the computational complexity of dynamic graphs and effectively combining completion cues across different modalities and representations.

Capsule network (CapsNet)
Traditional convolutional neural networks struggle to capture spatial relationships and part-whole hierarchies within objects. Capsule networks (CapsNet) [Sabour et al., 2017] were proposed to address these issues using capsules.
Capsules are groups of neurons that encode the instantiation parameters of visual entities such as objects and parts. Capsules are activated when a visual entity is detected, and the output vector of a capsule represents the properties of that entity. The key characteristics of Capsule Networks are:
Capsule routing. Capsules in lower layers are routed dynamically to capsules in higher layers, allowing part-whole relationships to be learned.
Vector output. The vector output of a capsule encodes properties like pose, orientation, and scale instead of just detecting the presence of an entity.
Pose prediction. Capsules predict the poses of entities in the next layer, and the error between predicted and actual poses is used as the loss function.
These characteristics allow capsule networks to better capture the spatial relationships and hierarchical part decomposition within objects.
3D capsule-based networks have shown excellent performance over CNN backbones in medical segmentation tasks. Experiments by Avesta et al. [2023] demonstrate that capsule networks are effective for brain image segmentation, including segmenting images not well-represented in the training data. Compared to alternative methods, capsule networks provide competitive segmentation accuracy while being computationally efficient. Specifically, capsule networks successfully segmented diverse brain images not contained in the original training set. This generalization indicates their potential to segment rare or unusual scans accurately. Runtime evaluations reveal that capsule networks segment images faster than other approaches with similar precision. Together, these results highlight the promise of capsule networks to enable precise and efficient analysis of complex medical images.
For semantic scene completion, capsule networks could help model the spatial relationships between objects and parts within a 3D scene. The vector outputs of capsules could encode properties like object poses, orientations, and scales. The capsule routing mechanism could learn how object parts compose whole objects within the scene.
In summary, capsule networks can potentially improve semantic scene completion by better capturing the hierarchical spatial relationships and part decomposition within 3D scenes. The vector outputs and routing mechanisms of capsules align well with the requirements of this task. However, further research is needed to develop efficient 3D capsule networks and evaluate their performance on semantic scene completion.
The pioneering work of Song et al. [2017] was the first to tackle scene completion and semantic labeling of depth maps together. They introduced the semantic scene completion network (SSCNet), an end-to-end 3D convolutional network. SSCNet takes a single depth image as input and concurrently predicts occupancy and semantic labels for all voxels in the camera view frustum. This demonstrated that learning these two tasks jointly can lead to mutual benefits. Since then, many impressive SSC methods have been developed with good results. However, top performance on the challenging SemanticKITTI benchmark is still a little under 37% mIoU and 60% IoU (as shown in Section 5), indicating substantial room for improvement.
Reconstructing a dense 3D scene from either 2D or sparse 3D data poses a challenging problem due to the inherent lack of adequate information to resolve all uncertainties. Given the 3D nature of the task, there are clear benefits to using 3D inputs as they already provide geometric insights. Therefore, using sparse 3D surfaces such as occupancy grids, distance fields, or point clouds as input becomes more practical. Another approach is to use RGB data with depth data since they have the same spatial alignment and can be easily processed by 2D CNNs. Recent research can be classified into three groups based on their input encoding, and this overview examines papers from each of these groups.
RGB images as input
RGB-based methods have demonstrated promising performance for outdoor semantic scene completion tasks. The advancement of deep learning techniques for 3D reconstruction from 2D images has revealed a potential low-cost solution using only RGB cameras. This contrasts with more expensive, bulky laser sensors like LiDAR. The promise of camera-based SSC approaches as an economical alternative has motivated increased study of monocular methods in recent years as an emerging trend. By leveraging deep learning to lift 2D image features into detailed 3D semantic representations, RGB-based models provide a compelling balance of performance and practicality. Four particularly notable SSC methods have been proposed for outdoor scenes in the last two years.
MonoScene [Cao and de Charette, 2022]. MonoScene, proposed by Cao and de Charette [2022], is one of the first models to consider only a monocular RGB image as input. MonoScene pioneers monocular 3D semantic scene completion from a single RGB image for both indoor and outdoor scenes by proposing a novel 2D-3D network architecture consisting of successive 2D and 3D U-Nets connected via a specialized 2D-3D feature projection technique inspired by optics. It further incorporates a 3D contextual relation module and complementary scene completion losses to enhance context awareness and provide additional constraints. The experiments show encouraging results on public benchmarks, with qualitative findings indicating the possibility of accurately depicting both geometry and semantics extending beyond the image boundaries. Limitations include sensitivity to changes in camera parameters, deficiencies in fine detail recovery, and inferior performance compared to approaches leveraging explicit 3D supervision. Overall, MonoScene sets a new state-of-the-art for single-view, RGB-based semantic scene completion through its architectural innovations and shows promising generalization capabilities.
VoxFormer [Li, Yu, Choy, Xiao, Alvarez, Fidler, Feng and Anandkumar, 2023]. VoxFormer is a novel two-stage framework for camera-based 3D semantic scene completion proposed by Li, Yu, Choy, Xiao, Alvarez, Fidler, Feng and Anandkumar [2023]. It aims to generate a complete volumetric representation of semantics from only 2D images. The key innovation is a reconstruction-before-hallucination strategy, implemented by first proposing a sparse set of voxel queries from image depth to reconstruct visible structures, followed by a Transformer-based architecture to propagate information and complete the representations based on the proposed sparse voxels. Specifically, VoxFormer consists of a lightweight convolutional neural network that leverages estimated depth maps to propose likely occupied voxel queries in stage 1. Stage 2 then strengthens these voxel features via cross-attention and updates all voxels, including unoccupied ones associated with learned mask tokens, through self-attention layers to enable interactions. This allows for effectively envisioning entire semantic meanings. Experiments demonstrate that VoxFormer gains significant improvements in safety-critical short-range areas while being more computationally and memory efficient than existing approaches. However, the performance still lags behind LiDAR-based methods, and long-range completion quality could be further enhanced. Nonetheless, VoxFormer represents progress towards inexpensive and widely deployable camera-only semantic mapping solutions for autonomous vehicles.
OccFormer [Zhang et al., 2023]. OccFormer is a novel dual-path transformer network proposed for monocular camera-based 3D semantic occupancy prediction. It achieves long-range, dynamic, and efficient camera-generated 3D voxel feature encoding by decomposing the heavy 3D processing into complementary local and global transformer pathways. The key advantage of OccFormer is using attention mechanisms rather than 3D convolutions to process sparse and irregular 3D voxel features, enabling more effective semantic reasoning. Specifically, the local pathway captures fine details along the horizontal slices while the global pathway extracts scene-level layout on the Bird's Eye View. Their features are then adaptively fused. For decoding, OccFormer adapts the Mask2Former architecture and further proposes preserve-pooling and class-guided sampling to handle inherent sparsity and class imbalance. The reported results show that OccFormer achieved state-of-the-art performance on SemanticKITTI for semantic scene completion. A limitation is the heavy memory and computation cost of attention, which scales quadratically with sequence length, requiring techniques to improve efficiency. But overall, the dual-path design and transformers make OccFormer well-suited for 3D semantic occupancy prediction from monocular camera inputs.
Symphonize [Jiang et al., 2023]. Symphonize by Jiang et al. [2023] is a pioneering 3D semantic scene completion model that effectively exploits instance-centric semantics and global context to enhance scene modeling and understanding. It introduces sparse instance queries derived from input RGB images to selectively aggregate multi-scale features and voxel representations. Through continuous feature propagation between images, instances, and voxels using the proposed Serial Instance-Propagated Attention mechanism, Symphonize facilitates dimension promotion from 2D to 3D while resolving geometric ambiguity by leveraging contextual information. A key advantage is the ability to capture fine-grained details and produce coherent scene layouts, significantly outperforming prior RGB-based methods. The Depth-Rectified Voxel Proposal Layer also establishes initial geometry awareness by generating proposals on estimated surfaces. However, the lack of instance-level supervision may affect modeling granularity. By orchestrating images, instances, and volumes, Symphonize paves the way for future instance-centric scene analysis.
LiDAR point clouds as input
LiDAR-based approaches have dominated outdoor semantic scene completion due to the inherent 3D nature and geometric accuracy of LiDAR sensors. Various recent point-cloud-based methods demonstrate the benefits of leveraging the sparse 3D point representations provided by LiDAR. These methods apply 3D convolutional or graph neural networks directly on point clouds to aggregate context and produce dense completed voxel grids or mesh representations. Some techniques incorporate projected image features or depth maps to provide complementary information. The sparse 3D points provide precise geometric cues, while RGB or depth data contributes to semantic understanding.
LiDAR's wide horizontal field of view provides complete scene observation that aids completion, in contrast to limited camera frustums. The parallax from LiDAR's perspective also helps disambiguate objects positioned at different depths. Such geometric reasoning uniquely suits LiDAR to complex 3D reasoning tasks like SSC. We now briefly introduce six methodologies using point clouds directly as input, four of which are widely known, and the other two are the newest SOTA models in SSC.
LMSCNet [Roldao et al., 2020]. LMSCNet is a novel deep-learning method for multiscale 3D semantic scene completion from sparse LiDAR scans proposed by Roldao et al. [2020]. It employs a mix of 2D and 3D convolutions in a UNet-style architecture to generate dense voxel predictions for both occupancy and semantic labels. A key advantage of LMSCNet is its lightweight model design, which enables fast inference speeds, making it suitable for applications like mobile robotics. Specifically, using 2D convolutions reduces computational complexity, while multiple outputs at different scales allow coarse scene analysis at over 300 FPS at the 1:8 scale ratio. The network performed excellently on semantic completion metrics at release time while outperforming previous methods on occupancy metrics using the SemanticKITTI dataset. A relative disadvantage is that some 3D spatial connectivity may be lost with the 2D convolutions, and the predictions on small objects are highly unreliable. Overall, LMSCNet provides an excellent trade-off between accuracy and efficiency for 3D semantic scene completion. Its multiscale capability, in particular, makes it well-suited for applications requiring real-time coarse understanding of spatial layout and semantics.
JS3C-Net [Yan et al., 2021]. JS3C-Net is a novel framework proposed for enhanced single-sweep LiDAR point cloud semantic segmentation by exploiting learned contextual shape priors from semantic scene completion. The key advantage of JS3C-Net is its ability to overcome the performance limitations posed by the sparse and noisy nature of single-sweep LiDAR data by incorporating richer shape information from adjacent LiDAR frames. Specifically, it consists of three main components: a semantic segmentation module, a semantic scene completion module, and an interaction module. By merging dozens of consecutive LiDAR frames, the semantic scene completion module is trained to generate complete semantic voxel representations, capturing compelling shape priors. The interaction module then allows for implicit knowledge transfer between the incomplete point cloud segmentation and the complete voxel completion, enabling mutual performance improvements. A highlight is that the scene completion components can be discarded after training without impacting segmentation inference speed. Quantitative results on SemanticKITTI demonstrate superior segmentation and completion performance over other methods. A limitation is the need for sequential LiDAR data spanning multiple frames during training, which may not always be readily available. In general, JS3C-Net pushes the boundaries of sparse LiDAR processing by exploiting sequence-based shape priors within a multi-task learning paradigm.
SSA-SC [Yang et al., 2021]. SSA-SC is an end-to-end semantic-segmentation-assisted scene completion network for LiDAR point clouds proposed by Yang et al. [2021]. It consists of a 2D completion branch and a 3D semantic segmentation branch. The key idea is to leverage the complementary information between the BEV map and 3D voxels from the two branches to produce reasonable completion results for outdoor 3D scenes. A key advantage of SSA-SC is that by adopting a BEV representation and 3D sparse convolution, it can benefit from lower computational costs while maintaining effective feature representation. The network hierarchically merges features from the segmentation branch into the completion branch to provide semantic information and constraints that aid the completion task. Extensive experiments on the SemanticKITTI dataset demonstrate that SSA-SC performs well on semantic scene completion metrics with low latency. A limitation is that the performance on small object categories is not as strong as methods with specialized geometric losses. Overall, SSA-SC explores an effective combination of 2D and 3D networks for efficient semantic scene completion that reaches high completion metrics while retaining real-time inference speeds.
S3CNet [Cheng et al., 2021]. S3CNet is a sparse semantic scene completion network proposed by Cheng et al. [2021] for reconstructing large outdoor driving scenes from single LiDAR scans. It utilizes sparse convolutional neural networks to efficiently process sparse 3D point clouds and jointly solve the coupled tasks of scene completion and semantic segmentation. Key advantages of S3CNet are its ability to handle large outdoor scenes characterized by sparsity and occlusion through specifically designed sparse tensor losses and post-processing refinement. It also outperforms prior dense convolutional approaches reliant on indoor RGB-D data. On the SemanticKITTI benchmark dataset, S3CNet achieves great performance with a mean IoU of 29.5%. A 2D variant is also introduced to complement the 3D network predictions. However, a drawback is that S3CNet still struggles with distant small objects. Its multi-view fusion strategy is unable to completely offset the information loss from exponential sparsity growth. Real-time performance is also currently unfeasible due to computational demands. Overall, S3CNet represents an important step towards robust and efficient 3D scene understanding essential for autonomous navigation. Extensions to improve speed, fusion techniques, and spatial encoding show promise in overcoming limitations.
SSC-RS [Mei et al., 2023]. SSC-RS is a neural network proposed for semantic scene completion of LiDAR point clouds for autonomous driving systems. It takes a novel perspective of representation separation and BEV fusion to address this task. Specifically, SSC-RS uses two separate branches with deep supervision to explicitly disentangle the learning of semantic context and geometric structure representations. It also leverages a lightweight BEV fusion network to efficiently aggregate these representations captured at multiple scales. A key component is the Adaptive Representation Fusion module designed to selectively fuse informative cues from the two representations. Experiments show that SSC-RS achieves state-of-the-art performance on the SemanticKITTI benchmark while running in real time (nearly 17 fps). The explicit disentanglement of representations is demonstrated to facilitate optimization and improve accuracy. The BEV fusion paradigm also allows efficient computation and memory usage compared to dense 3D networks. One limitation is that SSC-RS does not explicitly model local geometry, leading to lower accuracy on small objects. Future work can focus on incorporating local shape priors to further boost performance. Overall, SSC-RS explores a novel and effective approach for semantic scene completion, with advantages in representation learning and efficiency.
SCPNet [Xia et al., 2023]. SCPNet is a semantic scene completion network for LiDAR point clouds proposed by Xia et al. [2023]. It focuses on addressing key challenges in outdoor semantic scene completion, including sparse and incomplete inputs, numerous objects across varying scales, and label noise from dynamic objects. SCPNet introduces three main solutions: first, redesigning the completion subnetwork to aggregate multi-scale features without lossy downsampling, maintaining sparsity for efficiency while maximizing information retention from raw point clouds; second, distilling dense semantic knowledge from a teacher multi-frame model to the student single-frame model using a novel pairwise semantic similarity loss termed DSKD; third, rectifying completion labels by removing long noisy traces left by dynamic objects using off-the-shelf panoptic labels. Experiments conducted on the SemanticKITTI and SemanticPOSS datasets show that SCPNet achieves state-of-the-art performance, outperforming a top method, S3CNet, by 7.2 mIoU on SemanticKITTI. The learned features also transfer well to segmentation tasks. Limitations include the computational overhead introduced by the redesigned completion subnetwork. Overall, SCPNet sets a new state-of-the-art for semantic scene completion through architectural design innovations and objectives tailor-made for effectively learning from sparse outdoor LiDAR scans.
Multi-sensor data fusion
Multi-sensor approaches aim to capitalize on the complementary strengths of different data modalities for outdoor semantic scene completion. The sparse 3D points from LiDAR provide precise geometric cues to aid completion, while the dense 2D semantics from RGB or depth maps contribute enhanced semantic understanding. The multi-view imagery also helps disambiguate objects positioned at different depths unseen by LiDAR's limited vertical perspective. Jointly training on multi-modal data enables implicit learning to leverage the complementary strengths of each sensor. The combined coverage, resolution, and reasoning capacities unlock a more holistic understanding of the scene.
Despite the potential advantages of using multi-sensor inputs, outdoor semantic scene completion (SSC) models based on fused multi-sensor data remain challenging. One major issue stems from the multimodality of the data sources themselves. The various sensors used, such as cameras, LiDAR, and radar, differ in measurement units, resolution, and spatiotemporal alignment. To effectively utilize the diversity offered by these heterogeneous modalities, the data streams must be precisely aligned spatially, geometrically, and temporally. Additionally, uncertainty in the data poses difficulties, including sensor noise, calibration errors, inconsistent data, and missing values. Each modality exhibits variations in reliability and precision. Aside from the TS3D model (initially proposed by Garbade et al. [2019] for handling indoor SSC tasks, then first used for demonstration on the SemanticKITTI dataset by Behley et al. [2019]), recent outdoor SSC methods have not tackled fusing multi-sensor data, likely due to these significant fusion difficulties versus utilizing a single data type like images.

Overcoming the fundamental challenges introduced by combining complementary but diverse sensor modalities remains an open research problem for multi-sensor outdoor semantic scene completion.

Discussion
Despite the rapid advancements in RGB-based scene completion techniques, methods utilizing LiDAR point clouds remain the primary choice for tackling the intricacies of 3D understanding, especially in expansive outdoor settings. Although camera-based approaches offer a more cost-effective option, their performance still falls behind leading LiDAR models by more than 10 mIoU on benchmarks like SemanticKITTI.
The detailed geometric information LiDAR provides allows for explicitly modeling a scene's 3D structure, facilitating completion and disambiguation. This imparts inherent advantages to LiDAR-based methods in accurately reconstructing layouts and surfaces, serving as crucial elements for comprehensive semantic completion. In contrast, RGB methods currently struggle with inferring missing 3D information, resulting in less geometrically plausible outcomes.
The wide horizontal field of view of LiDAR ensures a comprehensive observation of the entire scene, a critical aspect for successful completion. Conversely, RGB images cover only a frustum, leaving regions outside the view unobserved. This partly elucidates LiDAR's notable performance lead, particularly in classes such as "roads", "terrain", and "buildings", where RGB techniques must rely on hallucination without contextual support.
However, even the SOTA LiDAR-based models still struggle with small or highly occluded objects, for instance, "motorcyclist", "bicyclist", "person", and "truck". S3CNet shows significant robustness improvements for these classes; however, it has a high computational cost, making it impractical to apply in real-time applications. Moreover, the uneven point density of the input point cloud also affects these models' performance in distant, sparse regions. These remaining challenges in the SSC task have motivated us to propose a novel solution to tackle these weaknesses and provide a robust, compact, and reliable SSC model.
To summarize, notwithstanding the high cost and sparse vertical resolution of LiDAR prompting research into more affordable and adaptable RGB-based directions, LiDAR's direct 3D sensing remains pivotal for achieving state-of-the-art semantic scene completion. The advantages of sensing complete geometry and layouts sustain LiDAR's dominance. They will likely guide research until a significant technological breakthrough occurs in RGB-based models, narrowing the performance gap between the two. Consequently, the following section of the thesis focuses on enhancing LiDAR point cloud-based approaches to address the remaining challenges of SSC tasks.
Motivation
Semantic scene completion (SSC) models have made significant strides in recent years, with recent 2D-3D approaches like SSC-RS and SSA-SC demonstrating improved performance with fast inference times. However, critical vulnerabilities still plague these models, especially regarding far, sparse regions and information loss. This exposes a need for innovative solutions to advance the field.
Specifically, SSC-RS and SSA-SC rely heavily on 2D BEV convolutional neural networks (CNNs), which often project 3D features to 2D BEV through pooling. This leads to information loss in the reduced dimension. Additionally, the translation-invariant nature of CNNs with pooling layers makes these models susceptible to misinterpreting moving objects, like passing cars, as static, an obviously dangerous deficiency in real driving scenarios. Finally, the black-box complexity of these models reduces the interpretability of their reasoning and compounds difficulties in error analysis or trust calibration.
To address these limitations, we propose a novel SSC model that leverages a combination of techniques:
Cylindrical partition voxelization. Instead of using uniform cubic voxels, we utilize cylindrical voxels whose dimensions vary based on the radial distance from the sensor origin, addressing the imbalanced point distribution across sensor distances.
Cylinder-Cube Feature Mapping. Partitioning the 3D space into a voxel grid leverages the structured 3D representation and preserves spatial geometry, aiding the model in reasoning about occluded data. However, the quantization process inevitably leads to information loss. Moreover, special voxel partitions, such as cylinder partitions, cannot be directly mapped to the BEV feature. Hence, we introduce a Cylinder-Cube Feature Mapping (CCFM) module to refine the cylinder features with the point-wise features, reducing the information loss from quantization, and to map these point-wise features to the BEV capsule in cube space, further aiding the completion module.
BEV Capsule embedding. BEV representations provide an effective way to encode spatial information and geometric structures for completion tasks in a compact 2D format. By projecting the 3D point cloud data onto a top-down, overhead view, the BEV model can capture the layout, shapes, and relationships between objects and scene elements in a structured manner. However, mapping 3D data to the BEV 2D format leads to a certain amount of information loss, especially of height variations or vertical structures. Hence, we use grouped convolutional layers together with capsule embeddings for the completion model to capture the probability, pose, and directional information of all classes for each BEV pixel. Capsule Networks are also well-known for their translation equivariance and ability to model hierarchical and part-whole relationships in data. To keep these characteristics, we do not use any pooling layer for our BEV model.
By combining these innovative elements, our proposed model aims to overcome the limitations of existing SOTA approaches and achieve superior performance in semantic scene completion tasks. In the following part of this section, we describe the detailed methodology of the proposed model.
Methodology
Overview
With our proposed SSC model, we try to resolve the entire scene from a sparse single frame of LiDAR point cloud. This model tackles the limitations of existing approaches by combining three key innovations: cylindrical partition voxelization, the Cylinder-Cube Feature Mapping module, and BEV Capsule embedding. First, we provide a high-level overview of each component and its role in the pipeline. Figure 4.1 illustrates the overall architecture of the proposed network.
First, the model receives raw input point clouds from the sensor. These point clouds are processed through an MLP to extract point-wise features. These features are then used to generate initial cylinder and BEV features. The cylinder features are processed by an asymmetrical sparse 3D UNet, which provides both voxel-wise and point-wise semantic predictions. Notably, the decoder part of the 3D UNet is activated only during training, enhancing the efficiency and speed of inference. The BEV features are input into a 2D UNet to encode semantic and geometric details, with the encoder supported by semantic information from the 3D encoder via the CCFM module. In the high-level latent space, a capsule network encodes column feature vectors, capturing pose, geometric, and other essential information for semantic scene completion. Finally, the 2D decoder utilizes the features extracted by the capsule network to produce a final 2D BEV feature map, which is then converted into the 3D voxel output prediction of the model.
In the remaining parts of this section, we will introduce each module in detail, explain how it functions, and describe its contribution to the overall model performance.
Point-wise feature extraction
Given the input point cloud $P \in \mathbb{R}^{N \times 3}$ in the range $[R_x, R_y, R_z]$, for each point we calculate the cylinder coordinates $P_{cyl} \in \mathbb{R}^{N \times 3}$ by transforming the first two dimensions of the point coordinates to polar coordinates and keeping the original final dimension. The cylinder coordinates $P_{cyl}$ are then used to calculate the voxel centroid positions $P^{cen}_{cyl} \in \mathbb{R}^{M \times 3}$, where $M$ is the number of occupied cylinder voxels. The input point features $F \in \mathbb{R}^{N \times 10}$ are created by concatenating $d_{cyl}$, $P_{cyl}$, $P$, and $r$, where $d_{cyl}$ denotes the distances (per-axis offsets) from the points to their voxel centroids and $r$ is the remission information of the input point cloud. Our model first extracts the point-wise features $f_p \in \mathbb{R}^{N \times 256}$ through several MLPs. These features capture the local, low-level characteristics of the input point cloud and are used to create the cylinder features and BEV features for the next stages.
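A sketch of this feature construction is given below; the bin sizes, the flat voxel-id scheme, and the MLP width are illustrative assumptions, and the scatter-based centroid computation is one possible realization of the description above.

```python
import torch
import torch.nn as nn

def build_point_features(points: torch.Tensor, remission: torch.Tensor,
                         voxel_size=(0.2, 0.02, 0.2)):
    """points: (N, 3) xyz, remission: (N, 1). Returns the 10-dim input features F
    and the cylinder-voxel index of each point (values in [0, M))."""
    x, y, z = points.unbind(dim=1)
    rho, phi = torch.sqrt(x**2 + y**2), torch.atan2(y, x)       # polar transform of (x, y)
    p_cyl = torch.stack([rho, phi, z], dim=1)                    # (N, 3) cylinder coordinates
    idx = torch.div(p_cyl - p_cyl.min(0).values,
                    torch.tensor(voxel_size), rounding_mode="floor").long()
    flat = idx[:, 0] * 1_000_000 + idx[:, 1] * 1_000 + idx[:, 2]  # toy flat voxel id
    uniq, inv = flat.unique(return_inverse=True)                  # M occupied voxels
    # Per-voxel centroids, then per-point offsets d_cyl to the owning centroid.
    centroids = torch.zeros(len(uniq), 3).index_add_(0, inv, p_cyl)
    centroids /= torch.bincount(inv).unsqueeze(1)
    d_cyl = p_cyl - centroids[inv]
    feats = torch.cat([d_cyl, p_cyl, points, remission], dim=1)   # F: (N, 10)
    return feats, inv

mlp = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 256))  # f_p extractor
pts, rem = torch.rand(1000, 3) * 50, torch.rand(1000, 1)
F, voxel_of_point = build_point_features(pts, rem)
f_p = mlp(F)                                                     # (1000, 256) point features
```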
Cylindrical partition voxelization
3D point clouds obtained from LiDAR sensors provide precise information for downstream 3D tasks, such as semantic segmentation and scene completion. However, their main characteristic is the sparsity and varying density of the point distribution across the captured scene. This is attributed to the diverging laser beams, which result in a higher density of points in regions proximal to the sensor, with density decreasing quadratically as the distance from the sensor increases. Consequently, adopting a uniform voxel grid subdivision of the 3D space, as employed by many contemporary approaches, produces an extremely imbalanced spatial quantization that is suboptimal for learning discriminative representations.
To elucidate, cubic voxelization operates under the assumption of a uniform point distribution, failing to account for the aforementioned varying density exhibited in LiDAR scans. This results in a significant proportion of vacant voxels in farther ranges, constituting sparse inputs that hinder comprehensiveness during convolution operations. Additionally, due to the uniform quantization, expansive voxels are allocated to densely populated regions, potentially causing injudicious sampling and loss of fine-grained structural details critical for robust scene understanding.
In contrast, the cylindrical partitioning scheme encodes the radially decreasing point distribution into its formulation. A more balanced spatial subdivision is achieved by conforming the voxel dimensions to increase monotonically with the range from the sensor. This equilibrium in voxel occupancies across the scene facilitates enhanced feature learning and mitigation of the sparse data predicament. Furthermore, the angular decomposition around the sensor's principal axis captures the rotational invariance ubiquitous in most outdoor environments, further regularizing the learned representations.
Zhu et al. [2021] were the first to implement cylindrical partitioning in 3D point cloud semantic segmentation, demonstrating significant performance improvements over conventional cube quantization methods. Despite these advancements, current methodologies for the SSC task continue to rely on cube partitioning to match the SSC output partition, thus failing to leverage the benefits of cylindrical quantization.
In this context, our research represents the first successful and efficient integration of cylindrical quantization into a multi-modal model. We accomplish this by introducing a Cylinder-Cube Feature Mapping process (Section 4.2.5), designed to address the intrinsic properties of point cloud data.
The extracted point-wise features $f_p$ are reassigned to obtain voxel features. For this project, we formulate the voxel features $f_{cyl}$ as:
$$f_{cyl} = \mathrm{ReLU}\Big(\mathrm{Linear}\Big(\max_{p \in P^{cen}_{cyl}}(f_p)\Big)\Big) \tag{4.1}$$
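Equation (4.1) amounts to a per-voxel maximum over the point features followed by a linear layer and a ReLU. A minimal sketch is shown below, reusing a point-to-voxel index like the one from the previous subsection; the feature widths and the toy assignment are illustrative.

```python
import torch
import torch.nn as nn

def voxelize_features(f_p: torch.Tensor, voxel_of_point: torch.Tensor,
                      linear: nn.Linear) -> torch.Tensor:
    """Eq. (4.1): max-pool point features into their cylinder voxel,
    then apply Linear + ReLU. voxel_of_point maps each point to a voxel id."""
    uniq, inv = voxel_of_point.unique(return_inverse=True)     # occupied voxels only
    pooled = torch.full((len(uniq), f_p.size(1)), float("-inf"))
    # Scatter-max: keep the element-wise maximum of all points in each voxel.
    pooled = pooled.scatter_reduce(0, inv.unsqueeze(1).expand_as(f_p),
                                   f_p, reduce="amax", include_self=True)
    return torch.relu(linear(pooled))

f_p = torch.rand(1000, 256)                      # point-wise features
voxel_of_point = torch.randint(0, 300, (1000,))  # toy point-to-voxel assignment
f_cyl = voxelize_features(f_p, voxel_of_point, nn.Linear(256, 128))
print(f_cyl.shape)                               # (M, 128) cylinder voxel features
```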
The cylindrical partitioning methodology not only yields a spatially coherent quantization addressing the varying-density deficiency but also retains the intrinsic 3D structure within the point cloud data. This circumvents the dimensionality reduction shortcomings of conventional 2D projections, which can lead to a severe loss of geometric context. By preserving the 3D topology, the proposed partitioning scheme unlocks the potential for superior 3D reasoning within the deep neural architecture.
Asymmetrical Sparse 3D UNet
Following the cylindrical partition, an asymmetrical 3D convolution network is employed. The Asymmetrical Sparse 3D UNet is a novel neural network architecture designed for efficient semantic segmentation of large-scale point cloud data. This architecture leverages the cylindrical partition and asymmetrical 3D convolution networks to effectively handle the inherent challenges of outdoor LiDAR point clouds, such as sparsity and varying density. These networks incorporate asymmetrical residual blocks, strengthening the horizontal and vertical kernels, thereby matching the point distribution of objects in driving scenes. This design choice enhances the network's ability to capture the salient features of sparse and irregularly distributed point clouds, improving the overall representational power.
In detail, we employ the same encoder-decoder structure within this semantic segmentation branch, modified from the 3D UNet architecture utilized in Cylinder3D [Zhu et al., 2021], which is implemented using 3D sparse convolution. As depicted in Figure 4.2, both the asymmetrical downsample and upsample blocks in the network incorporate an Asymmetrical Residual Block. By using the combination of asymmetric convolution layers, the model can capture features in the sparse point distribution regions more precisely while consuming 33% less memory and saving computational resources.
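For illustration, the sketch below shows an asymmetrical residual block in the style of Cylinder3D, pairing 3x1x3 and 1x3x3 kernels; the exact kernel split is an assumption based on the cited design, and the block is written with dense Conv3d for readability, whereas the actual branch uses sparse convolutions.

```python
import torch
import torch.nn as nn

class AsymResBlock(nn.Module):
    """Residual block with asymmetric 3x1x3 / 1x3x3 kernels (dense stand-in
    for the sparse-convolution version used in the cylinder branch)."""
    def __init__(self, channels: int):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=(3, 1, 3), padding=(1, 0, 1)),
            nn.BatchNorm3d(channels), nn.LeakyReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.BatchNorm3d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.branch(x) + x)   # residual connection

x = torch.rand(1, 32, 16, 16, 8)                # (batch, C, rho, phi, z) toy grid
print(AsymResBlock(32)(x).shape)                # same shape as the input
```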
The Dimension-Decomposition based Context Modeling Block (DDCB) explores "high-rank" context by decomposing high-level context information into low-level features across three dimensions (length, width, height). By decomposing the data into multiple dimensions, DDCB captures the intricate relationships between voxels across the entire cylinder. Subsequently, it aggregates these low-level activations to derive comprehensive context features. The sigmoid function is applied to each extracted piece of context information, allowing each channel to activate independently based on the strength of the corresponding feature in that voxel. The progressive nature of DDCB helps the model learn these relationships efficiently, improving the overall effectiveness of the framework.
The Asymmetrical Sparse 3D UNet architecture incorporates a Point-wise Refinement (PR) module to alleviate the interference of lossy voxel-label encoding, a common issue in partition-based methods. This module refines the voxel-wise outputs by leveraging the point-wise features both before and after the 3D convolution networks, mitigating the information loss caused by cell-label encoding.
Cylinder-Cube Feature Mapping
The proposed framework synergistically combines a cylindrical 3D encoder tailored for semantic feature extraction from unstructured point clouds and a Bird's Eye View (BEV) 2D decoder adept at modeling structural geometric representations. However, a key challenge lies in effectively bridging the modalities to leverage the rich semantic cues distilled by the 3D encoder while informing the 2D decoder's geometric reasoning.
To this end, we devise a Cylinder-Cube Feature Mapping (CCFM) module that serves as an interface between the two branches, facilitating cross-modal feature transfer. The CCFM module aims to map the high-dimensional semantic features encoded in the cylindrical voxel tensor to the 2D BEV feature space, thereby imbuing the decoder with an enriched semantic comprehension of the 3D scene.
Specifically, the CCFM module first projects the voxel-wise features from the cylindrical 3D encoder, $f^{i}_{cyl} \in \mathbb{R}^{M_i \times d}$ where $i$ is the layer index, onto the point level through a point-voxel mapping operation. Each 3D point's corresponding voxel feature is retrieved and assigned, effectively replicating the voxel semantics onto the sparse point locations. These point-wise semantic features are then concatenated with the raw point features extracted at the beginning to form the joint point-wise features $f'^{i}_{p} \in \mathbb{R}^{N \times d'_i}$, amalgamating both high-level semantics and low-level geometric detail.

Figure 4.3: Cylinder-Cube Feature Mapping (CCFM) module.
The concatenated point features $f'^{i}_{p}$ are subsequently mapped to the 2D BEV space through an efficient max pooling and rasterization procedure to form the semantics-aided BEV features $f'^{i}_{bev} \in \mathbb{R}^{O \times d'^{i}_{bev}}$. This mapping leverages the inherent sensor geometry, accumulating point-level features onto a uniform BEV grid, yielding a structured BEV feature map that fuses 3D semantics with 2D geometric priors.
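A compact sketch of this two-step mapping (voxel-to-point gather, then point-to-BEV max rasterization) is given below; the grid resolution, feature widths, and random index assignments are illustrative placeholders.

```python
import torch

def ccfm_to_bev(f_cyl: torch.Tensor, voxel_of_point: torch.Tensor,
                f_p: torch.Tensor, bev_cell_of_point: torch.Tensor,
                num_bev_cells: int) -> torch.Tensor:
    """Gather each point's cylinder-voxel feature, concatenate it with the raw
    point feature, then max-pool the result into its BEV (cube-space) cell."""
    f_joint = torch.cat([f_cyl[voxel_of_point], f_p], dim=1)        # (N, d')
    bev = torch.full((num_bev_cells, f_joint.size(1)), float("-inf"))
    bev = bev.scatter_reduce(0, bev_cell_of_point.unsqueeze(1).expand_as(f_joint),
                             f_joint, reduce="amax", include_self=True)
    return torch.where(torch.isinf(bev), torch.zeros_like(bev), bev)  # empty cells -> 0

# Toy shapes: 1000 points, 300 cylinder voxels, a 64x64 BEV grid.
f_cyl, f_p = torch.rand(300, 64), torch.rand(1000, 256)
voxel_of_point = torch.randint(0, 300, (1000,))
bev_cell_of_point = torch.randint(0, 64 * 64, (1000,))
print(ccfm_to_bev(f_cyl, voxel_of_point, f_p, bev_cell_of_point, 64 * 64).shape)
```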
The resultant BEV feature map, enriched with semantic cues from the cylindrical encoder, serves as an informative prior for the subsequent 2D decoder network. This cross-modal feature fusion enables the decoder to capitalize on high-level semantic representations while preserving its native geometric reasoning capabilities within the 2D BEV domain.
Moreover, the CCFM module introduces a point-wise refinement mechanism that mitigates the quantization artifacts stemming from the lossy voxelization process. By preserving fine-grained point-level features, this refinement step alleviates the detrimental effects of structural degradation, further enhancing the fidelity of the semantic representations.
In essence, the Cylinder-Cube Feature Mapping module establishes an effective cross-modal bridge that integrates the semantic strength of the cylindrical 3D encoder with the geometric proficiency of the BEV 2D decoder. This fusion of modalities yields a holistic scene understanding that combines rich semantics with structural geometry, advancing the overall semantic scene completion task.
Bird's Eye View Capsule UNet
The BEV feature with semantic cues is a fast yet powerful way to partially resolve environmental occlusion from the overhead view.
However, BEV inherently reduces the dimensionality of the data from 3D to 2D, which can result in the loss of height information. This dimensionality reduction may omit critical details about the vertical structure of objects, leading to less accurate scene understanding.
Hence, to enrich the feature representation of each flattened column, we incorporate a set of grouped convolutional layers that encode capsule embeddings within a UNet structure, capturing the scene's geometric cues.
By introducing capsules into the UNet structure, the CapsNet module aims to extract features that go beyond mere presence or absence and encompass additional informative properties, such as the pose and spatial orientation of objects within the scene, while keeping the model lightweight and efficient through grouped convolutional layers.
Traditionally, convolutional neural networks (CNNs) process data using kernels that generate feature maps. These feature maps encode spatial information but cannot represent the features' orientation or pose within the data. Capsule Networks address this limitation by employing capsules, which are vector-valued representations that encode a feature's presence together with its instantiation parameters, such as orientation or pose.
Figure 4.5: 2D Grouped Conv Capsule layer.
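A possible form of such a 2D grouped-convolution capsule layer is sketched below; the class name, capsule count, and capsule dimension are illustrative assumptions, and the squash activation of Eq. (4.2) is applied to its output afterwards.

```python
import torch.nn as nn

class GroupedConvCapsuleLayer(nn.Module):
    """Sketch of a 2D grouped-convolution capsule layer.

    The input channels are split into `n_caps` groups; each group is
    convolved independently (cheap and parallel) and reshaped into a
    d-dimensional capsule vector per spatial location.
    """
    def __init__(self, in_channels, n_caps=8, caps_dim=16, kernel_size=3):
        super().__init__()
        assert in_channels % n_caps == 0, "channels must split evenly into capsules"
        self.n_caps, self.caps_dim = n_caps, caps_dim
        self.conv = nn.Conv2d(in_channels, n_caps * caps_dim, kernel_size,
                              padding=kernel_size // 2, groups=n_caps)

    def forward(self, x):                          # x: (B, C, H, W) BEV features
        out = self.conv(x)                         # (B, n_caps*caps_dim, H, W)
        B, _, H, W = out.shape
        # Capsule vectors along the last dimension; squash (Eq. 4.2) is applied next.
        return out.view(B, self.n_caps, self.caps_dim, H, W).permute(0, 1, 3, 4, 2)
```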
The BEV Capsule UNet first takes in the initial BEV features mapped from the point-wise features to form the primary capsules. The capsule encoder then outputs the class capsules, which encode information for each class in the training targets. These class capsules are then fed into the decoder to produce the prediction for the SSC task. Specifically, after each grouped convolution in the encoder, the extracted capsules are passed through a routing function to vote for the next layer of capsules and activated with a function tailored specifically for capsules, called squashing. The squashing function represents the probability that an entity exists through the length of the output vector, so that active capsules can predict the instantiation parameters of higher-level capsules. A squashing function must satisfy two key requirements: first, it preserves the orientation of the input vector; second, it clamps the output length between 0 and 1 while penalizing according to the vector's actual length. We use the modified version of the original squashing activation used in [Mazzia et al., 2021], referred to as the "dubbed squash" operation:

$$\mathrm{squash}(\mathbf{s}^{l}_{n}) = \left(1 - \frac{1}{e^{\|\mathbf{s}^{l}_{n}\|}}\right)\frac{\mathbf{s}^{l}_{n}}{\|\mathbf{s}^{l}_{n}\|} \qquad (4.2)$$

where $\mathbf{s}^{l}_{n} \in \mathbb{R}^{d}$ is a single capsule at entry $n$ of layer $l$.
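Eq. (4.2) translates directly into a few lines of PyTorch, assuming capsule vectors lie along the last tensor dimension; the small epsilon is an added safeguard for numerical stability.

```python
import torch

def squash(s, eps=1e-8):
    """Squash activation of Eq. (4.2): keeps the capsule's orientation and
    maps its length into (0, 1) so it can be read as an existence probability.

    s: (..., d) capsule vectors along the last dimension.
    """
    norm = s.norm(dim=-1, keepdim=True)
    scale = 1.0 - torch.exp(-norm)        # = 1 - 1/e^{||s||}
    return scale * s / (norm + eps)       # eps only guards against division by zero
```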
Self-Attention Routing. Traditional routing algorithms in capsule networks involve iterative processes, making them computationally expensive and impractical for real-time applications. Hence, we utilize the self-attention routing mechanism from [Mazzia et al., 2021]. Unlike traditional routing, self-attention routing is a single feed-forward operation, which makes it computationally efficient and parallelizable. Its core is an "agreement measure": a metric that calculates how well a lower-level capsule's activation vector aligns with the activation vector of a higher-level capsule. Intuitively, capsules representing similar features should have high agreement. Based on the agreement measure, the self-attention mechanism assigns routing weights, which determine how much influence each lower-level capsule has on the corresponding higher-level capsule. Finally, the activations of the lower-level capsules are weighted by the routing weights and summed to form the output of the higher-level capsule.
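The sketch below illustrates such a single-pass, agreement-based routing step; the tensor layout and the scaled dot-product agreement are simplifying assumptions and may differ from the exact formulation of [Mazzia et al., 2021].

```python
import torch

def self_attention_routing(u_hat):
    """Single-pass, agreement-based routing (sketch).

    u_hat: (B, N_low, N_high, d) votes of each lower-level capsule for each
    higher-level capsule. Returns (B, N_high, d) higher-level capsules.
    """
    d = u_hat.shape[-1]
    votes = u_hat.permute(0, 2, 1, 3)                          # (B, N_high, N_low, d)
    # Agreement: scaled dot products between votes for the same higher capsule.
    agreement = torch.einsum('bhid,bhjd->bhij', votes, votes) / d ** 0.5
    # Routing weight of a lower capsule = softmax of its total agreement.
    weights = torch.softmax(agreement.sum(dim=-1), dim=-1)     # (B, N_high, N_low)
    s = torch.einsum('bhi,bhid->bhd', weights, votes)          # weighted sum of votes
    norm = s.norm(dim=-1, keepdim=True)                        # squash, as in Eq. (4.2)
    return (1.0 - torch.exp(-norm)) * s / (norm + 1e-8)
```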
Loss function
To train the entire model end-to-end, we supervise it with the following five loss functions:
Capsule margin loss. The margin loss, proposed by [Sabour et al., 2017], is mainly used to train the capsule network and encourages it to make clear-cut predictions: it pushes the capsule outputs of the correct class to have a high magnitude (strong activation) while suppressing those of incorrect classes (low magnitude). Denoted as $L_{margin}$, this loss is applied to the class capsules extracted by the Capsule UNet encoder. Given the predicted label $\hat{y}^{*}$ and the downsampled ground truth $y^{*}$, the loss is formulated as follows:

$$L_{margin} = \sum_{k}\Big[\, y^{*}_{k}\,\max(0,\ m^{+} - \|\hat{y}^{*}_{k}\|)^{2} + \lambda\,(1 - y^{*}_{k})\,\max(0,\ \|\hat{y}^{*}_{k}\| - m^{-})^{2} \Big] \qquad (4.3)$$

where $m^{+}$, $m^{-}$, and $\lambda$ are the margin and down-weighting hyperparameters of [Sabour et al., 2017].
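A minimal PyTorch sketch of this margin loss, with the commonly used hyperparameter values of [Sabour et al., 2017] as assumed defaults:

```python
import torch

def margin_loss(class_caps, target, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Capsule margin loss sketch.

    class_caps: (B, K, d) class capsules; their lengths act as class scores.
    target:     (B, K) downsampled one-hot (or multi-hot) ground truth.
    """
    lengths = class_caps.norm(dim=-1)                              # (B, K)
    pos = target * torch.clamp(m_pos - lengths, min=0) ** 2        # push present classes up
    neg = (1 - target) * torch.clamp(lengths - m_neg, min=0) ** 2  # push absent classes down
    return (pos + lam * neg).sum(dim=1).mean()
```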
Focal loss. In the SSC task, there is a significant imbalance between the number of points in each class. This imbalance can lead the model to prioritize learning easy examples and neglect harder ones. To tackle this problem, we use the focal loss, an improvement over the traditional cross-entropy loss with a modulating term, to supervise model training. The additional term down-weights the contribution of easy examples during training and focuses the model on hard-to-classify examples. Given the SSC prediction of the model $\hat{y}$, the focal loss is formulated as follows:
$$L_{focal} = -\alpha\,(1-\hat{y})^{\gamma}\log(\hat{y}) \qquad (4.4)$$

where $\alpha$ denotes the weight that balances the loss between different classes; a higher $\alpha$ value increases the focus on the less frequent class. $(1-\hat{y})^{\gamma}$ is the modulating factor that down-weights easy examples. When $\hat{y}$ is high (a confident correct prediction), the term approaches zero, reducing the impact of easy examples on the loss. Conversely, when $\hat{y}$ is low (misclassified), the term increases the loss, forcing the model to learn from those examples. $\gamma$ controls the degree of focus on hard examples; higher $\gamma$ values down-weight easy examples more strongly.
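The following is a sketch consistent with Eq. (4.4) for the multi-class case; the ignore index and the per-class weight vector are assumptions about the interface rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha, gamma=2.0, ignore_index=255):
    """Multi-class focal loss sketch.

    logits: (N, K) raw class scores per voxel or point.
    target: (N,) integer class labels.
    alpha:  (K,) tensor of per-class weights (larger for rarer classes).
    """
    log_p = F.log_softmax(logits, dim=-1)
    valid = target != ignore_index
    t = target[valid]
    log_pt = log_p[valid].gather(1, t.unsqueeze(1)).squeeze(1)   # log(y_hat) of the true class
    pt = log_pt.exp()                                            # y_hat of the true class
    loss = -alpha[t] * (1.0 - pt) ** gamma * log_pt              # down-weight easy examples
    return loss.mean()
```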
Lovasz-softmax loss. The Lovasz-softmax loss [Berman et al., 2018], denoted as $L_{lovasz}$, is used to improve the optimization of the Intersection-over-Union (IoU) metric during training. It acts as a surrogate loss: it closely tracks the IoU metric yet remains differentiable, allowing the entire network to be optimized efficiently with gradient descent. The core idea is the Lovasz extension, which transforms the submodular set function associated with IoU into a smooth, convex surrogate.
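The key ingredient is the gradient of the Lovasz extension, sketched below following the published reference implementation of [Berman et al., 2018]; the surrounding per-class error sorting and weighting are omitted for brevity.

```python
import torch

def lovasz_grad(gt_sorted):
    """Gradient of the Lovasz extension of the Jaccard index for one class.

    gt_sorted: (P,) 0/1 ground-truth indicators, sorted by decreasing
    prediction error. Returns the piecewise-linear weights that make the
    IoU surrogate differentiable.
    """
    gts = gt_sorted.sum()
    intersection = gts - gt_sorted.cumsum(0)
    union = gts + (1.0 - gt_sorted).cumsum(0)
    jaccard = 1.0 - intersection / union
    if gt_sorted.numel() > 1:
        jaccard[1:] = jaccard[1:] - jaccard[:-1]   # cumulative IoU -> per-step weights
    return jaccard
```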
Semantic loss. For the 3D semantic segmentation branch, both the focal loss and the Lovasz-softmax loss are applied to the voxel-wise and point-wise predictions:
$$L_{sem} = \alpha\,(L^{pws}_{focal} + L^{vws}_{focal}) + \beta\,(L^{pws}_{lovasz} + L^{vws}_{lovasz}) \qquad (4.5)$$

where $\alpha$ and $\beta$ are two hyperparameters that control the influence of the focal-loss and Lovasz-softmax terms; $pws$ stands for point-wise semantic and $vws$ for voxel-wise semantic.
Completion loss. For the 2D BEV completion branch, the focal loss and the Lovasz-softmax loss are applied to the final SSC prediction of the model, and the margin loss is used to supervise the features extracted by the capsule network:
$$L_{com} = \lambda_{1}\,L^{com}_{focal} + \lambda_{2}\,L^{com}_{lovasz} + \lambda_{3}\,L_{margin} \qquad (4.6)$$

where each $\lambda_{i}$ controls the influence of the corresponding loss term.
Datasets
The SemanticKITTI dataset, introduced by Behley et al. [2019], provides a large-scale benchmark specifically tailored for semantic scene completion using real-world LiDAR data.
The SemanticKITTI dataset is based on the popular KITTI Vision Benchmark suite and provides dense semantic annotations for all LiDAR scans across 22 driving sequences in urban and highway environments. In total, over 43,000 scans with 25 object classes were manually annotated, resulting in over 4.5 billion labeled 3D points. Importantly, sequential scans in the dataset capture different views of objects as the vehicle moves, providing ground-truth information to evaluate scene completion approaches.
By projecting and accumulating future LiDAR scans in a region ahead of the vehicle, dense voxel representations can be generated as input-target pairs for semantic scene completion. The inputs represent the visible partial scene, while the targets are the completed scene with full object geometries and semantic labels. For benchmarking, 19,130 training pairs and 3,992 test pairs of voxel grids were extracted from the SemanticKITTI sequences.
The availability of this large-scale real-world dataset enables the development of learning-based approaches for semantic scene completion on LiDAR data. Sequential information and object motion also allow the modeling of temporal dependencies. As evidenced by the baseline experiments in the dataset paper, this task has significant scope for progress. The SemanticKITTI dataset is therefore poised to spur advances in semantic scene completion research using LiDAR point clouds in self-driving and other robotics domains.
Metrics
A primary metric is the Intersection over Union (IoU) between the predicted and ground-truth completed scenes. As scene completion entails predicting voxel occupancies, IoU is computed based on whether each voxel is correctly classified as occupied or empty, ignoring the semantic labels. This evaluates geometric completeness.
Additionally, the mean Intersection over Union (mIoU) across semantic classes evaluates both correct voxel occupancies and semantics. Along with overall accuracy, mIoU is the main metric used to judge semantic scene completion algorithms. The set of classes matches those defined for single-frame semantic segmentation to enable comparative assessments.
To provide fuller insight, the per-class IoU is also reported, detailing the model's performance on each class in the dataset.
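As a reference, the sketch below computes the completion IoU and the semantic mIoU from dense voxel label grids; the convention that class 0 denotes empty voxels and the ignore index of 255 are assumptions mirroring common SemanticKITTI tooling rather than the official evaluation code.

```python
import numpy as np

def ssc_metrics(pred, gt, n_classes, ignore_index=255):
    """Completion IoU and semantic mIoU from voxel label grids.

    pred, gt: integer arrays of the same shape with one class id per voxel,
    where class 0 means "empty" and `ignore_index` marks unlabeled voxels.
    """
    valid = gt != ignore_index
    p, g = pred[valid], gt[valid]

    # Completion IoU: occupied vs. empty, semantics ignored.
    occ_p, occ_g = p > 0, g > 0
    completion_iou = np.logical_and(occ_p, occ_g).sum() / max(np.logical_or(occ_p, occ_g).sum(), 1)

    # Per-class IoU over the semantic classes (class 0 excluded).
    ious = []
    for c in range(1, n_classes):
        union = np.logical_or(p == c, g == c).sum()
        if union > 0:                              # skip classes absent from both
            ious.append(np.logical_and(p == c, g == c).sum() / union)
    miou = float(np.mean(ious)) if ious else 0.0
    return completion_iou, miou, ious
```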
Training setting
Experimental Environment. All experiments in this section are conducted on Intel(R) Xeon(R) Gold 5315Y CPUs @ 3.20GHz, coupled with an NVIDIA RTX A6000 GPU.
Evaluation
Quantitative results
We evaluate the test performance of our proposed model against state- of-the-art methods and our baseline on this challenging outdoor seman- tic scene completion task.
The quantitative results (Table 5.2) indicate that our method achieves the highest IoU score and ranks among the top three methods in terms of mIoU, outperforming the baseline model by 1.1% and 1.5% for IoU and mIoU, respectively. Regarding per-class IoU, SCPNet demonstrates superior performance for nearly all classes, benefiting from its multi-scan teacher model, which enables it to acquire knowledge from a significantly denser input point cloud. S3CNet, aided by a local geometric loss, also performs remarkably well, particularly on challenging and minor classes such as "bicycle," "motorcycle," "person," and "bicyclist." However, our model still outperforms other 3D-2D models, such as SSC-RS and, notably, our baseline, on nearly all classes, while exhibiting significantly faster runtime and easier training than 3D models like S3CNet and SCPNet.

Method | FPS | IoU | mIoU | car | bicycle | motorcycle | truck | other-vehicle | person | bicyclist | motorcyclist | road | parking | sidewalk | other-ground | building | fence | vegetation | trunk | terrain | pole | traffic-sign
JS3C-Net [Yan et al., 2021] | X | 56.6 | 23.8 | 33.3 | 14.4 | 8.8 | 7.2 | 12.7 | 8.0 | 5.1 | 0.4 | 64.7 | 34.9 | 39.9 | 14.1 | 39.4 | 30.4 | 43.1 | 19.6 | 40.5 | 18.9 | 15.9
S3CNet
SSC-RS [Mei et al., 2023] | 27.7 | 59.7 | 24.2 | 36.4 | 10.1 | 5.1 | 5.3 | 11.2 | 4.7 | 2.4 | 0.9 | 73.1 | 38.6 | 44.4 | 17.4 | 44.6 | 30.8 | 44.1 | 26.0 | 41.9 | 15.0 | 7.2
SCPNet [Xia et al., 2023] | X | 56.1 | 36.7 | 46.4 | 33.2 | 34.9 | 13.8 | 29.1 | 28.2 | 24.7 | 1.8 | 68.5 | 51.3 | 49.8 | 30.7 | 38.8 | 44.7 | 46.4 | 40.1 | 48.7 | 40.4 | 25.1
SSA-SC [Yang et al., 2021]

Table 5.2: Quantitative results of semantic scene completion algorithms on the SemanticKITTI test set. The red, blue, and brown values denote the top-1, top-2, and top-3 results, respectively, across all methods. The results of all models, except our proposed model, are collected from their original papers. The FPS is reported only for the models that we successfully reproduced and ran on an NVIDIA RTX A6000.
Qualitative results
We provide visualizations on the SemanticKITTI validation set, as illustrated in Figure 5.1. The first column shows the input point cloud of the sequence, the second column shows the ground-truth label provided by the dataset for the input scan, and the last three columns show the predicted labels of our proposed method, our baseline SSA-SC, and another mixed 3D-2D model, SSC-RS.

Figure 5.1: Qualitative results of the proposed model on the SemanticKITTI validation set.
The initial scan (first row) indicates that our model exhibits high accuracy in the driving regions, even with noisy label data, and acceptable performance in the other regions. All three evaluated models tend to over-label the "pole" and "traffic-sign" classes (depicted in light yellow and red, respectively). One potential explanation for this phenomenon is that all three models effectively capture the contextual relationship between these two classes, leading them to infer the presence of a "traffic-sign" on a "pole" even in instances where only the "pole" is present. Regarding the noisy moving-object labels, marked with red circles, our proposed method demonstrates robust reconstruction of the "car" instance (illustrated in blue), while our baseline, SSA-SC, produces a noisy label. Although the SSC-RS model does not generate a noisy label, the shape of its "car" instance is inaccurate.
The second scan exemplifies the efficacy of the capsule network when supplemented with pose and direction information. For the regions demarcated with red circles, our proposed method accurately reconstructs and labels the "trunk" instances (depicted in brown), while the other models misclassify nearly all of these instances as "pole" and "traffic-sign." For the regions highlighted by cyan circles, although all three models can reconstruct the two "car" instances, our model is better at distinguishing between them. Furthermore, in the regions marked with green circles, our model does not over-reconstruct invalid voxels.
The third scan demonstrates the efficacy of the cylindrical partition in handling the varying density of the input point cloud. For the distant, sparse regions demarcated by red circles, our model successfully reconstructs all the "car" instances and even a small, occluded "other-vehicle" instance (depicted in dark blue), while the other models fail to predict this "other-vehicle" instance. This highlights the potency of the cylindrical partition in addressing the varying point distribution inherent to LiDAR sensors.