
Point cloud compression for humans and machines


DOCUMENT INFORMATION

Basic information

Title: Point Cloud Compression for Humans and Machines
Author: LÊ Quoc Anh
Supervisors: Giuseppe Valenzise, LÊ Vu Ha
University: Paris-Saclay University
Major: Electrical Engineering
Document type: Master Thesis
Year: 2024
City: Hanoi
Format
Pages: 52
Size: 1.39 MB


Structure

  • 1.1 Context
  • 1.2 Motivation, objective and contribution
    • 1.2.1 Motivation
    • 1.2.2 Objective
    • 1.2.3 Contribution
  • 1.3 Thesis organization
  • 2.1 Point Cloud Compression
    • 2.1.1 Traditional Approaches
    • 2.1.2 Learning-based Approaches
  • 2.2 Point Cloud Classification
  • 2.3 Coding for Humans and Machines
  • 2.4 Summary
  • 3.1 Base branch
  • 3.2 Synthesis transform for latent space representation of base branch (h_r(·))
  • 3.3 Enhancement branch
    • 3.3.1 Problem formulation
    • 3.3.2 The architecture of analysis and synthesis transform of the enhancement branch
  • 3.4 Entropy rate modeling
  • 4.1 Dataset
  • 4.2 Experiments
    • 4.2.1 Implementation
    • 4.2.2 Performance assessment
  • 4.3 Results
    • 4.3.1 Point cloud compression for classification
    • 4.3.2 Point cloud compression for humans
  • 4.4 Discussion
  • 5.1 Conclusions
  • 5.2 Future works
  • Figure 2.1 MinkPointNet architecture. The classification network takes a sparse tensor as input.
  • Figure 3.1 Overall architecture of the residual method in scalable point cloud coding for humans and machines. The dashed line indicates that the enhancement network has no influence on the base branch.
  • Figure 3.2 Proposed codec for dense point cloud compression for classification. The input of the codec is the latent space of PCGCv2.
  • Figure 3.3 The process of transforming the feature vector Z_b for use in the residual method via the synthesis transform block h_r(·).
  • Figure 3.4 The architecture of the analysis and synthesis transform of the enhancement branch, built upon the Residual Bottleneck Block (RBB) as the fundamental unit.
  • Figure 4.1 Example point cloud data from the ModelNet10 dataset with three resolutions (r): (a) 64, (b) 128, and (c) 256.
  • Figure 4.2 Rate-accuracy curves evaluated on the ModelNet10 dataset voxelized with three resolutions (r): (a) 64, (b) 128, and (c) 256. P is the number of points in the input X; "*" denotes a dense point cloud.
  • Figure 4.3 Rate-distortion curves on the ModelNet10 dataset: (left) D1-based PSNR, (right) D2-based PSNR.
  • Table 4.1 Statistics of the point cloud data in the selected ModelNet10 dataset.
  • Table 4.2 BD rate and maximum accuracies per codec evaluated on ModelNet10 voxelized with three resolutions. P is the number of points in the input X; "*" denotes a dense point cloud.
  • Table 4.3 BD-rate gains on ModelNet10 of the proposed codec against other compression methods.
  • Table 4.4 Average coding time of different methods on the ModelNet10 dataset.

Content


Context

Extended Reality (XR) technologies have advanced significantly, impacting daily life. Point clouds, a common 3D data type, are crucial for XR applications like virtual reality (VR) and mixed reality (MR). Point clouds consist of discrete points with spatial coordinates (x, y, z) and attributes like color and normals. They can be classified as static or dynamic based on the inclusion of a temporal dimension.

In most applications, the number of points in realistic and immersive point clouds easily ranges in the order of millions, and the attributes are also complex. Therefore, transmitting uncompressed point cloud data would exceed the bandwidth of most current communication systems; for example, streaming a dynamic point cloud without compression requires a bandwidth of 3.6 Gbps [7]. Thus, effective and efficient point cloud compression methods are essential to enable the use of point clouds in practical applications.

XR, encompassing Virtual Reality (VR), Mixed Reality (MR), and Augmented Reality (AR), has emerged as a crucial technology for immersive applications. To facilitate efficient data transmission, point cloud compression has become a focus of standardization efforts by the Moving Picture Experts Group (MPEG). MPEG has proposed two standards: Geometry-based Point Cloud Compression (G-PCC) and Video-based Point Cloud Compression (V-PCC). Both leverage traditional approaches: octree decomposition or triangulated surface models for G-PCC, and 3D-2D projection for V-PCC.

Recent advancements in deep learning have led to powerful image/video compression methods that outperform traditional techniques like JPEG2000 and rival the efficiency of HEVC. Building upon this success, researchers have extended deep learning to point cloud compression, achieving significant improvements over the MPEG standards.

Traditional point cloud compression focuses on visual reconstruction for humans, but the increasing use of point clouds in machine tasks like scene understanding and self-localization necessitates compression methods optimized for machine perception. This requires improving existing methods or developing new ones that prioritize machine-readable information while still supporting "human in the loop" scenarios where visual content matters to human users.

Motivation, objective and contribution

Motivation

Point cloud classification is crucial for various applications, including surveillance, autonomous driving, environmental monitoring, and augmented reality. It identifies and labels objects within point clouds, enhancing security in surveillance, enabling scene understanding for autonomous vehicles, and facilitating terrain analysis in environmental monitoring.

Augmented reality experiences rely on accurate object classification for realistic virtual object placement and interaction. In real-world applications, where bandwidth and storage are limited, compressing point clouds efficiently is crucial for meeting the specific requirements of the application.

Precise geometric reconstruction is crucial for high-quality rendering and interactive point cloud applications. While lossless compression methods struggle to achieve efficient compression (around 1 bit per voxel), lossy compression offers significant benefits by reducing storage and transmission costs for large point clouds, particularly when tailored to specific rate/quality requirements.

Dense point clouds are needed for human visualization to achieve satisfactory rendering quality [20]. However, most studies on machine vision tasks, such as point cloud classification, use sparse point clouds [21, 22]. This is because sparse point clouds contain enough information to achieve high performance on machine vision tasks while significantly reducing computation costs.

The JPEG Pleno PCC projects, initiated in response to the potential of deep learning-based coding, aim to develop a new coding standard for point clouds and their attributes. This standard will leverage deep learning techniques to create a single, compact compressed-domain representation that supports advanced data access functionalities. The project targets both human visualization, aiming for high compression efficiency, and effective performance for 3D processing and machine vision tasks.

In collaborative scenarios [24], where edge devices send data to the cloud for processing, transmitting only the information essential for the tasks is efficient. Generating distinct representations for each subset of tasks becomes impractical as the number of tasks increases. Additionally, if the information for some tasks has already been sent and a broader set of tasks is later needed for the same input, sending a new full representation would involve redundant information [25]. Thus, organizing task information in a scalable fashion is an efficient solution [26]: base representations are shared among tasks, and only extra information is needed for more specific tasks.

Objective

Building upon the motivations outlined above, this thesis explores a learning-based coding solution for static 3D point clouds, aiming to generate representations suitable for both machine vision tasks and human visualization. Our initial focus is on investigating dense point cloud geometry data. Subsequently, we optimize the performance of point cloud classification within an appropriate bitrate range. Lastly, leveraging a scalable approach, we aim to efficiently distribute the transmitted information between two applications: point cloud geometry compression and point cloud classification.

Contribution

Our key contributions can be summarized as follows:

  • This thesis presents a novel codec architecture specifically designed for classifying dense point clouds. Its compression efficiency is evaluated on a voxelized ModelNet10 dataset at various resolutions, revealing its adaptability and performance characteristics.

  • This thesis introduces a scalable framework for both compressing dense point cloud geometry and classifying point clouds. The framework leverages effective bit distribution, making it versatile and widely applicable. We evaluated its performance by comparing it with leading methods in the field, demonstrating its effectiveness.

Thesis organization

This thesis investigates point cloud geometry compression and classification, presenting a novel scalable framework for both tasks. Chapter 2 reviews the existing literature and explores emerging coding trends, while Chapter 3 details the proposed framework. Chapter 4 presents the framework's performance through experiments and results, and Chapter 5 concludes the thesis with suggestions for future research.

Background and state-of-the-art

In this chapter, we provide an overview of state-of-the-art methods for point cloud geometry compression and point cloud classification. After that, we review the recently emerged topic of coding for humans and machines.

Point Cloud Compression

Traditional Approaches

Point cloud compression methods are traditionally categorized into 1D traversal, 2D projection, and 3D decorrelation methods. 1D traversal methods exploit geometric correlations to predict neighboring points, as demonstrated by Gumhold et al.'s (2005) tree-based prediction approach. 2D projection methods convert 3D data into 2D images or videos, leveraging image or video encoding techniques for compression.

3D decorrelation methods leverage the full 3D correlations in point cloud data, utilizing techniques such as tree-based, level-of-detail-based, and transform-based approaches. Despite the variety of existing point cloud compression methods, a comprehensive summary and objective evaluation were needed to establish an effective compression standard.

MPEG has been developing point cloud compression standards since 2014, resulting in two finalized standards: Video-based Point Cloud Compression (V-PCC) and Geometry-based Point Cloud Compression (G-PCC). V-PCC uses a 3D-to-2D projection approach together with image/video coding methods, making it a short-term solution but currently the best choice for dense point clouds. G-PCC, a long-term solution, leverages an octree/trisoup geometry codec to address the 3D information directly. MPEG has also initiated the development of G-PCC v2 with improved compression tools. This section focuses on analyzing the G-PCC method.

G-PCC (Geometry-based Point Cloud Compression) is a compression standard proposed by MPEG [35]. The G-PCC standard provides a compression framework for various types of point clouds, consisting of two main components: the encoder and the decoder. In this study, our focus lies on geometry coding; therefore, we concentrate on the blocks used for geometry coding. The G-PCC encoder comprises three key components: preprocessing, a transformation block, and entropy coding. In the preprocessing step, the coordinates are transformed using a scaling parameter, followed by point quantization and removal of duplicate points. The transformation block, a crucial component, converts the preprocessed data into an efficient representation that enables more effective compression; two methods are provided, octree analysis and predictive tree encoding. Finally, the entropy coding block performs the final encoding of the transformed data. Entropy coding techniques, such as Huffman coding or arithmetic coding, are commonly used to achieve high compression rates by exploiting statistical redundancies in the data. The decoder likewise provides three main components for reconstructing point cloud data from compressed representations. Detailed information on G-PCC can be found in the MPEG standard documents [35, 36].
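To make the preprocessing step concrete, the sketch below expresses scaling, quantization, and duplicate removal in NumPy; the function name and the example scale value are illustrative choices of ours, not taken from the G-PCC reference software.

```python
import numpy as np

def gpcc_style_preprocess(points: np.ndarray, scale: float) -> np.ndarray:
    """Illustrative sketch of G-PCC-style geometry preprocessing:
    scale the coordinates, quantize them to integers, and remove
    duplicate points. Not the reference implementation."""
    scaled = points * scale                         # coordinate scaling
    quantized = np.round(scaled).astype(np.int64)   # point quantization
    deduped = np.unique(quantized, axis=0)          # duplicate-point removal
    return deduped

# Example: quantize a small random cloud with a coarse scale factor.
pc = np.random.rand(1000, 3) * 100.0
voxels = gpcc_style_preprocess(pc, scale=0.5)
```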

Learning-based Approaches

Recent studies have leveraged learned models for point cloud compression, drawing inspiration from image data compression. Voxel-based methods, utilizing variational autoencoders or autoencoders, have been widely explored for point cloud geometry compression. Recognizing the sparse nature of point clouds, research has also explored point-based approaches to optimize computational complexity.

Learning-based point cloud compression methods utilize various input representations, including voxels, octrees, and point-wise representations. Octree-based approaches, like those presented in [40, 41], offer high compression performance, exceeding the G-PCC standard while maintaining quality comparable to V-PCC [38, 39].

Octree representation, commonly used for lossless compression, can be adapted to lossy compression by quantizing the geometry before compression. While Huang et al. leverage MLPs to predict occupancy probabilities, this approach faces challenges such as the need for additional bits to represent deep tree hierarchies and inefficient resource usage due to the equal treatment of empty and occupied voxels.

Point-wise compression directly compresses the raw points of the input point cloud, often using point-based backbones like PointNet or PointNet++ and MLP-based autoencoders for efficiency. While this approach excels at reducing redundant calculations, it struggles with high bit rates and with generalizing to large-scale point clouds.

Voxel representation, which converts point cloud data into a 3D grid, excels at capturing spatial correlations, making it well suited to point cloud compression. This approach leverages existing 3D image compression techniques and enables efficient capture of geometric detail. Three prominent voxel-based methods, GEO-CNNv1, GEO-CNNv2, and PCGCv1, are reviewed next.

GEO-CNNv1 is a groundbreaking study that utilizes Convolutional Neural Networks (CNNs) for compressing point cloud geometry. The approach treats the point cloud as a binary signal within a 3D voxel grid. The process starts by voxelizing the input point cloud X at a specific resolution r. The voxelized point cloud is then viewed as a set of points S within the voxel grid Ω_r ≜ [0, r]³.

The input to the model is v_S, the voxel-grid representation of S. An autoencoder architecture transforms the input v_S into a lower-dimensional latent space representation Y and reconstructs an output X̂ equivalent to the input. A quantization block is used in the bottleneck of the autoencoder; this quantization step is essential for efficient entropy coding. In the testing phase, a simple rounding operation can be used for quantization (i.e., Ŷ = ⌊Y⌉). However, differentiability in backpropagation must be addressed in the training phase [8]; therefore, an approximate rounding operation is used by adding uniform noise to the latent space. Decoding is then treated as a binary classification problem in which each point z ∈ Ω_r is either present or not. The output of the decoder, v̂_S, is decomposed into individual voxels z with probability values p_z. To train the model, the focal loss is utilized to handle the class imbalance problem, since the number of empty voxels is often much larger than the number of occupied voxels. In this study, the α-balanced focal loss is used and can be defined as follows:

FL(v_S) = −Σ_{z ∈ Ω_r} α_z (1 − p_z^t)^γ log(p_z^t),   (2.1)

where the focal loss parameter γ differentiates between easy and hard examples during training, and p_z^t and α_z are defined as p_z and α if v_S(z) = 1; otherwise, they are defined as 1 − p_z and 1 − α.
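For concreteness, here is a small PyTorch sketch of Eq. (2.1) as reconstructed above; the tensor shapes and the α and γ values are illustrative, not the thesis settings.

```python
import torch

def alpha_balanced_focal_loss(p, occupancy, alpha=0.7, gamma=2.0):
    """Sketch of the alpha-balanced focal loss of Eq. (2.1).
    p:         predicted occupancy probabilities for every voxel z in Omega_r
    occupancy: binary ground-truth grid v_S (1 = occupied, 0 = empty)."""
    p_t = torch.where(occupancy == 1, p, 1.0 - p)            # p_z^t
    alpha_t = torch.where(occupancy == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1.0 - alpha))   # alpha_z
    loss = -alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t.clamp_min(1e-9))
    return loss.sum()

# Toy usage on a random 8x8x8 occupancy grid.
probs = torch.rand(8, 8, 8)
occ = (torch.rand(8, 8, 8) > 0.9).float()
distortion = alpha_balanced_focal_loss(probs, occ)
```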

The rate R, measured in bits per point (bpp), is used in the final training loss to constrain the bit rate. In the training phase, the entropy of the quantized latent space Ŷ is calculated using a differential entropy approximation and defines R. The final loss function then combines distortion and rate as L = λD + R, where λ controls the rate-distortion trade-off.

GEO-CNNv2 builds upon GEO-CNNv1, introducing several enhancements for point cloud compression. Key improvements include a deeper transform, an improved hyperprior model for entropy modeling, optimized focal loss weights, an optimal decoding threshold, and sequential training for reduced computation costs. An octree partitioning algorithm accelerates point cloud partitioning, while the training process uses the same rate-distortion control mechanism as GEO-CNNv1 with the λ parameter.

PCGCv1 leverages a Variational Autoencoder (VAE) architecture, similar to GEO-CNNv2, for efficient point cloud geometry compression. The process involves voxelizing and partitioning the point cloud into non-overlapping 3D cubes, which are then fed into a 3D convolutional network. This network generates compact latent space representations and hyperpriors, which enhance the conditional entropy model for efficient latent space coding. The architecture employs the Voxception-ResNet (VRN) structure, combining ResNet blocks and an Inception block, for improved coding efficiency: VRN addresses vanishing gradients through ResNet and extracts features at different scales using Inception. For model training, PCGCv1 uses a similar approach to GEO-CNNv2 but employs Weighted Binary Cross-Entropy (WBCE) instead of focal loss to compute distortion. WBCE is defined as follows:

WBCE = (1/N_o) Σ −log(p_{x_o}) + α · (1/N_n) Σ −log(1 − p_{x_n}),   (2.3)

where p_{x_o} and p_{x_n} represent the probability values of the occupied voxels and null voxels, respectively, N_o and N_n are the numbers of occupied and null voxels, and the hyperparameter α is used to balance the loss penalty between positive and negative samples.
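A minimal PyTorch sketch of this WBCE distortion, assuming per-voxel occupancy probabilities and an illustrative value of α, might look as follows:

```python
import torch

def weighted_bce(p, occupancy, alpha=3.0):
    """Sketch of the WBCE distortion of Eq. (2.3): the BCE terms of
    occupied and null voxels are averaged separately and balanced by alpha.
    The value of alpha here is illustrative only."""
    occ = occupancy.bool()
    # (1/N_o) * sum of -log p_{x_o} over occupied voxels
    loss_occupied = -torch.log(p[occ].clamp_min(1e-9)).mean()
    # (1/N_n) * sum of -log(1 - p_{x_n}) over null voxels
    loss_null = -torch.log((1.0 - p[~occ]).clamp_min(1e-9)).mean()
    return loss_occupied + alpha * loss_null
```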

Sparse tensor processing: Recently, this approach has been employed in point cloud compression. Sparse tensors, consisting of coordinate information (C) and associated features (F), are leveraged to represent point cloud data efficiently. Sparse tensor processing is crucial for managing the irregular and sparse nature of point clouds, where only a small number of points in space contain data. This method reduces computational complexity by focusing on the coordinates and features of occupied voxels, enhancing the overall efficiency and effectiveness of point cloud compression. Next, we provide more detail about a method based on this approach, PCGCv2 [44].

PCGCv2 significantly improves upon its predecessor, PCGCv1, by optimizing point cloud geometry compression for both efficiency and inference time. This advancement is achieved through a multiscale neural network that progressively re-samples the point cloud for reconstruction, coupled with a sparse convolution-based autoencoder network. This network, built upon Inception-Residual Network (IRN) units, efficiently extracts features while reducing processing time and memory usage. The model uses binary classification with hierarchical reconstruction to measure distortion, classifying occupied points in the generated voxels using a binary cross-entropy loss.

The per-voxel distortion is the binary cross-entropy −(x_i log(p_i) + (1 − x_i) log(1 − p_i)), where x_i is the voxel label indicating occupancy or nullity and p_i is the probability of the voxel being occupied. The final distortion loss D sums this BCE loss over the voxels at each of the M scale outputs of the encoder.

Point Cloud Classification

Point cloud classification involves assigning labels to individual point cloud objects, such as buildings, trees, and cars. Two common approaches for feature extraction in point cloud classification are handcrafted features and deep learning-based methods [12].

While handcrafted features can capture relevant characteristics of point cloud data, their effectiveness is dataset-specific. Deep learning methods, which learn features directly from raw data, often outperform handcrafted approaches, especially on complex datasets. This section reviews two common data handling approaches in point cloud classification: point-based and sparse tensor-based methods.

Point-based approaches analyze each data point in a point cloud individually, leveraging neural networks like PointNet and PointNet++ to characterize them.

  • PointNet is a neural network architecture specifically designed for analyzing point cloud data. To ensure invariance to transformations like rotations and translations, PointNet applies a Multi-Layer Perceptron (MLP) with shared weights to each input point. This shared MLP structure allows PointNet to learn local features independently for each point. These local features are then aggregated to capture global patterns within the point cloud, ultimately generating a global feature vector. This global feature vector serves as a representation of the entire point cloud, enabling tasks like 3D object classification.

  • PointNet++ is an extension of the original PointNet architecture, designed to address some of its limitations and enhance processing capabilities for point cloud data [22]. PointNet++ introduces a hierarchical structure to capture local features at multiple scales. It uses set abstraction and feature propagation layers, allowing the network to focus on both fine-grained and coarse-grained details in the point cloud. Set abstraction layers are used to sample representative points at different scales; each set abstraction layer includes a local point grouping mechanism and an MLP (Multi-Layer Perceptron) to generate features for each group. This process helps to capture local structures effectively. The hierarchical structure of PointNet++ generates intermediate features at different levels of abstraction, which are crucial for capturing local details while maintaining a global understanding of the point cloud.

Sparse convolution-based methods leverage sparse tensors to process point cloud data efficiently, reducing memory usage and accelerating computation. This approach is particularly advantageous for point cloud classification; our architecture is based on PointNet and uses the MinkowskiEngine library for optimized sparse tensor operations.

[Figure 2.1 diagram: Input PC → conv1–conv5 (sparse Conv {64, 64, 64, 128, 1024}×3³, each followed by BatchNorm + ReLU) → Max pooling → linear1 (MinkowskiLinear N×1024 → N×512, BatchNorm + ReLU) → linear2 (N×512 → N×256, BatchNorm + ReLU, Dropout) → linear3 (N×256 → N×10).]

Figure 2.1: MinkPointNet architecture. The classification network takes a sparse tensor as input.

  • MinkPointNet is inspired by the PointNet architecture, using convolution operators for feature extraction and a set of MLP layers for classification. Figure 2.1 displays the architecture. The input is a sparse tensor X consisting of coordinates C_X, representing the coordinate values of points within the point cloud data, and features F_X containing similar information to C_X; however, F_X is normalized by subtracting the mean value and dividing by the standard deviation along each axis. The tensor X passes through a series of encoder blocks, each comprising a sparse convolution layer, a batch-normalization layer, and a ReLU activation. Following these encoder blocks, a max-pooling block generates a feature vector. The feature vector then passes through an MLP block containing a fully connected layer followed by a batch-normalization layer and a ReLU activation. Subsequently, a dropout layer is applied to prevent overfitting. Finally, the network output is a vector of logits with the size of the number of classes; a code sketch follows.
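The MinkowskiEngine sketch below mirrors Figure 2.1; the channel sizes come from the diagram, while the dropout rate and other hyperparameters are placeholders of ours rather than the thesis settings.

```python
import torch.nn as nn
import MinkowskiEngine as ME

class MinkPointNetSketch(nn.Module):
    """Sketch of MinkPointNet following Figure 2.1 (channel sizes from the
    diagram; exact strides and kernels in the thesis code may differ)."""
    def __init__(self, num_classes=10):
        super().__init__()
        chans, blocks, in_c = [64, 64, 64, 128, 1024], [], 3
        for out_c in chans:  # conv1..conv5, each sparse Conv 3^3 + BN + ReLU
            blocks += [ME.MinkowskiConvolution(in_c, out_c, kernel_size=3, dimension=3),
                       ME.MinkowskiBatchNorm(out_c), ME.MinkowskiReLU()]
            in_c = out_c
        self.encoder = nn.Sequential(*blocks)
        self.pool = ME.MinkowskiGlobalMaxPooling()   # per-cloud feature vector
        self.head = nn.Sequential(
            ME.MinkowskiLinear(1024, 512), ME.MinkowskiBatchNorm(512), ME.MinkowskiReLU(),
            ME.MinkowskiLinear(512, 256), ME.MinkowskiBatchNorm(256), ME.MinkowskiReLU(),
            ME.MinkowskiDropout(0.3),                # dropout rate is illustrative
            ME.MinkowskiLinear(256, num_classes))    # logits

    def forward(self, x: ME.SparseTensor):
        return self.head(self.pool(self.encoder(x)))
```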

Coding for Humans and Machines

Visual content is essential for both human and machine vision in the digital age. Modern image and video compression methods increasingly target both reconstruction quality and efficiency for machine vision tasks, a concept known as "coding for humans and machines."

Recent research in image and video coding utilizes neural networks for compression, aiming to balance high-quality reconstruction with improved performance on machine vision tasks. Choi and Bajić (2022) developed an end-to-end learned codec that optimizes information allocation for specific tasks, with a scalable architecture encoding the latent space for varying task complexities. Their 2-task model employs separate bitstreams for base and enhancement layers, allowing computer vision tasks to use only the base layer while reconstruction requires both layers. This channel separation, however, introduces redundancy in the representation used for reconstruction.

Andrade et al. (2023) proposed two efficient coding techniques for the reconstruction task: residual and conditional coding. Conditional coding uses an enhancement representation containing all the reconstruction information, with the base representation helping to resolve uncertainty during coding. In contrast, residual coding subtracts the base representation from the enhanced representation before coding and adds it back after decoding. Both approaches exhibit comparable performance.

Point cloud compression techniques, inspired by image/video coding, aim to optimize for both human and machine vision. Methods like that of Liu et al. (2023) use different compression strategies depending on the task, but suffer from redundancy when the tasks are served simultaneously. Ma et al. (2023) address this by aggregating multitask features, achieving improved compression efficiency but still generating redundant bits for machine vision tasks. These methods, though promising, require further optimization of bit-rate efficiency.

This approach employs nonspecialized codecs to classify point clouds. The point cloud is first encoded into a bitstream, then decoded to reconstruct the cloud. The reconstructed point cloud is fed into a classifier, such as PointNet or PointNet++, to obtain the class prediction.

Recent research has explored the use of standard codecs for compressing latent features in machine vision tasks. Ulhaq and Bajić (2023) introduced a point cloud compression codec tailored for single machine vision tasks, drawing inspiration from the Information Bottleneck concept to define the loss function and network architecture.

  • Learned point cloud compression for classification [50] is introduced to address the limited computational capabilities of end devices. Based on the concept of the Information Bottleneck [51], the authors proposed an architecture consisting of three main parts: an analysis transform, an entropy coding model, and a synthesis transform. In this architecture, the input point cloud X is encoded by the analysis transform block g_a to generate the latent space Y = g_a(X). The latent space is then quantized as Ŷ = Q(Y) and passed through a learned entropy coder to obtain the bitstream. The bitstream is decoded to recover the reconstructed latent space. Finally, the reconstructed latent space is used as the input of the synthesis transform block to predict the classes, T̂ = g_s(Ŷ). A sketch of this pipeline follows.
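The sketch below shows the g_a → quantization/entropy-coding → g_s pipeline using compressai's EntropyBottleneck; the placeholder MLP transforms and all dimensions are our illustrative choices, not the architecture of [50].

```python
import torch
import torch.nn as nn
from compressai.entropy_models import EntropyBottleneck

class ClassificationCodecSketch(nn.Module):
    """Sketch of the pipeline in [50]: analysis transform g_a, quantization
    plus learned entropy coding, and synthesis transform g_s predicting
    class logits. Transforms here are placeholder MLPs."""
    def __init__(self, in_dim=3 * 1024, latent_dim=128, num_classes=10):
        super().__init__()
        self.g_a = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, latent_dim))       # Y = g_a(X)
        self.entropy_bottleneck = EntropyBottleneck(latent_dim)    # Q + rate model
        self.g_s = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, num_classes))      # T_hat = g_s(Y_hat)

    def forward(self, x):                       # x: (batch, 1024, 3) points
        y = self.g_a(x.flatten(1))
        y_hat, y_lik = self.entropy_bottleneck(y.unsqueeze(-1).unsqueeze(-1))
        rate = -torch.log2(y_lik).sum() / x.shape[0]   # estimated bits per sample
        logits = self.g_s(y_hat.squeeze(-1).squeeze(-1))
        return logits, rate
```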

Summary

Through the literature review, we have summarized some key information that serves as evidence for the proposed scalable point cloud coding for humans and machines.

This study explores the use of sparse tensors for compressing point cloud geometry, with a focus on efficient human visualization. While voxel-based methods currently offer the best performance, they face computational limitations. Sparse tensors, on the other hand, can process the entire point cloud at once, significantly reducing computational costs compared to 3D convolution-based methods. We leverage the architecture of PCGCv2, a network based on sparse tensor processing, as the main codec for our approach.

Point cloud classification is a critical task in machine vision, and many existing methods rely on sparse point clouds with a limited number of points.

This research explores the application of the residual method for multi-task scalability in visual content compression, focusing specifically on point cloud classification and geometry compression. We propose a framework that leverages the residual method's effectiveness to efficiently generate representations suitable for both machine vision and human visualization tasks. While existing methods like PointNet already show acceptable performance, our framework demonstrates modest improvements (2-3%) in performance, indicating its potential for optimizing these tasks.

Proposed residual method in scalable point cloud coding for humans and machines

This chapter introduces a scalable point cloud coding framework built on the residual method, designed for both human visualization and point cloud classification. The framework consists of two branches: a base branch optimized for classification performance and an enhancement branch that leverages the residual mechanism to generate an efficient bitstream for reconstructing dense point cloud geometry.

[Figure 3.1 example output — reconstructed point cloud with predicted class probabilities: bathtub 1%, bed 1%, chair 95%, table 1%.]

Figure 3.1 illustrates the architectural design of the residual method used in scalable point cloud coding for humans and machines. The enhancement network operates independently of the base network, as indicated by the dashed line, emphasizing their distinct roles in the coding process.

The diagram in Figure 3.1 illustrates the scheme of the proposed method. In this scheme, leveraging the efficiency of PCGCv2 for the point cloud geometry compression task, we retain the network architecture of PCGCv2 to generate the latent space information. Subsequently, we apply the residual method to efficiently distribute bits between two tasks: point cloud classification (base branch) and point cloud geometry compression (both base branch and enhancement branch).

Base branch

To optimize point cloud compression efficiency for classification tasks, we leverage the latent space of PCGCv2, which can reconstruct dense point clouds. This latent space contains sufficient information for classification and offers a downsampled representation of the original data, reducing computational complexity. Consequently, we develop a task-specific codec that uses the latent space as its input.

Based on the work of Ulhaq and Bajić (2023) [50], the Information Bottleneck concept is commonly employed in designing an efficient task-specific codec for a task T and is defined as follows [51]:

min_{p(Ŷ|X)} I(X; Ŷ) − β · I(Ŷ; T̂),   (3.1)

where p(Ŷ|X) is the mapping from the input X to the latent representation Ŷ, I(·;·) is the mutual information, and β > 0 is the Information Bottleneck Lagrange multiplier. In this study, instead of using the input point cloud X directly, we use the latent space Y as the input of the task-specific codec, which produces a latent representation Z at the bottleneck. Z is then quantized to obtain Ẑ, and we modify the Information Bottleneck principle to minimize over the conditional distribution p(Ẑ|Y).

Understanding p(Ẑ|Y) as a process of feature extraction followed by quantization implies that knowledge of Y fully determines Ẑ: there is no remaining uncertainty or randomness in Ẑ once Y is known, so Ẑ is a deterministic function of Y.

The Information Bottleneck objective minimizes the mutual information between the input Y and the compressed representation Ẑ while maximizing the mutual information between Ẑ and the task output T̂. Since Ẑ is a deterministic function of Y, the conditional entropy H(Ẑ|Y) is zero and I(Y; Ẑ) = H(Ẑ); minimizing the first term therefore amounts to minimizing the entropy of Ẑ, i.e., the rate needed to encode it. For the second term, a classification loss D(T, T̂) serves as a proxy for the mutual information between Ẑ and T̂, so optimizing the model for higher classification accuracy maximizes the task-relevant information retained in Ẑ.

The resulting objective mirrors the loss function used in learnable compression methods, which inspires the design of our task-specific codec, detailed next.

[Figure 3.2 diagram: Latent space → Conv 64×3³ → Conv 64×3³ → Conv 64×3³ → Conv 128×3³ → Conv 1024×3³ → Max pooling.]

Figure 3.2: Proposed codec for dense point cloud compression for classification. The input of the codec is the latent space of PCGCv2.

Figure 3.2 illustrates the architecture of the proposed codec. The classification backbone is designed based on the MinkPointNet architecture presented previously, and the last layer of the backbone is a global max-pooling block that generates the feature vector. Subsequently, the feature vector is multiplied element-wise by a trainable gain vector v ∈ R^{1024×1}, similar to the approach employed by Ulhaq and Bajić [50]. The feature vector is also multiplied by a constant scalar value of 10 for improved stability and convergence. On the decoder side, a set of pointwise convolutional layers with kernel size 1 creates a block equivalent to the "shared MLP" used in PointNet [21].
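A compact sketch of this gain-vector bottleneck follows; the module name and the all-ones initialization are our choices.

```python
import torch
import torch.nn as nn

class GainedBottleneck(nn.Module):
    """Sketch of the bottleneck of Figure 3.2: the pooled 1024-d feature is
    multiplied element-wise by a trainable gain vector v, then scaled by a
    constant factor of 10 for training stability, as described above."""
    def __init__(self, dim=1024, scale=10.0):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(dim))  # v in R^{1024x1}, trainable
        self.scale = scale                         # constant scalar

    def forward(self, feat):                       # feat: (batch, 1024)
        return feat * self.gain * self.scale
```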

Synthesis transform for latent space representation of base branch (h_r(·))

To integrate the base branch's 1024×1 feature vector Z_b with the latent space features Y, we utilize a residual method. This requires resizing Z_b to match Y's dimensions. We achieve this by applying a repetition technique that restores the feature vector's size to what it was before the max-pooling layer. Subsequently, we use sparse convolutional layers, similar to those of the classification backbone, to generate the base branch's latent space representation.

[Figure 3.3 diagram: Conv 128×3³ → Conv 64×3³ → Conv 64×3³ → Conv 64×3³ → Conv 8×3³.]

Figure 3.3: The process of transforming the feature vector Z_b for use in the residual method via the synthesis transform block h_r(·).

The diagram in Figure 3.3 illustrates the process of transforming the feature vector Z_b to adapt it for the residual method. In the classification backbone module, strided sparse convolutions downsample the input, and the global max-pooling layer collapses the per-point features into a single vector. Because global max pooling preserves the strongest responses, we can modify the size of the feature vector Z_b while preserving its original information: by repeating each value in Z_b according to the size of the latent space Y, we create a feature of size 1024 × P_Y, effectively restoring the feature's size before the global max-pooling layer. The repeated features are then combined with the coordinates C_Y to create a sparse tensor, which is fed into a set of sparse convolutions to generate the latent representation of the base branch.
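In code, the repeat step can be sketched as follows; the MinkowskiEngine sparse-tensor assembly is indicated in a comment rather than executed here.

```python
import torch

def expand_base_feature(z_b: torch.Tensor, num_points: int) -> torch.Tensor:
    """Sketch of the 'repeat' step of Figure 3.3: each value of the pooled
    feature vector Z_b (1024x1) is repeated P_Y times so the result matches
    the per-point feature size before global max pooling (1024 x P_Y)."""
    return z_b.view(-1, 1).expand(-1, num_points)   # (1024, P_Y), no copy

# The expanded features would then be attached to the coordinates C_Y of the
# latent space (e.g., via a MinkowskiEngine SparseTensor) and passed through
# the sparse convolutions of h_r(.).
z_b = torch.randn(1024)
feats = expand_base_feature(z_b, num_points=2048)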

Enhancement branch

Problem formulation

This section formulates the two compression problems for a latent space Y: the base representation Z_b and the enhancement representation Z_r. The base representation minimizes the distortion of a specific machine vision task T, using a compression method and a learnable decoding function g_b. The enhancement representation minimizes the distortion for human visualization, using a compression method, a learnable decoder g_e, and a residual scheme that subtracts the base reconstruction Y_b from the original latent space Y.
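The residual scheme itself reduces to two lines; this sketch (with function names of our choosing) shows the encoder- and decoder-side operations on the latent space:

```python
import torch

def residual_encode(y: torch.Tensor, y_b: torch.Tensor) -> torch.Tensor:
    """Residual scheme sketch: the enhancement branch codes only what the
    base reconstruction y_b misses from the latent space y."""
    return y - y_b          # residual sent through the enhancement codec

def residual_decode(y_b: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """At the decoder, the decoded residual is added back to the base."""
    return y_b + r_hat
```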

The architecture of analysis and synthesis transform of the enhancement branch

[Figure 3.4 diagram (RBB): Conv 32×1³ → ReLU → Conv 32×3³ → ReLU → Conv 32×1³ → ReLU.]

Figure 3.4: The architecture of the analysis and synthesis transform of the enhancement branch is built upon the Residual Bottleneck Block (RBB) as the fundamental unit.

The architecture used for the learned scalable compression, introduced in the study by Andrade et al. (2023) [25], is a simplified version of ELIC [52]. In this architecture, stacked residual blocks are employed as a nonlinear transform instead of generalized divisive normalization (GDN). He et al. (2022) [52] demonstrated that the use of stacked residual blocks improves performance by enhancing nonlinearity; stacking residual blocks also contributes to better scalability in model profiling. In this study, we adopt a simplified architecture based on the work of Andrade et al. (2023) [25]. Figure 3.4 illustrates the detailed architecture of the analysis and synthesis blocks.
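A dense-convolution stand-in for the RBB unit of Figure 3.4 is sketched below; the skip connection is implied by the block's name (the thesis operates on sparse tensors instead of dense 3D grids).

```python
import torch.nn as nn

class ResidualBottleneckBlock(nn.Module):
    """Sketch of the RBB of Figure 3.4: 1x1 conv -> 3x3 conv -> 1x1 conv with
    ReLUs and a skip connection. A dense Conv3d stand-in for the sparse
    convolutions used in the thesis."""
    def __init__(self, channels=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=1), nn.ReLU(),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(channels, channels, kernel_size=1), nn.ReLU())

    def forward(self, x):
        return x + self.body(x)   # residual (skip) connection
```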

Entropy rate modeling

Entropy coding plays a crucial role in compression by exploiting the statistical redundancy of the source. Among the various methods, arithmetic coding stands out for its good performance and is widely adopted in standards and products. We therefore use arithmetic coding to compress each element of the quantized latent representation.

The entropy bound of a source symbol, such as a latent space representation, is directly related to its probability distribution. This connection is crucial for accurate rate estimation, which in turn is essential for rate-distortion optimization. To estimate the bit rate of the quantized latent representation Ŷ, we can use the approximation

R = E[−log₂ p_Ŷ(Ŷ)],   (3.4)

where p_Ŷ is the probability density function (PDF) of Ŷ.

The rate calculation differs between training and evaluation. During evaluation, the rate is determined by the bit count of the quantized latent representation, which follows a discrete probability distribution. During training, in contrast, uniform noise is added in place of quantization, enabling differentiation and leading to a continuous probability distribution. Consequently, training employs differential (continuous) entropies, while evaluation uses discrete entropies.
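This train/eval asymmetry can be sketched as follows, assuming `pdf` is a learned density model returning p_Ŷ(ŷ) per element (a placeholder of ours):

```python
import torch

def estimate_rate(y: torch.Tensor, pdf, training: bool) -> torch.Tensor:
    """Sketch of Eq. (3.4). During training, uniform noise stands in for
    rounding so the rate stays differentiable; at evaluation, hard rounding
    gives the discrete symbols that are actually entropy-coded."""
    if training:
        y_hat = y + torch.empty_like(y).uniform_(-0.5, 0.5)  # noisy proxy
    else:
        y_hat = torch.round(y)                               # real quantization
    bits = -torch.log2(pdf(y_hat).clamp_min(1e-9)).sum()     # -sum log2 p(y_hat)
    return bits
```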

This research was conducted on a workstation running the Ubuntu 20.04 operating system, equipped with an Intel Core i9-10900K CPU, 64 GB of RAM, and an RTX 3090 GPU with 24 GB of VRAM. The point cloud compression frameworks were implemented in a Python 3.8 environment, using the PyTorch 1.8.1 deep learning framework with CUDA 11.1.

Dataset

ModelNet10, a popular dataset for 3D object classification, is a subset of the larger ModelNet dataset and contains 3D CAD models of 10 object categories. The dataset comprises 4,899 object models, divided into a training set of 3,991 objects and a testing set of 908 objects. The raw data format of ModelNet10 is OFF (Object File Format), which represents mesh objects.

This study uses the ModelNet10 dataset because its labeled point cloud data is crucial for evaluating classification performance. ModelNet10's raw mesh format allows flexible point cloud conversion, aligning with the research focus on point cloud compression. Additionally, its status as a subset of the larger ModelNet dataset facilitates quick testing and streamlined framework development.

This research uses dense point cloud geometry for evaluation, following common practice in point cloud compression where 5 × 10⁵ points are randomly sampled on the mesh surfaces, yielding a dense point cloud. Resolution, as highlighted by Wang et al., is another crucial parameter in point cloud geometry compression studies.

For computational efficiency, especially in resource-limited environments, point cloud data is often represented using voxels with resolutions (r) of 256 or lower. To explore the impact of resolution, we preprocessed the ModelNet10 dataset, sampling 5 × 10⁵ points at three resolutions (r = 64, 128, and 256) for our analysis.

Figure 4.1: Example point cloud data from ModelNet10 dataset with three resolutions (r): (a) 64, (b) 128, and (c) 256.

While evaluating performance on the entire ModelNet10 dataset offers a comprehensive view, it is time-consuming. To streamline evaluation, we selected a subset of 50 mesh objects (5 per class) for the human visualization benchmark. This subset was randomly chosen from the generated point cloud data, following our preprocessing strategy. The specific ModelNet10 data used for compression performance evaluation is detailed in Table 4.1.

For the experiments related to the classification benchmark, most methods utilize sparse point cloud data.

Table 4.1: Statistics of the point cloud data in the selected ModelNet10 dataset.

To obtain sparse point cloud data, we use the Farthest Point Sampling (FPS) algorithm [54] to generate sparse point clouds with 1,024 points from the dense point clouds. This approach ensures efficient processing while preserving essential information.
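A minimal NumPy sketch of FPS follows (greedy farthest-point selection; the random seed point is an implementation choice):

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, k: int = 1024) -> np.ndarray:
    """Minimal FPS sketch [54]: iteratively pick the point farthest from the
    already-selected set, yielding k well-spread points from a dense cloud."""
    n = points.shape[0]
    selected = np.zeros(k, dtype=np.int64)
    selected[0] = np.random.randint(n)                 # random seed point
    dist = np.linalg.norm(points - points[selected[0]], axis=1)
    for i in range(1, k):
        selected[i] = int(np.argmax(dist))             # farthest point so far
        new_d = np.linalg.norm(points - points[selected[i]], axis=1)
        dist = np.minimum(dist, new_d)                 # distance to selected set
    return points[selected]

sparse_pc = farthest_point_sampling(np.random.rand(500_000, 3), k=1024)
```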

Experiments

Implementation

We fine-tuned the pre-trained PCGCv2 model on the ModelNet10 dataset voxelized at resolution 128. We trained seven models with varying bit rates by adjusting the λ parameter in the loss function, using values from 0.25 to 10.0. The Adam optimizer with an initial learning rate of 0.0008 was used for ten epochs with a batch size of 8, halving the learning rate after each epoch until it reached 1e-5.

After training PCGCv2 on the ModelNet10 dataset, its encoder, entropy bottleneck, and decoder are frozen. The base branch is then trained for point cloud classification, optimizing accuracy across various values of λ_b, the hyperparameter controlling the rate-distortion trade-off of the base branch. The training loss function for the base branch is defined as follows:

L_b = R_b + λ_b · D_b,   (4.1)

where D_b measures the distortion of the point cloud classification; in this work, we use the cross-entropy loss for the classification task.

A representation Z_b optimized for D_b may include information that is not necessarily beneficial for the reconstruction of Y. Furthermore, the information in Z_b is structured in a manner suited to the computer vision task T, and mapping it back to the latent space Y through h_r(·) can be challenging. To address these challenges, we add a minor reconstruction penalty on the transformed Z_b to the final loss function for training the base branch, with hyperparameter β = 0.1. To cover a wide range of bit rates, we vary the λ_b parameter in the loss function over the values 20000, 16000, 12000, 8000, 4000, 1000, 600, 200, and 160. During training, we use the Adam optimizer with an initial learning rate of 0.001, decreasing the learning rate by a multiplicative factor of 0.1 every 5 epochs. The base branch is trained for 100 epochs with a batch size of 16; if the validation loss does not decrease for 5 epochs, training is stopped early.
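Putting the pieces together, the base-branch objective described above can be sketched as follows. The use of MSE for the reconstruction penalty is our assumption; the text only specifies "a minor reconstruction penalty" with β = 0.1.

```python
import torch.nn.functional as F

def base_branch_loss(logits, labels, rate_bits, z_b_rec, y, lam_b, beta=0.1):
    """Sketch of the base-branch objective: cross-entropy distortion D_b
    weighted by lambda_b, the rate of the base bitstream, and a small
    penalty (beta = 0.1) keeping the transformed Z_b mappable back to the
    latent space Y. MSE for the penalty is an assumption of ours."""
    d_b = F.cross_entropy(logits, labels)   # classification distortion D_b
    rec = F.mse_loss(z_b_rec, y)            # reconstruction penalty on h_r(Z_b)
    return rate_bits + lam_b * d_b + beta * rec
```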

After training the base branch, our initial goal is to maximize the performance of the machine vision task within an appropriate bit budget. Therefore, based on the results of the base branch, we select a model with good performance and continue with it when training the enhancement branch. To train the enhancement branch, we freeze the layers used in the base branch. The loss function for the enhancement branch again trades off the rate of the enhancement bitstream against the reconstruction distortion, with the balance controlled by λ_r.

The enhancement branch was trained using the Adam optimizer with an initial learning rate of 0.005, halved each epoch until reaching 1e-5, for a maximum of 50 epochs. Early stopping ended training if the validation loss did not decrease for five consecutive epochs. A single rate-control value, λ_r = 0.5, was used throughout training.

This study compared the classification performance of PointNet, PointNet++, and MinkPointNet. We implemented PointNet and PointNet++ using open-source code from Yan (2019), while MinkPointNet was implemented with the MinkowskiEngine library by Choy et al. (2019). Default parameter settings were used for training all models.

We also implemented Ulhaq and Bajić's (2023) point cloud compression method for classification using their GitHub repository. Additionally, a general-purpose codec, G-PCC, was combined with PointNet++ for point cloud classification, representing the non-specialized approach.

For human visualization performance, we evaluated two reference methods: G-PCC and GEO-CNNv2. G-PCC, obtained from the MPEG GitHub repository, was configured by setting the positionQuantizationScale parameter to different values while keeping the other settings at their defaults. For GEO-CNNv2, we implemented the model architecture in PyTorch based on the open-source code of OpenPointCloud. Training parameters were kept consistent with the original paper, and the bitrate was varied by adjusting the λ parameter in the loss function.

Performance assessment

For point cloud classification, the improvements in rate and accuracy relative to a reference codec are calculated using the Bjøntegaard-Delta (BD) method [27].

To compare the performance of different methods for human visualization, we analyze their rate-distortion curves using the BD-rate. This metric is computed using the Mean Squared Error (MSE) of both point-to-point (D1) and point-to-plane (D2) distances as distortion metrics, plotted against the bit rate measured in bits per point (bpp). This analysis shows how well each method compresses the data while maintaining reconstruction quality across bit rates.
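For reference, the BD computation can be sketched as follows; this is the standard cubic-fit formulation of [27], with interface names of our choosing.

```python
import numpy as np

def bd_rate(rate_ref, dist_ref, rate_test, dist_test):
    """Sketch of the Bjøntegaard-Delta rate: fit cubic polynomials of
    log-rate as a function of quality (PSNR or accuracy), integrate both
    curves over the shared quality range, and report the average rate
    difference in percent (negative = bitrate savings)."""
    lr_ref, lr_test = np.log10(rate_ref), np.log10(rate_test)
    p_ref = np.polyfit(dist_ref, lr_ref, 3)
    p_test = np.polyfit(dist_test, lr_test, 3)
    lo = max(min(dist_ref), min(dist_test))      # overlapping quality range
    hi = min(max(dist_ref), max(dist_test))
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_diff = (int_test - int_ref) / (hi - lo)  # mean log-rate difference
    return (10 ** avg_diff - 1) * 100            # percent rate change
```

The same routine serves the rate-accuracy case by passing accuracies in place of PSNR values.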

Results

Point cloud compression for classification

Figure 4.2 illustrates the rate-accuracy trade-offs of the proposed codec against other input compression codecs, alongside comparisons with PointNet++, PointNet, and MinkPointNet, popular point cloud classification methods. This comparison provides insight into the performance of the proposed codec in the context of point cloud processing.

PointNet++ achieved the highest accuracy, surpassing MinkPointNet and PointNet by 1-2%. MinkPointNet and PointNet exhibited similar performance levels. The proposed method's accuracy was comparable to MinkPointNet's, demonstrating its effectiveness in point cloud analysis.

Among the compression-based approaches, the codec of Ulhaq and Bajić [50] achieved the best rate-accuracy performance. Table 4.2 presents the peak accuracy of each codec, together with its Bjøntegaard-Delta (BD) rate improvement over the baseline, G-PCC + PointNet++.

Table 4.2: BD rate and maximum accuracies per codec evaluated on ModelNet10 voxelized with three resolutions. P is the number of points in the input X; "*" denotes a dense point cloud.

Codec                                  Max acc (%)   BD-rate (%)

r = 64
Ulhaq and Bajic (2023) [P=1024]        91.1          -98.9
Proposed Codec λ=0.25 [P=*]            92.2          -76.6
Proposed Codec λ=0.5 [P=*]             91.9          -68.3

r = 128
Ulhaq and Bajic (2023) [P=1024]        91.2          -98.9
Proposed Codec λ=0.25 [P=*]            92.5          -72.5
Proposed Codec λ=0.5 [P=*]             91.9          -62.3
Proposed Codec λ=1.0 [P=*]             92.1          -70.4
Proposed Codec λ=2.0 [P=*]             92.0          -56.5

r = 256
Ulhaq and Bajic (2023) [P=1024]        91.0          -98.9
Proposed Codec λ=0.25 [P=*]            91.4          -61.6
Proposed Codec λ=0.5 [P=*]             91.2          -62.7
Proposed Codec λ=1.0 [P=*]             91.9          -53.8
Proposed Codec λ=2.0 [P=*]             91.3          -66.9
Proposed Codec λ=4.0 [P=*]             92.1          -62.0
Proposed Codec λ=6.0 [P=*]             91.7          -56.6
Proposed Codec λ=10.0 [P=*]            91.4          -56.6

[Figure 4.2 legend: PointNet++ (official) 93.20%; PointNet (official) 91.56%; MinkPointNet 91.74%; Ulhaq and Bajic (2023) [P=1024]; G-PCC + PointNet++ [P=1024]; Proposed Codec at λ = 0.25, 0.5, 1.0, 2.0, 4.0, 6.0, and 10.0 [P=*].]

Figure 4.2: Rate-accuracy curves evaluated on the ModelNet10 dataset voxelized with three resolutions (r): (a) 64, (b) 128, and (c) 256. P is the number of points in the input X; "*" denotes a dense point cloud.

Point cloud compression for humans

This experiment evaluates the adaptability of the compression methods across different resolutions. Our method excels at low resolutions, but its performance decreases with increasing resolution relative to other approaches. GEO-CNNv2, by contrast, shows a significant improvement in compression performance as resolution increases. The residual method has a minimal impact on coding time, which remains comparable to the original method, with all methods achieving encoding times below 1 second; however, GEO-CNNv2 has the longest decoding time.

Table 4.3: BD-rate gains on ModelNet10 of the proposed codec against other compression methods, using D1- and D2-based BD-rate measurements.

[Table 4.3: columns Metrics | G-PCC | GEO-CNNv2 | PCGCv2 | PCGCv2 (finetuning); numeric entries omitted.]

Table 4.4: Average coding time of different methods on the ModelNet10 dataset.

[Table 4.4: methods G-PCC | GEO-CNNv2 | PCGCv2 | PCGCv2 (finetuning) | Proposed codec, reported per resolution r; numeric entries omitted.]

Discussion

Figure 4.2 compares the classification performance of PointNet, PointNet++, and MinkPointNet, showing that voxelization has no significant impact on their accuracy. This is attributed to the normalization of the input data, which ensures consistent processing across the different methods.

[Figure 4.3 legend: G-PCC, GEO-CNNv2, PCGCv2, PCGCv2 (finetuning), Proposed Codec.]

Figure 4.3: Rate-distortion curves on the ModelNet10 dataset: (left) D1-based PSNR, (right) D2-based PSNR, with three resolutions (r): (a) 64, (b) 128, and (c) 256.

Additional attribute information (e.g., normals) can enhance classification efficiency [22, 50]. MinkPointNet is the simplest method in terms of implementation, with the fewest convolution layers compared to PointNet and PointNet++; nevertheless, its classification performance is still acceptable. Since the proposed task-specific codec uses MinkPointNet as its backbone, improving the performance of MinkPointNet is essential for its application in point cloud compression for classification.

The proposed method outperforms the non-specific codecs, as shown in Figure 4.2 and Table 4.2. While it currently lags behind Ulhaq and Bajić's (2023) method in rate-accuracy performance, our approach tackles two tasks simultaneously and focuses on dense point cloud geometry data, which may contribute to this difference.

While the PCGCv2 methods show decreased compression performance at higher resolutions, even with fine-tuning, GEO-CNNv2 excels on ModelNet at higher resolutions. This suggests that the partition process in GEO-CNNv2, which compresses smaller blocks rather than the entire point cloud, enhances compression efficiency. The lower performance on the ModelNet dataset compared to MPEG datasets might be attributed to its more complex point cloud data, generated from meshes and containing internal structural information. However, the limited hardware resources of this study restricted experiments to resolutions of 256 and below, highlighting the need for further exploration at higher resolutions.

Table 4.4 provides information on the coding time of the methods. It is evident that as the resolution increases, the encoding time increases accordingly. The proposed method demonstrates acceptable encoding times, staying under 1 second. We do not observe significant differences between the encoding times of the proposed method and PCGCv2, indicating that integrating an additional classification task does not significantly increase the encoding time. For GEO-CNNv2, the decoding time increases significantly at higher resolutions; reconstructing the original point cloud from the compressed data requires several computationally intensive steps, especially for high-resolution point clouds, highlighting a limitation of GEO-CNNv2.

Conclusions

This research introduces a scalable point cloud coding framework using the residual method, designed for both machine vision (classification) and human visualization (reconstruction). The method's effectiveness is evaluated on the ModelNet10 dataset, demonstrating competitive performance on the classification task. While the method also shows promise for reconstruction, its effectiveness diminishes for high-resolution point cloud data, though the results remain comparable to established methods.

Future works

This study identifies several promising avenues for future research. These include enhancing the performance of human visualization, particularly for high-resolution data; investigating the method's adaptability across diverse datasets; and exploring the integration of attribute information for improved classification and reconstruction.

To ensure the practical utility of our framework, it is vital to assess its scalability across diverse scenarios. Additionally, a thorough examination of how point cloud data characteristics, such as density, sparsity, and structural complexity, impact performance can provide valuable insights for future optimizations.

References

[1] M. Alain, E. Zerman, C. Ozcinar, and G. Valenzise, "Introduction to immersive video technologies," in Immersive Video Technologies. Elsevier, 2023, pp. 3–24.

[2] S. Schwarz, M. Preda, V. Baroncini, M. Budagavi, P. Cesar, P. A. Chou, R. A. Cohen, M. Krivokuća, S. Lasserre, Z. Li et al., "Emerging MPEG standards for point cloud compression," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 1, pp. 133–148, 2018.

[3] E. Alexiou, N. Yang, and T. Ebrahimi, "PointXR: A toolbox for visualization and subjective evaluation of point clouds in virtual reality," in 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX). IEEE, 2020, pp. 1–6.

[4] S. Lim, M. Shin, and J. Paik, "Point cloud generation using deep adversarial local features for augmented and mixed reality contents," IEEE Transactions on Consumer Electronics, vol. 68, no. 1, pp. 69–76, 2022.

[5] Q. Wang and M.-K. Kim, "Applications of 3D point cloud data in the construction industry: A fifteen-year review from 2004 to 2018," Advanced Engineering Informatics, vol. 39, pp. 306–319, 2019.

[6] J. Franczuk et al., "Direct use of point clouds in real-time interaction with the cultural heritage in pandemic and post-pandemic tourism on the case of Kłodzko Fortress," Digital Applications in Archaeology and Cultural Heritage, 2022.

[7] G. Valenzise, M. Quach, D. Tian, J. Pang, and F. Dufaux, "Point cloud compression," in Immersive Video Technologies. Elsevier, 2023, pp. 357–385.

[8] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, "Variational image compression with a scale hyperprior," arXiv preprint arXiv:1802.01436, 2018.

[9] D. Minnen, J. Ballé, and G. D. Toderici, "Joint autoregressive and hierarchical priors for learned image compression," Advances in Neural Information Processing Systems, vol. 31, 2018.

[10] M. Quach, G. Valenzise, and F. Dufaux, "Improved deep point cloud geometry compression," in 2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP). IEEE, 2020, pp. 1–6.

[11] J. Wang, D. Ding, Z. Li, X. Feng, C. Cao, and Z. Ma, "Sparse tensor-based multiscale representation for point cloud geometry compression," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.

[12] Y. Guo, H. Wang, Q. Hu, H. Liu, L. Liu, and M. Bennamoun, "Deep learning for 3D point clouds: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 12, pp. 4338–4364, 2020.

[13] H. Lu and H. Shi, "Deep learning for 3D point cloud understanding: A survey," arXiv preprint arXiv:2009.08920, 2020.

[14] M. J. Gómez, F. García, D. Martín, A. de la Escalera, and J. M. Armingol, "Intelligent surveillance of indoor environments based on computer vision and 3D point cloud fusion," Expert Systems with Applications, vol. 42, no. 21, pp. 8156–8171, 2015.

[15] Y. Cui, R. Chen, W. Chu, L. Chen, D. Tian, Y. Li, and D. Cao, "Deep learning for image and point cloud fusion in autonomous driving: A review," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 2, pp. 722–739, 2021.

[16] M. Rutzinger, B. Höfle, M. Hollaus, and N. Pfeifer, "Object-based point cloud analysis of full-waveform airborne laser scanning data for urban vegetation classification," Sensors, vol. 8, no. 8, pp. 4505–4528, 2008.

[17] E. Alexiou, E. Upenik, and T. Ebrahimi, "Towards subjective quality assessment of point cloud imaging in augmented reality," in 2017 IEEE 19th International Workshop on Multimedia Signal Processing (MMSP). IEEE, 2017, pp. 1–6.

[18] M. Quach, G. Valenzise, and F. Dufaux, "Learning convolutional transforms for lossy point cloud geometry compression," in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019.

[19] D. T. Nguyen, M. Quach, G. Valenzise, and P. Duhamel, "Lossless coding of point cloud geometry using a deep generative model," IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 12, pp. 4617–4629, 2021.

[20] A. Javaheri, C. Brites, F. Pereira, and J. Ascenso, "Point cloud rendering after coding: Impacts on subjective and objective quality," IEEE Transactions on Multimedia, vol. 23, pp. 4049–4064, 2020.

[21] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, "PointNet: Deep learning on point sets for 3D classification and segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[22] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, "PointNet++: Deep hierarchical feature learning on point sets in a metric space," Advances in Neural Information Processing Systems, vol. 30, 2017.

[23] ISO/IEC JTC 1/SC 29/WG 1 (ITU-T SG16), "Final call for proposals on JPEG Pleno point cloud coding," in 94th Meeting, 2022.

[24] I. V. Bajić, W. Lin, and Y. Tian, "Collaborative intelligence: Challenges and opportunities," in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 8493–8497.

[25] A. de Andrade, A. Harell, Y. Foroutan, and I. V. Bajić, "Conditional and residual methods in scalable coding for humans and machines," arXiv preprint arXiv:2305.02562, 2023.

[26] H. Schwarz, D. Marpe, and T. Wiegand, "Overview of the scalable video coding extension of the H.264/AVC standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 9, pp. 1103–1120, 2007.

[27] C. Cao, M. Preda, and T. Zaharia, "3D point cloud compression: A survey," in The 24th International Conference on 3D Web Technology, 2019, pp. 1–9.

[28] S. Gumhold, Z. Karni, M. Isenburg, and H.-P. Seidel, "Predictive point-cloud compression," in ACM SIGGRAPH 2005 Sketches, 2005, pp. 137–es.

[29] T. Ochotta and D. Saupe, "Compression of point-based 3D models by shape-adaptive wavelet coding of multi-height fields," 2004.

[30] J.-M. Lien, G. Kurillo, and R. Bajcsy, "Multi-camera tele-immersion system with real-time model driven data compression: A new model-based compression method for massive dynamic point data," The Visual Computer, vol. 26, pp. 3–15, 2010.

[31] D. C. Garcia and R. L. de Queiroz, "Intra-frame context-based octree coding for point-cloud geometry," in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 1807–1811.

[32] T. Golla and R. Klein, "Real-time point cloud compression," in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2015, pp. 5087–5092.

[33] R. L. de Queiroz and P. A. Chou, "Compression of 3D point clouds using a region-adaptive hierarchical transform," IEEE Transactions on Image Processing, vol. 25, no. 8, pp. 3947–3956, 2016.

[34] ISO/IEC JTC 1/SC 29/WG 11, "Call for proposals for point cloud compression v2," in MPEG 118 - Hobart, 2017.

[35] O. Nakagami, S. Lasserre, T. Sugio, and M. Preda, "White paper on G-PCC," ISO/IEC JTC 1/SC 29/AG 03 N0111, 2023.

[36] MPEG3DGC, "G-PCC codec description," ISO/IEC JTC 1/SC 29/WG 7 N0011, 2020.

[37] J. Wang, H. Zhu, H. Liu, and Z. Ma, "Lossy point cloud geometry compression via end-to-end learning," IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 12, pp. 4909–4923, 2021.

[38] W. Yan, S. Liu, T. H. Li, Z. Li, G. Li et al., "Deep autoencoder-based lossy geometry compression for point clouds," arXiv preprint arXiv:1905.03691, 2019.

[39] L. Gao, T. Fan, J. Wan, Y. Xu, J. Sun, and Z. Ma, "Point cloud geometry compression via neural graph sampling," in 2021 IEEE International Conference on Image Processing (ICIP). IEEE, 2021, pp. 3373–3377.

[40] Z. Que, G. Lu, and D. Xu, "VoxelContext-Net: An octree based framework for point cloud compression," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6042–6051.

[41] L. Huang, S. Wang, K. Wong, J. Liu, and R. Urtasun, "OctSqueeze: Octree-structured entropy model for LiDAR compression," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 1313–1323.

[42] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[43] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, 2017.

[44] J. Wang, D. Ding, Z. Li, and Z. Ma, "Multiscale point cloud geometry compression," in 2021 Data Compression Conference (DCC). IEEE, 2021, pp. 73–82.

[45] M. Mirbauer, M. Krabec, J. Křivánek, and E. Šikudová, "Survey and evaluation of neural 3D shape classification approaches," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 8635–8656, 2021.

[46] C. Choy, J. Gwak, and S. Savarese, "4D spatio-temporal ConvNets: Minkowski convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3075–3084.

[47] H. Choi and I. V. Bajić, "Scalable image coding for humans and machines," IEEE Transactions on Image Processing, vol. 31, pp. 2739–2754, 2022.

[48] L. Liu, Z. Hu, and J. Zhang, "PCHM-Net: A new point cloud compression framework for both human vision and machine vision," in 2023 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2023, pp. 1997–2002.

[49] X. Ma, Y. Xu, X. Zhang, L. Tang, K. Zhang, and L. Zhang, "HM-PCGC: A human-machine balanced point cloud geometry compression scheme," in 2023 IEEE International Conference on Image Processing (ICIP). IEEE, 2023, pp. 2265–2269.

[50] M. Ulhaq and I. V. Bajić, "Learned point cloud compression for classification," in 2023 IEEE 25th International Workshop on Multimedia Signal Processing (MMSP). IEEE, 2023, pp. 1–6.

[51] N. Tishby, F. C. Pereira, and W. Bialek, "The information bottleneck method," arXiv preprint physics/0004057, 2000.

[52] D. He et al., "ELIC: Efficient learned image compression with unevenly grouped space-channel contextual adaptive coding," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.

[53] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, "3D ShapeNets: A deep representation for volumetric shapes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1912–1920.

[54] Y. Eldar, M. Lindenbaum, M. Porat, and Y. Y. Zeevi, "The farthest point strategy for progressive image sampling," IEEE Transactions on Image Processing, vol. 6, no. 9, pp. 1305–1315, 1997.

[55] X. Yan, "PointNet/PointNet++ PyTorch," 2019. [Online]. Available: https://github.com/yanx27/Pointnet_Pointnet2_pytorch

[56] W. Gao, H. Ye, G. Li, H. Zheng, Y. Wu, and L. Xie, "OpenPointCloud: An open-source algorithm library of deep learning based point cloud compression," in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 7347–7350.

[57] C.-H. Wu, C.-F. Hsu, T.-K. Hung, C. Griwodz, W. T. Ooi, and C.-H. Hsu, "Quantitative comparison of point cloud compression algorithms with PCC Arena," IEEE Transactions on Multimedia, 2022.
