Abstract
Point clouds are a crucial geometric data type. Due to their irregular format, most researchers transform such data to regular voxel grids or collections of images. This, however, tends to obscure or modify important properties of the data. PointNet (2017) is a novel type of neural network that directly consumes point cloud data while respecting permutation invariance of the input points, an important characteristic of point cloud data. The authors provide a unified architecture for applications ranging from object classification and part segmentation to scene semantic parsing, where PointNet shows strong performance on par with or even better than the state of the art. Additionally, the authors provide experiments validating PointNet design choices, and a theoretical analysis for understanding its empirical robustness with respect to input perturbation and corruption.
Note: This is a tex-to-markdown pandoc transformation of a seminar report from summer semester 2022. The transformation was not 100% successful, but I fixed all links and most formatting. You can find my presentation here.
Introduction
PointNet (Charles R. Qi et al. 2017) explores a deep learning architecture capable of reasoning about 3D point cloud data. Previous architectures (Zhirong Wu et al. 2015; Maturana and Scherer 2015; Charles R. Qi et al. 2016; Su et al. 2015; Fang et al. 2015; Guo, Zou, and Chen 2015) required transformation to other representations, typically voxel grids or collections of images (e.g. views from different camera angles) before feeding them to a deep net architecture. These intermediate representations and their transformations might however obscure or modify important properties, such as invariance to geometric transformations or invariance over input order permutations. Point clouds avoid the combinatorial complexities and irregularities of meshes (e.g. from voxelization) and other intermediate representations. To maintain invariance over input order permutations and invariance over certain geometric transformations, it becomes necessary to make certain computations symmetric.
The network uses point clouds as input, and outputs class labels either for the entire input point cloud or per point. Individual points are processed identically and independently in the initial stages, from just their (x,y,z) coordinates, optionally with additional dimensions computed from normals and other local or global features, e.g., point density. The key to this approach is a single symmetric function that integrates the local features of different points into a global feature descriptor. The network effectively learns a set of criteria for interesting or informative points of the point cloud. These criteria are then combined over all points of the point cloud, and fully connected layers turn the resulting global descriptor into a set function approximator, which can be used for shape classification or per-point labeling (shape segmentation).
The paper provides a theoretical analysis and experimental evaluation of the approach. It shows that the network can approximate any continuous set function. The network summarizes an input cloud by a sparse set of key points, which roughly corresponds to the outline of objects when visualized (see Sec. 3.3.4). This explains the high robustness against small perturbations to input points and corruption through point insertion of outliers or deletion. When compared on a number of benchmark datasets, ranging from shape classification, part segmentation to scene segmentation, PointNet dominates in speed (see Table IV) and shows strong performance on par or better than state of the art at time of publication while providing a unified architecture (see Table II and Table I).
The key contributions of the PointNet paper are as follows:

Design of a novel deep net architecture suitable for unordered point sets in 3D;

Showing how such a net can be trained to perform 3D shape classification, shape part segmentation and scene semantic parsing tasks;

Thorough empirical and theoretical analysis on stability and efficiency of the method;

Illustration of the 3D features computed by the selected neurons;

Developing intuitive explanations for its performance.
The problem of processing unordered sets by neural nets is a very general and fundamental problem, resulting in the transferal of key ideas to many other problems and their respective domains, which explains the 7549 citations at time of writing. The paper was a trailblazer at its time and is still highly influential today.
Related Work
Deep Learning on 3D Data
Volumetric CNNs (Zhirong Wu et al. 2015; Maturana and Scherer 2015; Charles R. Qi et al. 2016) apply conventional 3D convolutional neural networks to voxelized shapes. However, data sparsity and the computation cost of 3D convolution constrain the resolution of the volumetric representation.
Multi-view CNNs (Su et al. 2015; Charles R. Qi et al. 2016) first render the 3D point cloud into multiple 2D images and then apply 2D conv nets for image classification. Given sufficient computational resources, they achieve dominating performance thanks to well-engineered image CNNs. However, it is difficult to extend image CNNs to other 3D or point-based tasks.
Feature-based DNNs (Fang et al. 2015; Guo, Zou, and Chen 2015) extract traditional shape features and convert 3D data to a vector before using a fully connected net for shape classification. They appear to be limited by the representative power of the extracted features.
Deep Learning on Unordered Sets
From the data structure point of view, a point cloud is an unordered set of vectors. Most works in deep learning, however, look at regular input structures such as ordered sequences of images, volumes or points. Unordered point clouds are rarely considered. One recent work (Vinyals, Bengio, and Kudlur 2015) attempts to impose order on unordered input sets via the attention mechanism. That work focuses on generic sets and NLP applications, whose sets lack the geometric structure of point clouds.
Based on PointNet
There exist a number of works explaining, applying and building upon PointNet. The influence of PointNet can furthermore be seen in the ecosystem of different implementations and tools for visualization (charlesq34 2019; aldipiroli 2021; yunxiaoshi 2021; Yan 2019). Different attempts to explain what PointNet learned (B. Zhang et al. 2019; Huang et al. 2019) exist, and many apply PointNet to different domains and different problems (Thiery et al. 2022; Gutiérrez-Becker and Wachinger 2018; Triess et al. 2021; Liang et al. 2019; W. Zhang et al. 2018; Mrowca et al. 2018).
Derivations of PointNet
PointNet, in addition to being successful itself, sees successful use as a module, similar to a convolution layer, in more sophisticated neural network architectures. This can best be seen in architectures such as PointNet++ (Charles Ruizhongtai Qi et al. 2017) and SyncSpecCNN (Yi et al. 2017). Moreover, an even larger number of architectures are heavily inspired by PointNet or adapt core ideas of the PointNet architecture (Jiang, Wu, and Lu 2018; Wang et al. 2018; Yu et al. 2018; Gutiérrez-Becker and Wachinger 2018; Charles R. Qi et al. 2018; Li et al. 2018). This includes architectures both for 3D point cloud data and for traditionally ordered input data structures.
Problem Statement
The authors design a neural network architecture that directly consumes unordered point sets as input. A point cloud is a set of 3D points {P_{i} | i = 1, …, n}. Each point P_{i} is a vector of (x,y,z) coordinates and additional feature channels such as color, normal etc. For simplicity and clarity, unless otherwise noted, only the (x,y,z) coordinates are used as a point’s feature channels.
Input point clouds for object classification are either directly sampled from a shape or pre-segmented from a scene point cloud. For classification, PointNet outputs k scores, one for each of the k candidate classes. For semantic segmentation, the input is a single object (part region segmentation) or a sub-volume from a 3D scene. For n input points, PointNet outputs n × m scores: one score per point for each of the m semantic subcategories.
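| | mean | aero | bag | cap | car | chair | ear phone | guitar | knife | lamp | laptop | motor | mug | pistol | rocket | skate board | table |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| # shapes | - | 2690 | 76 | 55 | 898 | 3758 | 69 | 787 | 392 | 1547 | 451 | 202 | 184 | 283 | 66 | 152 | 5271 |
| Wu (Zizhao Wu et al. 2014) | - | 63.2 | - | - | - | 73.5 | - | - | - | 74.4 | - | - | - | - | - | - | 74.8 |
| Yi (Yi et al. 2016) | 81.4 | 81.0 | 78.4 | 77.7 | 75.7 | 87.6 | 61.9 | 92.0 | 85.4 | 82.5 | 95.7 | 70.6 | 91.9 | 85.9 | 53.1 | 69.8 | 75.3 |
| 3DCNN | 79.4 | 75.1 | 72.8 | 73.3 | 70.0 | 87.2 | 63.5 | 88.4 | 79.6 | 74.4 | 93.9 | 58.7 | 91.8 | 76.4 | 51.2 | 65.3 | 77.1 |
| PointNet | 83.7 | 83.4 | 87.7 | 82.5 | 74.9 | 89.6 | 73.0 | 91.5 | 85.9 | 80.8 | 95.3 | 65.2 | 93.0 | 81.2 | 57.9 | 72.8 | 80.6 |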
Table I: Segmentation results on the ShapeNet part dataset. The metric used is mIoU (%) on points. The authors compare against two traditional methods (Zizhao Wu et al. 2014) and (Yi et al. 2016), as well as a 3D fully convolutional network baseline that they propose themselves. The authors observe that PointNet achieves a new state of the art in mean IoU. Table from (Charles R. Qi et al. 2017).
Deep Learning on Point Sets
The architecture of PointNet (Sec. 3.3) is guided by the characteristics of point sets in ℝ^{n} (Sec. 2.1).
Properties of Point Sets in ℝ^{n}
Unordered
Contrary to most regular data structures, which depend on order, a point cloud is a set of points without a specific order. A network that consumes 3D point sets of size N therefore needs to be invariant to all N! permutations of the input order.
Interaction among points
As in more regular data structures, the points are not isolated: they live in a space with a distance metric, and neighboring points form meaningful subsets that make up local and global structures. Hence, the model needs to be able to recognize the spatial relations between nearby points and the importance of individual points.
Invariance under transformations
The important information in point cloud data is the spatial relation between points and their relative location. The learned representation should therefore be invariant to certain geometric transformations: translating or rotating all points together should change neither the classification of the point cloud nor the segmentation of its points.
PointNet Architecture
The architecture of PointNet and its pipeline are explained in Figure 2 and its caption. The three most important parts are as follows. First, a data-dependent canonicalization through two T-Nets, which predict a 3 × 3 input transformation matrix and a 64 × 64 feature transformation matrix, respectively; each matrix is applied to every point individually. Second, a structure for combining local and global information for segmentation, in which per-point local features are concatenated with the global feature of the point cloud. Third, and most importantly, a max pooling layer used as a symmetric function to aggregate information from all points, which keeps the result invariant to permutations of the input order.
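The role of max pooling as a symmetric function can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the layer sizes, the random weights `W1` and `W2`, and the helper names `per_point_features` and `global_feature` are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights of a tiny shared per-point MLP h: R^3 -> R^16.
W1 = rng.normal(size=(3, 8))
W2 = rng.normal(size=(8, 16))

def per_point_features(points):
    """Apply the same two-layer ReLU MLP to every point independently."""
    hidden = np.maximum(points @ W1, 0.0)
    return np.maximum(hidden @ W2, 0.0)

def global_feature(points):
    """Aggregate per-point features with max, a symmetric function."""
    return per_point_features(points).max(axis=0)

cloud = rng.normal(size=(128, 3))                  # n = 128 points with (x, y, z)
shuffled = cloud[rng.permutation(len(cloud))]      # same set, different order

# The global feature does not depend on the order of the input points.
same = np.allclose(global_feature(cloud), global_feature(shuffled))
print(same)
```

Because every point passes through the same function and the maximum over a set does not depend on the order of its elements, any reordering of the rows of `cloud` yields the same global feature.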
Effectively, each T-Net is a mini-PointNet with multiple MLP layers and max pooling. T1 takes the raw point cloud as input and regresses to a 3 × 3 transformation matrix, while T2 operates on the 64-dimensional per-point features and regresses to a 64 × 64 matrix. A small regularization loss is added to the softmax classification loss to keep the 64 × 64 feature transformation matrix close to orthogonal. See Sec. 3.3.2 and the supplementary material for details.
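As far as I can reconstruct it from the original paper, the regularization term has the form L_reg = ‖I − A Aᵀ‖²_F, where A is the predicted feature transformation matrix. The following sketch evaluates this penalty with NumPy; the function name `orthogonality_penalty` and the test matrices are assumptions for illustration, and the weight with which the term enters the total loss is left out.

```python
import numpy as np

def orthogonality_penalty(A):
    """Squared Frobenius norm of (I - A A^T); zero iff A is orthogonal."""
    identity = np.eye(A.shape[0])
    return float(np.sum((identity - A @ A.T) ** 2))

rng = np.random.default_rng(0)

# A QR decomposition yields an orthogonal 64 x 64 matrix Q.
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
# Perturbing Q moves it away from the set of orthogonal matrices.
noisy = Q + 0.1 * rng.normal(size=(64, 64))

penalty_orthogonal = orthogonality_penalty(Q)   # numerically close to zero
penalty_noisy = orthogonality_penalty(noisy)    # clearly positive
```

An orthogonal transformation does not lose information in the input, which is a plausible reason for pushing the high-dimensional feature transformation in this direction.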
Theoretical Analysis
The authors further show the universal approximation ability of the PointNet neural network to continuous set functions. The detailed proof for this is provided in the supplementary material to (Charles R. Qi et al. 2017). Relevant insights include the following:

In the worst case, the network can learn to convert a point cloud to a volumetric representation, though visualization shows that the network learns a much smarter strategy.

Intuitively, since the set function is continuous, a small perturbation to the input set should barely affect the function values, e.g., classification or segmentation scores.

Because the global feature vector is obtained by taking the max over all local feature vectors, a finite subset of input points, called the critical point set, fully determines the result of a classification with PointNet.
When combined, this explains the robustness seen w.r.t. point perturbation, corruption and extra noise points (see Sec. 3.3.3 on robustness tests). Intuitively, the network summarizes a shape by a sparse set of key points (see Sec. 3.3.4 for details and visualization).
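For reference, the two statements behind these insights can be written compactly. This is my paraphrase of the theorems in the original paper, not a verbatim quote: f denotes a continuous set function (with respect to the Hausdorff distance) on the family 𝒳 of point sets of size n, h is the per-point feature function, γ the function applied after max pooling, and K the dimension of the max pooling layer. In the second statement, f stands for the network function γ ∘ u itself.

```latex
% Universal approximation (paraphrased): for every eps > 0 there exist a
% continuous h and a continuous gamma such that
\[
\left| f(S) - \gamma\!\left( \max_{x_i \in S} \{ h(x_i) \} \right) \right| < \epsilon
\qquad \text{for all } S \in \mathcal{X}.
\]

% Bottleneck and stability (paraphrased): with u(S) = \max_{x_i \in S} \{ h(x_i) \}
% in R^K and f = gamma \circ u, every S has a critical set C_S and an upper
% bound set N_S such that
\[
\mathcal{C}_S \subseteq T \subseteq \mathcal{N}_S \;\Longrightarrow\; f(T) = f(S),
\qquad |\mathcal{C}_S| \le K .
\]
```

The second statement is what makes the robustness intuitive: the output stays the same as long as the critical points survive and any extra noise points remain inside the upper bound set.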
| method | input | #views | accuracy avg. class | accuracy overall |
|---|---|---|---|---|
| SPH (Kazhdan, Funkhouser, and Rusinkiewicz 2003) | mesh | - | 68.2 | - |
| 3DShapeNets (Zhirong Wu et al. 2015) | volume | 1 | 77.3 | 84.7 |
| VoxNet (Maturana and Scherer 2015) | volume | 12 | 83.0 | 85.9 |
| Subvolume (Charles R. Qi et al. 2016) | volume | 20 | 86.0 | 89.2 |
| LFD (Zhirong Wu et al. 2015) | image | 10 | 75.5 | - |
| MVCNN (Su et al. 2015) | image | 80 | 90.1 | - |
| Custom baseline | point | - | 72.6 | 77.4 |
| PointNet | point | 1 | 86.2 | 89.2 |
Table II: Classification results on ModelNet40. PointNet achieves state-of-the-art results among deep nets on 3D input. Table from (Charles R. Qi et al. 2017).
Experiments
Experiments are divided into four parts: standard benchmark comparison for classification (A) and part segmentation (B) tasks, empirical validation of architectural design choices (C), and analysis of time and space complexity (D).
3D Object Classification
As seen in Sec. 2.3, PointNet is a general set function approximator and can thus be trained for classification. A comparison with the then state of the art on the ModelNet40 (Zhirong Wu et al. 2015) dataset can be found in Table II. Previous methods focused on volumetric and multi-view image representations, while PointNet is the first to work directly on raw point clouds.
1024 points are uniformly sampled on mesh faces and normalized into a unit sphere. Augmentation during training is done by randomly rotating the object and adding Gaussian noise to the position of coordinates with zero mean and a standard deviation of 0.02.
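The normalization and augmentation steps can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: the text only says that objects are rotated randomly, so rotating about a single up axis (taken here to be y) is my assumption, the random `raw` array is a stand-in for points sampled from mesh faces, and the function names are hypothetical.

```python
import numpy as np

def normalize_to_unit_sphere(points):
    """Center the cloud at the origin and scale it to fit the unit sphere."""
    centered = points - points.mean(axis=0)
    radius = np.linalg.norm(centered, axis=1).max()
    return centered / radius

def augment(points, rng, sigma=0.02):
    """Rotate about the up axis by a random angle, then jitter every point."""
    angle = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(angle), np.sin(angle)
    # Assumption: y is the up axis.
    rotation = np.array([[c, 0.0, s],
                         [0.0, 1.0, 0.0],
                         [-s, 0.0, c]])
    # Zero-mean Gaussian noise with standard deviation sigma on each coordinate.
    noise = rng.normal(loc=0.0, scale=sigma, size=points.shape)
    return points @ rotation.T + noise

rng = np.random.default_rng(0)
raw = rng.uniform(-5.0, 5.0, size=(1024, 3))   # stand-in for 1024 sampled points
cloud = normalize_to_unit_sphere(raw)
augmented = augment(cloud, rng)
max_radius = np.linalg.norm(cloud, axis=1).max()
```

The augmentation is applied anew for every training sample, so the network rarely sees exactly the same point coordinates twice.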
The baseline uses an MLP on traditional features extracted from the point cloud (point density, D2, shape contour etc.). The authors attribute the small gap to multi-view based methods (MVCNN (Su et al. 2015)) to the loss of fine geometric details that rendered images can capture.
3D Object Part Segmentation
Part segmentation is a demanding recognition task. Based on a 3D scan or mesh, the task is to assign a part category label (e.g. chair leg, cup handle) to each point or face. The authors evaluate performance on the ShapeNet part dataset (Yi et al. 2016), which contains labels for 50 parts in 16 different categories, where most categories are labeled with two to five parts.
Part segmentation is formulated as a per-point classification problem, with the evaluation metric mIoU (mean Intersection over Union) on points. Due to the nature of geometric intersections, values above 0.5 (i.e. 50%) are traditionally seen as accurate classification, though generally higher is better.
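A minimal sketch of the metric for a single shape is given below. The handling of parts that appear neither in the prediction nor in the ground truth (counted as IoU 1) follows my reading of the original paper and should be treated as an assumed convention; the function name `shape_miou` and the toy labels are made up for illustration.

```python
import numpy as np

def shape_miou(pred, target, part_ids):
    """Mean IoU over the parts of one shape, from per-point part labels."""
    ious = []
    for part in part_ids:
        pred_mask = pred == part
        target_mask = target == part
        union = np.logical_or(pred_mask, target_mask).sum()
        if union == 0:
            # Assumed convention: a part absent from both counts as IoU 1.
            ious.append(1.0)
        else:
            intersection = np.logical_and(pred_mask, target_mask).sum()
            ious.append(intersection / union)
    return float(np.mean(ious))

# Toy example: 8 points, 3 part labels.
target = np.array([0, 0, 0, 1, 1, 1, 2, 2])
pred = np.array([0, 0, 1, 1, 1, 1, 2, 0])
score = shape_miou(pred, target, part_ids=[0, 1, 2])
# Part IoUs are 2/4, 3/4 and 1/2, so the mean is 1.75 / 3.
```

Averaging these per-shape values over all shapes of a category gives the per-category scores, which are reported in percent.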
The authors compare PointNet with two traditional methods (Zizhao Wu et al. 2014; Yi et al. 2016) that take advantage of point-wise geometry features, as well as a custom 3D CNN baseline. See the supplementary materials for details on modifications and architecture for the 3D CNN. Per-category and mean IoU (%) scores can be found in Table I. The authors observe that PointNet beats the selected reference methods in most categories.
Architecture Design Analysis
This section summarizes how the authors validate architectural design choices with empirical results.
Comparison with Alternative Order-invariant Methods
Experiments on alternative methods for order invariance, evaluated on the ModelNet40 shape classification problem, showed that max pooling empirically gives the best results as the symmetric function (accuracy of 87.1). Alternatives include average pooling (83.8), attention sum (83.0), LSTMs (78.5), and MLPs with sorted (45.0) and unsorted (24.2) input (Vinyals, Bengio, and Kudlur 2015). For more information, see Section 5.2 of (Charles R. Qi et al. 2017).
| Transform | accuracy |
|---|---|
| none | 87.1 |
| input (3x3) | 87.9 |
| feature (64x64) | 86.9 |
| feature (64x64) + reg. | 87.4 |
| both | 89.2 |
Table III: Effects of input feature transforms. Based on overall classification accuracy on the ModelNet40 (Zhirong Wu et al. 2015) test set. Table from (Charles R. Qi et al. 2017).
Effectiveness of Input and Feature Transformations
Interestingly, as can be seen in Table III, even the most basic PointNet architecture without T-Net transforms gives decent results. Adding both input and feature transformations, with regularization towards an orthogonal transformation matrix (see (Charles R. Qi et al. 2017) and supplemental material for details), provides an accuracy improvement of 2.1 percentage points (from 87.1 to 89.2).
Robustness Test
PointNet, while simple, is robust against both removing points and adding noise, as can be seen from the barely decreased accuracy on ModelNet40 under various corruptions in Figure 3. Training with additional point density information (XYZ+density) increases accuracy for inputs with fewer than 30% outliers. Even without it, the net retains 80% accuracy when 20% of the points are outliers. The reasons for losing only a little accuracy (2.4% with furthest sampling, 3.8% with random sampling when 50% of the points are missing) are explored in the next part, Sec. 3.3.4. Additional experiments on partial data via simulated Kinect scans and complete ShapeNet CAD models show that, although partial data is rather challenging, predictions are robust and reasonable (for details see (Charles R. Qi et al. 2017)).
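The two ways of dropping points in the robustness test can be sketched as follows. The greedy furthest point sampling below is the standard textbook variant and is not taken from the authors' code; the function name `farthest_point_sampling`, the `start` parameter, and the random stand-in cloud are assumptions for illustration.

```python
import numpy as np

def farthest_point_sampling(points, k, start=0):
    """Greedily pick k indices, each farthest from all points picked so far."""
    chosen = [start]
    # Distance from every point to its nearest already-chosen point.
    nearest = np.linalg.norm(points - points[start], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(nearest))
        chosen.append(idx)
        nearest = np.minimum(nearest, np.linalg.norm(points - points[idx], axis=1))
    return np.array(chosen)

rng = np.random.default_rng(0)
cloud = rng.normal(size=(1024, 3))      # stand-in for a normalized input cloud

keep = len(cloud) // 2                  # simulate 50% of the points missing
furthest_idx = farthest_point_sampling(cloud, keep)
random_idx = rng.choice(len(cloud), size=keep, replace=False)

furthest_subset = cloud[furthest_idx]
random_subset = cloud[random_idx]
```

Furthest sampling keeps the remaining points spread evenly over the shape, whereas random sampling may thin out some regions more than others, which fits the slightly larger accuracy loss reported for random sampling.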
Critical points and upper bound shape
As noted in Sec. 2.3, the network summarizes a shape by a sparse set of key points, called the critical point set. This is due to the continuity of the local feature functions and to the max operation, which ignores the local features of any point whose values are lower than those provided by the critical point set. Adding all points whose local feature values do not exceed those of the critical point set in any dimension generates the upper bound shape. Examples for both of these sets and the original shape are visualized in Figure 4. As can be seen in the visualization, the critical point set roughly corresponds to the skeleton of objects.
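How a critical point set arises from max pooling can be shown with a small NumPy sketch. The single random layer `W` is a hypothetical stand-in for the learned per-point network, and the helper names are assumptions; only the mechanism (collecting the points that win the max in at least one feature dimension) is the point of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 32))   # hypothetical single-layer per-point feature map

def per_point_features(points):
    """Shared per-point feature function with a ReLU non-linearity."""
    return np.maximum(points @ W, 0.0)

def critical_point_indices(points):
    """Indices of points that supply the maximum in at least one feature dimension."""
    return np.unique(per_point_features(points).argmax(axis=0))

cloud = rng.normal(size=(500, 3))
critical = critical_point_indices(cloud)

# The global feature of the critical subset equals that of the full cloud.
full = per_point_features(cloud).max(axis=0)
reduced = per_point_features(cloud[critical]).max(axis=0)
unchanged = np.allclose(full, reduced)
```

The critical set can never contain more points than the feature vector has dimensions (32 in this sketch), which is why a large cloud is summarized by a comparatively small number of points.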
| method | #params | FLOPs/sample |
|---|---|---|
| PointNet (vanilla) | 0.8M | 148M |
| PointNet | 3.5M | 440M |
| Subvolume (Charles R. Qi et al. 2016) | 16.6M | 3633M |
| MVCNN (Su et al. 2015) | 60.0M | 62057M |
Table IV: Time and space complexity of different deep learning architectures for 3D data classification. PointNet (vanilla) is the classification PointNet without input and feature T-Net transformation networks. FLOP stands for floating-point operation. The “M” stands for a million units. Both Subvolume and MVCNN used input data pooling from multiple rotations or views, without which they have much inferior performance. Table from (Charles R. Qi et al. 2017).
Time and Space Complexity Analysis
PointNet, due to its simple architecture and few parameters, can process more than one million points per second for point cloud classification (around 1K objects/second) or semantic segmentation (around 2 rooms/second) with a 1080X GPU on TensorFlow, which enables usage in real-time applications. Previous methods required substantially more computation (see Table IV), both for evaluation and especially for training, due to the higher number of parameters. PointNet (vanilla) trades about two percentage points of accuracy (87.1 vs. 89.2 in Table III) for a further reduction in computation by a factor of about 3 (148M vs. 440M FLOPs/sample), which makes it a strong candidate for complex real-time and real-world applications.
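The ratios behind these statements follow directly from Table IV. The short calculation below only uses the numbers from that table; the assumption of 1024 points per object for the throughput estimate is taken from the classification setup described earlier.

```python
# FLOPs per sample and parameter counts (both in millions) from Table IV.
flops = {"PointNet (vanilla)": 148, "PointNet": 440, "Subvolume": 3633, "MVCNN": 62057}
params = {"PointNet (vanilla)": 0.8, "PointNet": 3.5, "Subvolume": 16.6, "MVCNN": 60.0}

# Cost of each method relative to the full PointNet.
flops_ratio = {name: value / flops["PointNet"] for name, value in flops.items()}
params_ratio = {name: value / params["PointNet"] for name, value in params.items()}

# Additional saving of the vanilla variant over the full PointNet.
vanilla_speedup = flops["PointNet"] / flops["PointNet (vanilla)"]

# Throughput check: objects per second at one million points per second,
# assuming 1024 points per object.
objects_per_second = 1_000_000 / 1024

for name in flops:
    print(f"{name}: {flops_ratio[name]:.1f}x FLOPs, {params_ratio[name]:.1f}x params")
```

In FLOPs per sample, Subvolume is roughly 8 times and MVCNN roughly 141 times as expensive as the full PointNet, and MVCNN has roughly 17 times as many parameters.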
Conclusion and Outlook
PointNet is a newly proposed deep neural network architecture directly consuming point cloud data. This enables a unified approach to a number of 3D recognition tasks, including object classification, part segmentation and semantic segmentation. On each of these tasks, results on par or better than state of the art are obtained on standard benchmarks. Additionally, theoretical analysis and experimental validation of architectural design choices provide a deeper understanding.
Through its characteristic of a universal continuous set function approximator, PointNet sees usage as a module in other architectures for 3D point cloud data. Additionally, the core ideas of PointNet find successful adaptation by many architectures not just for 3D point cloud data, but also for many traditionally ordered input data structures.
Sources
 aldipiroli. 2021. “pointnet.” https://github.com/Aldipiroli/Pointnet, August.
 charlesq34. 2019. “pointnet.” https://github.com/Charlesq34/Pointnet, September.
 Fang, Yi, Jin Xie, Guoxian Dai, Meng Wang, Fan Zhu, Tiantian Xu, and Edward Wong. 2015. “3d Deep Shape Descriptor.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2319–28.
 Guo, Kan, Dongqing Zou, and Xiaowu Chen. 2015. “3d Mesh Labeling via Deep Convolutional Neural Networks.” ACM Transactions on Graphics (TOG) 35 (1): 1–12.
 Gutiérrez-Becker, Benjamín, and Christian Wachinger. 2018. “Deep Multi-Structural Shape Analysis: Application to Neuroanatomy.” In International Conference on Medical Image Computing and Computer-Assisted Intervention, 523–31. Springer.
 Huang, Shikun, Binbin Zhang, Wen Shen, and Zhihua Wei. 2019. “A CLAIM Approach to Understanding the PointNet.” In Proceedings of the 2019 2nd International Conference on Algorithms, Computing and Artificial Intelligence, 97–103.
 Jiang, Mingyang, Yiran Wu, and C Lu. 2018. “PointSIFT: A SIFT-Like Network Module for 3d Point Cloud Semantic Segmentation.” In Comput. Vis. Pattern Recognit.
 Kazhdan, Michael, Thomas Funkhouser, and Szymon Rusinkiewicz. 2003. “Rotation Invariant Spherical Harmonic Representation of 3d Shape Descriptors.” In Symposium on Geometry Processing, 6:156–64.
 Li, Yangyan, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. 2018. “Pointcnn: Convolution on X-Transformed Points.” Advances in Neural Information Processing Systems 31.
 Liang, Ming, Bin Yang, Yun Chen, Rui Hu, and Raquel Urtasun. 2019. “Multi-Task Multi-Sensor Fusion for 3d Object Detection.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7345–53.
 Maturana, Daniel, and Sebastian Scherer. 2015. “Voxnet: A 3d Convolutional Neural Network for Real-Time Object Recognition.” In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 922–28. IEEE.
 Mrowca, Damian, Chengxu Zhuang, Elias Wang, Nick Haber, Li F Fei-Fei, Josh Tenenbaum, and Daniel L Yamins. 2018. “Flexible Neural Representation for Physics Prediction.” Advances in Neural Information Processing Systems 31.
 Qi, Charles R, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. 2018. “Frustum Pointnets for 3d Object Detection from RGB-D Data.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 918–27.
 Qi, Charles R, Hao Su, Kaichun Mo, and Leonidas J Guibas. 2017. “Pointnet: Deep Learning on Point Sets for 3d Classification and Segmentation.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 652–60.
 Qi, Charles R, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas J Guibas. 2016. “Volumetric and Multi-View Cnns for Object Classification on 3d Data.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5648–56.
 Qi, Charles Ruizhongtai, Li Yi, Hao Su, and Leonidas J Guibas. 2017. “Pointnet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space.” Advances in Neural Information Processing Systems 30.
 Su, Hang, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. 2015. “Multi-View Convolutional Neural Networks for 3d Shape Recognition.” In Proceedings of the IEEE International Conference on Computer Vision, 945–53.
 Thiery, Alexandre H, Fabian Braeu, Tin A Tun, Tin Aung, and Michael JA Girard. 2022. “Medical Application of Geometric Deep Learning for the Diagnosis of Glaucoma.” arXiv Preprint arXiv:2204.07004.
 Triess, Larissa T, Mariella Dreissig, Christoph B Rist, and J Marius Zöllner. 2021. “A Survey on Deep Domain Adaptation for Lidar Perception.” In 2021 IEEE Intelligent Vehicles Symposium Workshops (IV Workshops), 350–57. IEEE.
 Vinyals, Oriol, Samy Bengio, and Manjunath Kudlur. 2015. “Order Matters: Sequence to Sequence for Sets.” arXiv Preprint arXiv:1511.06391.
 Wang, Weiyue, Ronald Yu, Qiangui Huang, and Ulrich Neumann. 2018. “Sgpn: Similarity Group Proposal Network for 3d Point Cloud Instance Segmentation.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2569–78.
 Wu, Zhirong, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 2015. “3d Shapenets: A Deep Representation for Volumetric Shapes.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1912–20.
 Wu, Zizhao, Ruyang Shou, Yunhai Wang, and Xinguo Liu. 2014. “Interactive Shape Co-Segmentation via Label Propagation.” Computers & Graphics 38: 248–54.
 Yan, Xu. 2019. “Pointnet/Pointnet++ Pytorch.” https://github.com/Yanx27/Pointnet_Pointnet2_pytorch, June.
 Yi, Li, Vladimir G Kim, Duygu Ceylan, I-Chao Shen, Mengyan Yan, Hao Su, Cewu Lu, Qixing Huang, Alla Sheffer, and Leonidas Guibas. 2016. “A Scalable Active Framework for Region Annotation in 3d Shape Collections.” ACM Transactions on Graphics (ToG) 35 (6): 1–12.
 Yi, Li, Hao Su, Xingwen Guo, and Leonidas J Guibas. 2017. “Syncspeccnn: Synchronized Spectral Cnn for 3d Shape Segmentation.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2282–90.
 Yu, Lequan, Xianzhi Li, Chi-Wing Fu, Daniel Cohen-Or, and Pheng-Ann Heng. 2018. “PU-Net: Point Cloud Upsampling Network.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2790–99.
 yunxiaoshi. 2021. “pointnetpytorch.” https://github.com/Yunxiaoshi/PointnetPytorch, June.
 Zhang, Binbin, Shikun Huang, Wen Shen, and Zhihua Wei. 2019. “Explaining the PointNet: What Has Been Learned Inside the PointNet?” In CVPR Workshops, 71–74.
 Zhang, Weichen, Wanli Ouyang, Wen Li, and Dong Xu. 2018. “Collaborative and Adversarial Network for Unsupervised Domain Adaptation.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3801–9.