Abstract

Point clouds are a crucial geometric data type. Due to their irregular format, most researchers transform such data into regular representations before processing, which tends to cause issues. PointNet (2017) is a novel type of neural network that directly consumes point cloud data while respecting input permutation invariance, an important characteristic of point cloud data. The authors provide a unified architecture for applications ranging from object classification and part segmentation to scene semantic parsing, where PointNet shows strong performance in each, on par with or better than the state of the art. Additionally, the authors provide experiments validating PointNet's design choices, and a theoretical analysis for understanding its empirical robustness with respect to input perturbation and corruption.

Note: This is a TeX-to-Markdown Pandoc conversion of a seminar report from the summer semester 2022. The conversion was not 100% successful, but I fixed all links and most formatting. You can find my presentation here.


Introduction

PointNet (Charles R. Qi et al. 2017) explores a deep learning architecture capable of reasoning about 3D point cloud data. Previous architectures (Zhirong Wu et al. 2015; Maturana and Scherer 2015; Charles R. Qi et al. 2016; Su et al. 2015; Fang et al. 2015; Guo, Zou, and Chen 2015) required transforming the data to other representations, typically voxel grids or collections of images (e.g. views from different camera angles), before feeding it to a deep net architecture. These intermediate representations and their transformations might, however, obscure or modify important properties, such as invariance to geometric transformations or invariance to permutations of the input order. Point clouds avoid the combinatorial irregularities and complexities of meshes and of intermediate representations such as voxel grids. To maintain invariance to input order permutations and to certain geometric transformations, it becomes necessary to make certain computations symmetric.

The network takes point clouds as input and outputs class labels, either for the entire input point cloud or per point. Individual points are processed identically and independently in the initial stages, from just their (x, y, z) coordinates, optionally extended by additional dimensions computed from normals and other local or global features, e.g., point density. The key to this approach is a single symmetric function that integrates the local features of the different points into a global feature descriptor. The network effectively learns a set of criteria for interesting or informative points of the point cloud. These criteria are evaluated for all points in the point cloud, and fully connected layers aggregate them into a global set function approximator, which can be used for shape classification or per-point labeling (shape segmentation).
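To make this pipeline concrete, below is a minimal sketch of the vanilla classification network (shared per-point MLP followed by max pooling, without the T-Nets) in PyTorch. The original implementation is in TensorFlow, so the class and variable names here are illustrative; the layer sizes mlp(64, 64), mlp(64, 128, 1024), and the head mlp(512, 256, k) follow Figure 2.

```python
import torch
import torch.nn as nn

class VanillaPointNet(nn.Module):
    """Sketch of the classification network without the two T-Nets.

    Shared per-point MLPs are implemented as 1x1 convolutions, i.e. the
    same small network applied to every point independently.
    """

    def __init__(self, k: int):
        super().__init__()
        sizes = [3, 64, 64, 64, 128, 1024]  # mlp(64, 64) + mlp(64, 128, 1024)
        layers = []
        for c_in, c_out in zip(sizes, sizes[1:]):
            layers += [nn.Conv1d(c_in, c_out, 1), nn.BatchNorm1d(c_out), nn.ReLU()]
        self.point_mlp = nn.Sequential(*layers)
        self.head = nn.Sequential(
            nn.Linear(1024, 512), nn.BatchNorm1d(512), nn.ReLU(),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(),
            nn.Dropout(0.3),  # dropout on the last mlp, as in the paper
            nn.Linear(256, k),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, n, 3); Conv1d expects (batch, channels, n)
        feats = self.point_mlp(points.transpose(1, 2))  # (batch, 1024, n)
        # Max over the point axis: the symmetric, order-invariant aggregation.
        global_feat = feats.max(dim=2).values           # (batch, 1024)
        return self.head(global_feat)                   # (batch, k) class scores
```

Reordering the n input points only permutes the columns of `feats`, which the max reduction absorbs, so the class scores are unaffected by input order.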

Applications of PointNet. The proposed new deep net architecture consumes raw point cloud data directly, without prior voxelization or rendering. The architecture learns both local and global features, applicable and effective for a number of 3d recognition tasks. Figure from (Charles R. Qi et al. 2017).

The paper provides a theoretical analysis and experimental evaluation of the approach. It shows that the network can approximate any continuous set function. The network summarizes an input cloud by a sparse set of key points, which roughly corresponds to the skeleton of objects when visualized (see Sec. 3.3.4). This explains the high robustness against small perturbations of input points and against corruption through insertion of outlier points or deletion of points. When compared on a number of benchmark datasets, ranging from shape classification and part segmentation to scene segmentation, PointNet dominates in speed (see Table IV) and shows strong performance, on par with or better than the state of the art at the time of publication, while providing a unified architecture (see Table II and Table I).

The key contributions of the PointNet paper are as follows:

  • Design of a novel deep net architecture suitable for unordered point sets in 3D;

  • Showing how such a net can be trained to perform 3D shape classification, shape part segmentation and scene semantic parsing tasks;

  • Thorough empirical and theoretical analysis on stability and efficiency of the method;

  • Illustration of the 3D features computed by the selected neurons;

  • Developing intuitive explanations for its performance.

The problem of processing unordered sets with neural nets is a very general and fundamental one, and the transfer of PointNet's key ideas to many other problems and domains explains the 7549 citations at the time of writing. The paper was a trailblazer at its time and is still highly influential today.

Architecture of PointNet. The classification network takes n points as
input, applies input and feature transformations. Local features are aggregated
to global features by max pooling, and reduced to classification scores for k
classes as output. The segmentation network is an extension that concatenates
global and local features and outputs per point scores. “mlp” describes a
multi-layer perceptron. Numbers in brackets are layer sizes. All layers with
ReLU use batchnorm. Dropout layers are used for the last mlp in classification
net. Figure from (Charles R. Qi et al. 2017).

Deep Learning on 3D Data

Volumetric CNNs (Zhirong Wu et al. 2015; Maturana and Scherer 2015; Charles R. Qi et al. 2016) apply conventional 3D convolutional neural networks to voxelized shapes. However, data sparsity and the computational cost of 3D convolution constrain the resolution of volumetric representations.

Multi-view CNNs (Su et al. 2015; Charles R. Qi et al. 2016) first render the 3D shape into multiple 2D images and apply 2D convolutional nets for image classification. Given sufficient computational resources, they achieve dominant performance thanks to well-engineered image CNNs. However, it is difficult to extend image CNNs to other 3D or point-based tasks.

Feature-based DNNs (Fang et al. 2015; Guo, Zou, and Chen 2015) extract traditional shape features and convert the 3D data to a vector before using a fully connected net for shape classification. They appear to be limited by the representational power of the extracted features.

Deep Learning on Unordered Sets

From the data structure point of view, a point cloud is an unordered set of vectors. Most works in deep learning, however, consider regular input structures such as ordered sequences, images, or volumes; unordered point sets are rarely considered. One work (Vinyals, Bengio, and Kudlur 2015) attempts to impose order on unordered input sets via an attention mechanism, but it focuses on generic sets and NLP applications and does not exploit the geometry in the sets.

Work Based on PointNet

There exist a number of works explaining, applying, and building upon PointNet. PointNet's influence can furthermore be seen in the ecosystem of different implementations and visualization tools (charlesq34 2019; aldipiroli 2021; yunxiaoshi 2021; Yan 2019). Several works attempt to explain what PointNet has learned (B. Zhang et al. 2019; Huang et al. 2019), and many apply PointNet to different domains and problems (Thiery et al. 2022; Gutiérrez-Becker and Wachinger 2018; Triess et al. 2021; Liang et al. 2019; W. Zhang et al. 2018; Mrowca et al. 2018).

Derivatives of PointNet

PointNet, in addition to being successful itself, sees successful use as a module, similar to a convolution layer, in more sophisticated neural network architectures. This can best be seen in PointNet++ (Charles Ruizhongtai Qi et al. 2017), which applies PointNet recursively on nested partitions of the input point set. Moreover, an even larger number of architectures are heavily inspired by PointNet or adapt core ideas of its architecture (Jiang, Wu, and Lu 2018; Wang et al. 2018; Yu et al. 2018; Gutiérrez-Becker and Wachinger 2018; Charles R. Qi et al. 2018; Li et al. 2018). This includes architectures both for 3D point cloud data and for traditionally ordered input data structures.

Problem Statement

The authors design a neural network architecture that directly consumes unordered point sets as input. A point cloud is a set of 3D points {Pᵢ | i = 1, …, n}, where each point Pᵢ is a vector of its (x, y, z) coordinates plus additional feature channels such as color, normals, etc. For simplicity and clarity, unless otherwise noted, only the (x, y, z) coordinates are used as a point's feature channels.

Input point clouds for object classification are either directly sampled from a shape's surface or pre-segmented from a scene point cloud. The output of PointNet is k classification scores for the k candidate classes. For semantic part or region segmentation, the input is a single object or a sub-volume of a 3D scene. For n input points, PointNet outputs n × m scores, one per point for each of the m semantic subcategories.

|                            | mean | aero | bag  | cap  | car  | chair | ear phone | guitar | knife | lamp | laptop | motor | mug  | pistol | rocket | skate board | table |
|----------------------------|------|------|------|------|------|-------|-----------|--------|-------|------|--------|-------|------|--------|--------|-------------|-------|
| # shapes                   |      | 2690 | 76   | 55   | 898  | 3758  | 69        | 787    | 392   | 1547 | 451    | 202   | 184  | 283    | 66     | 152         | 5271  |
| Wu (Zizhao Wu et al. 2014) | -    | 63.2 | -    | -    | -    | 73.5  | -         | -      | -     | 74.4 | -      | -     | -    | -      | -      | -           | 74.8  |
| Yi (Yi et al. 2016)        | 81.4 | 81.0 | 78.4 | 77.7 | 75.7 | 87.6  | 61.9      | 92.0   | 85.4  | 82.5 | 95.7   | 70.6  | 91.9 | 85.9   | 53.1   | 69.8        | 75.3  |
| 3DCNN                      | 79.4 | 75.1 | 72.8 | 73.3 | 70.0 | 87.2  | 63.5      | 88.4   | 79.6  | 74.4 | 93.9   | 58.7  | 91.8 | 76.4   | 51.2   | 65.3        | 77.1  |
| PointNet                   | 83.7 | 83.4 | 87.7 | 82.5 | 74.9 | 89.6  | 73.0      | 91.5   | 85.9  | 80.8 | 95.3   | 65.2  | 93.0 | 81.2   | 57.9   | 72.8        | 80.6  |

Table I: Segmentation results on the ShapeNet part dataset. The metric is mIoU (%) on points. As baselines, the authors compare with a 3D fully convolutional network pipeline of their own design as well as two traditional methods (Zizhao Wu et al. 2014) and (Yi et al. 2016). The authors observe that PointNet achieves a new state of the art in mIoU. Table from (Charles R. Qi et al. 2017).

Deep Learning on Point Sets

The architecture of PointNet (Sec. 3.3) is guided by the characteristics of point sets in ℝⁿ (Sec. 2.1).

Properties of Point Sets in ℝⁿ

Unordered

Contrary to most regular data structures, which depend on order, a point cloud is a set of points without a specific order. A network that consumes 3D point sets of size N therefore needs to be invariant to all N! permutations of the input order.
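This requirement is easy to verify for the max-based design sketched in the introduction: a shared per-point map followed by a feature-wise max is unaffected by reordering its input rows. A self-contained numpy sketch with random, purely illustrative weights:

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(1024, 3))   # a point cloud: 1024 points, xyz
W = rng.normal(size=(3, 64))          # a random shared per-point linear map

def global_feature(pts: np.ndarray) -> np.ndarray:
    # shared per-point map with ReLU, then max over points (symmetric)
    return np.maximum(pts @ W, 0.0).max(axis=0)

shuffled = rng.permutation(points, axis=0)  # reorder the points
assert np.allclose(global_feature(points), global_feature(shuffled))
```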

Interaction among points

Points are not isolated: neighboring points form meaningful subsets, giving rise to local and global structures, similar to more regular data structures, albeit on a topological level. Hence, the model needs to be able to recognize spatial relations between points and the importance of individual points.

Invariance under transformations

The important information in point cloud data is the spatial relation between points and their relative location, not their absolute coordinates. The semantics of point cloud data are therefore inherently invariant to geometric transformations such as translation or rotation, and neither should modify point cloud classification or segmentation results.

PointNet Architecture

The architecture of PointNet and its pipeline are explained in Figure 2 and its caption. The three most important parts are: first, the data-dependent canonicalization through two T-Nets, which generate a 3 × 3 and a 64 × 64 (approximately orthogonal) affine transformation matrix applied to each data point individually; second, a local and global information combination structure for segmentation, in which local features are fused with the global features of the point cloud; third, and most importantly, a max pooling layer used as a symmetric function to aggregate information from all points, staying invariant to permutations of the input order.

Effectively, each T-Net is a mini-PointNet with multiple MLP layers and max pooling, taking the points (respectively their intermediate point features) as input and regressing to a 3 × 3 and a 64 × 64 affine transformation matrix for T1 and T2, respectively. A small regularization loss is added to the softmax classification loss to keep the feature transformation matrix close to orthogonal. See Sec. 3.3.2 and the supplementary material for details.
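The regularization term for a predicted feature transform A can be written as L_reg = ‖I − AAᵀ‖²_F. A PyTorch sketch (the function name is ours; the weight 0.001 is the value reported in the paper):

```python
import torch

def feature_transform_regularizer(trans: torch.Tensor) -> torch.Tensor:
    """L_reg = ||I - A A^T||_F^2, averaged over the batch.

    trans: (batch, k, k) matrices predicted by the feature T-Net (k = 64).
    Pushing A A^T towards the identity keeps the transform close to
    orthogonal, so it roughly preserves distances in feature space.
    """
    k = trans.size(-1)
    eye = torch.eye(k, device=trans.device, dtype=trans.dtype)
    diff = eye - torch.bmm(trans, trans.transpose(1, 2))
    return (diff ** 2).sum(dim=(1, 2)).mean()

# combined objective, with the regularization weight from the paper:
# loss = classification_loss + 0.001 * feature_transform_regularizer(feat_trans)
```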

Theoretical Analysis

The authors further show that PointNet can universally approximate continuous set functions. The detailed proof is provided in the supplementary material of (Charles R. Qi et al. 2017). Relevant insights include the following:

  • In the worst case, the network can learn to convert a point cloud to a volumetric representation, though visualization shows that the network learns a much smarter strategy.

  • Intuitively, since the set function is continuous, a small perturbation to the input set should barely affect the function values, e.g., classification or segmentation scores.

  • Because the global feature vector is obtained by taking max over all local feature vectors, a finite subset of the input points, called the critical point set, fully determines the classification result of PointNet.

When combined, this explains the robustness seen w.r.t. point perturbation, corruption and extra noise points (see Sec. 3.3.3 on robustness tests). Intuitively, the network summarizes a shape by a sparse set of key points (see Sec. 3.3.4 for details and visualization).

|                                                  | input  | #views | accuracy avg. class | accuracy overall |
|--------------------------------------------------|--------|--------|---------------------|------------------|
| SPH (Kazhdan, Funkhouser, and Rusinkiewicz 2003) | mesh   | -      | 68.2                | -                |
| 3DShapeNets (Zhirong Wu et al. 2015)             | volume | 1      | 77.3                | 84.7             |
| VoxNet (Maturana and Scherer 2015)               | volume | 12     | 83.0                | 85.9             |
| Subvolume (Charles R. Qi et al. 2016)            | volume | 20     | 86.0                | 89.2             |
| LFD (Zhirong Wu et al. 2015)                     | image  | 10     | 75.5                | -                |
| MVCNN (Su et al. 2015)                           | image  | 80     | 90.1                | -                |
| Custom baseline                                  | point  | -      | 72.6                | 77.4             |
| PointNet                                         | point  | 1      | 86.2                | 89.2             |

Table II: Classification results on ModelNet40. PointNet achieves state-of-the-art among deep nets on 3D input. Table from (Charles R. Qi et al. 2017).

Experiments

Experiments are divided into four parts: standard benchmark comparisons for classification (A) and part segmentation (B), empirical validation of architectural design choices (C), and an analysis of time and space complexity (D).

3D Object Classification

As seen in Sec. 2.3, PointNet is a general set function approximator and can thus be trained for classification. A comparison with the then state of the art on the ModelNet40 (Zhirong Wu et al. 2015) dataset can be found in Table II. Previous methods focused on volumetric and multi-view image representations, while PointNet is the first to work directly on raw point clouds.

1024 points are uniformly sampled on mesh faces and normalized into a unit sphere. Augmentation during training is done by randomly rotating the object around the up-axis and jittering each point's position with zero-mean Gaussian noise with a standard deviation of 0.02.
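A numpy sketch of this augmentation: the rotation about the up (y) axis and the 0.02 jitter standard deviation follow the paper, while clipping the jitter at 0.05 is an assumption borrowed from common reference implementations.

```python
import numpy as np

def augment(points: np.ndarray, sigma: float = 0.02, clip: float = 0.05) -> np.ndarray:
    """Randomly rotate a (n, 3) point cloud about the up axis and jitter it."""
    theta = np.random.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, 0.0, s],
                    [0.0, 1.0, 0.0],
                    [-s, 0.0, c]])  # rotation about the y (up) axis
    jitter = np.clip(sigma * np.random.randn(*points.shape), -clip, clip)
    return points @ rot.T + jitter
```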

The custom baseline uses an MLP on traditional features extracted from the point cloud (point density, D2, shape contour, etc.). The authors attribute the small remaining gap to multi-view based methods (MVCNN (Su et al. 2015)) to the finer geometric details that rendered images can capture.

3D Object Part Segmentation

Part segmentation is a demanding recognition task. Given a 3D scan or mesh, the task is to assign a part category label (e.g. chair leg, cup handle) to each point or face. The authors evaluate performance on the ShapeNet part dataset (Yi et al. 2016), which contains labels for 50 parts in 16 categories, with most categories labeled with two to five parts.

Part segmentation is formulated as a per-point classification problem, with mIoU (mean Intersection over Union) on points as the evaluation metric. By convention for intersection-based metrics, values above 0.5 are seen as accurate classification, though higher is generally better.
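A numpy sketch of the per-shape part mIoU computation (function and argument names are ours). Following the convention described in the paper, a part absent from both ground truth and prediction contributes an IoU of 1:

```python
import numpy as np

def shape_miou(pred: np.ndarray, gt: np.ndarray, part_labels: list) -> float:
    """Mean IoU over the parts of one shape; pred/gt are per-point labels."""
    ious = []
    for part in part_labels:
        p, g = pred == part, gt == part
        union = np.logical_or(p, g).sum()
        if union == 0:
            ious.append(1.0)  # part occurs in neither: counted as 1
        else:
            ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))
```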

The authors compare PointNet with two traditional methods (Zizhao Wu et al. 2014; Yi et al. 2016) that take advantage of point-wise geometry features, as well as a custom 3D CNN baseline. See the supplementary material for details on the modifications and the architecture of the 3D CNN. Per-category and mean IoU (%) scores can be found in Table I. The authors observe that PointNet beats the selected reference methods in most categories.

Architecture Design Analysis

This section summarizes the empirical validation of PointNet's architectural design choices.

Comparison with Alternative Order-invariant Methods

Experiments on alternative methods for order invariance, evaluated on the ModelNet40 shape classification problem, showed that max pooling empirically achieves the best results (accuracy of 87.1%) as the symmetric function. The evaluated alternatives include avg-pooling (83.8%), attention sum (83.0%), LSTMs (78.5%), and MLPs on sorted (45.0%) and unsorted (24.2%) input (Vinyals, Bengio, and Kudlur 2015). For more information, see Section 5.2 of (Charles R. Qi et al. 2017).
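Because the aggregation sits behind a shared per-point feature extractor, the pooling alternatives are drop-in replacements for the reduction over the point axis. A minimal PyTorch sketch (accuracy numbers from the experiment above; everything else illustrative):

```python
import torch

# Symmetric reductions over the point axis of a (batch, feat, n) tensor;
# the experiment swaps these while keeping the rest of the network fixed.
REDUCTIONS = {
    "max": lambda f: f.max(dim=2).values,  # best: 87.1% overall accuracy
    "avg": lambda f: f.mean(dim=2),        # 83.8%
}

feats = torch.randn(4, 1024, 500)          # illustrative per-point features
global_max = REDUCTIONS["max"](feats)      # (4, 1024)
global_avg = REDUCTIONS["avg"](feats)      # (4, 1024)
```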

| Transform              | accuracy |
|------------------------|----------|
| none                   | 87.1     |
| input (3×3)            | 87.9     |
| feature (64×64)        | 86.9     |
| feature (64×64) + reg. | 87.4     |
| both                   | 89.2     |

Table III: Effects of input feature transforms. Based on overall classification accuracy on the ModelNet40 (Zhirong Wu et al. 2015) test set. Table from (Charles R. Qi et al. 2017).

Effectiveness of Input and Feature Transformations

Interestingly, as can be seen in Table III, even the most basic PointNet architecture without T-Net transforms gives decent results. Adding both input and feature transformations, with regularization towards an orthogonal transformation matrix (see (Charles R. Qi et al. 2017) and the supplementary material for details), provides an accuracy improvement of 2.1%.

Robustness test visualization. The measured accuracy is overall classification accuracy on the ModelNet40 test set. Left: deletion of points from the 1024 input points, using random and furthest-point sampling strategies. Middle: insertion of uniformly scattered outliers within the unit sphere of the canonicalized shape. Right: perturbation of each point independently with zero-mean Gaussian noise. Figure from (Charles R. Qi et al. 2017).

Robustness Test

PointNet, while simple and effective, is robust against both point deletion and added noise, as can be seen by the barely decreased accuracy on ModelNet40 under various corruptions in Figure 3. Training with additional point density information (XYZ + density) increases accuracy for inputs with less than 30% outliers; even without it, the net reaches 80% accuracy when 20% of the points are outliers. The reasons for losing only a small amount of accuracy (2.4% with furthest sampling, 3.8% with random sampling, at 50% of points missing) are explored in the next part, Sec. 3.3.4. Additional experiments on partial data, via simulated Kinect scans compared with complete ShapeNet CAD models, show that although partial data is rather challenging, predictions remain robust and reasonable (for details see (Charles R. Qi et al. 2017)).
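For reference, furthest point sampling, named in the deletion experiment, can be sketched as a greedy O(n·k) procedure; this is an illustrative version, not the authors' implementation:

```python
import numpy as np

def furthest_point_sampling(points: np.ndarray, k: int) -> np.ndarray:
    """Greedily pick k points, each maximizing distance to those already picked.

    points: (n, 3) array; returns the indices of the k sampled points.
    """
    n = points.shape[0]
    chosen = np.empty(k, dtype=np.int64)
    chosen[0] = np.random.randint(n)  # arbitrary seed point
    dist = np.linalg.norm(points - points[chosen[0]], axis=1)
    for i in range(1, k):
        chosen[i] = int(dist.argmax())  # farthest from the current set
        dist = np.minimum(dist, np.linalg.norm(points - points[chosen[i]], axis=1))
    return chosen
```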

Critical points and upper bound shape. The critical point set fully determines the global feature description of a given shape. Any point cloud between the critical point set and the upper bound shape yields exactly the same feature description. All figures are color-coded to convey depth information. Figure from (Charles R. Qi et al. 2017).

Critical points and upper bound shape

As previously noted in Sec. 2.3, the network summarizes a shape by a sparse set of key points, called the critical point set. This is due to the continuity of the local feature functions and the propensity of max pooling to ignore local feature values lower than those contributed by the critical point set. Adding points whose local feature values stay below those of the critical point set leaves the global feature unchanged; the largest such point cloud is the upper bound shape. Examples of both sets and the original shape are visualized in Figure 4. As can be seen in the visualization, the critical point set roughly corresponds to the skeleton of the objects.
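Extracting the critical point set from a trained network is straightforward: a point is critical iff it attains the maximum in at least one global feature dimension. A hedged PyTorch sketch (the function name is ours; feature shapes assumed to match the vanilla model sketched in the introduction):

```python
import torch

def critical_point_indices(local_feats: torch.Tensor) -> torch.Tensor:
    """Indices of the critical point set for one shape.

    local_feats: (n_points, feat_dim) per-point features right before the
    max pooling layer (e.g. 1024-dim in the classification network). For each
    feature dimension, the point attaining the maximum is the one that
    actually contributes to the pooled global feature.
    """
    winners = local_feats.argmax(dim=0)  # winning point index per feature dim
    return torch.unique(winners)         # distinct contributing points
```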

|                                       | #params | FLOPs/sample |
|---------------------------------------|---------|--------------|
| PointNet (vanilla)                    | 0.8M    | 148M         |
| PointNet                              | 3.5M    | 440M         |
| Subvolume (Charles R. Qi et al. 2016) | 16.6M   | 3633M        |
| MVCNN (Su et al. 2015)                | 60.0M   | 62057M       |

Table IV: Time and space complexity of different deep learning architectures for 3D data classification. PointNet (vanilla) is the classification PointNet without the input and feature T-Net transformation networks. FLOPs stands for floating-point operations; “M” stands for a million units. Both Subvolume and MVCNN pool input data from multiple rotations or views, without which their performance is much worse. Table from (Charles R. Qi et al. 2017).

Time and Space Complexity Analysis

PointNet, due to its simple architecture and few parameters, can process more than one million points per second for point cloud classification (around 1K objects/second) or semantic segmentation (around 2 rooms/second) with a 1080X GPU on TensorFlow, which enables usage in real-time applications. Previous methods required substantially more computation (see Table IV), both for inference and especially for training, due to their higher number of parameters. Trading a few percent of accuracy for a computation speedup of about 3× thus makes PointNet (vanilla) a strong candidate for complex real-time and real-world applications.

Conclusion and Outlook

PointNet is a deep neural network architecture that directly consumes point cloud data. This enables a unified approach to a number of 3D recognition tasks, including object classification, part segmentation, and semantic segmentation. On each of these tasks, results on par with or better than the state of the art are obtained on standard benchmarks. Additionally, the theoretical analysis and the experimental validation of architectural design choices provide a deeper understanding of the approach.

As a universal approximator of continuous set functions, PointNet sees usage as a module in other architectures for 3D point cloud data. Additionally, its core ideas have been successfully adapted by many architectures, not just for 3D point cloud data but also for traditionally ordered input data structures.


Sources

  • Fang, Yi, Jin Xie, Guoxian Dai, Meng Wang, Fan Zhu, Tiantian Xu, and Edward Wong. 2015. “3d Deep Shape Descriptor.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2319–28.
  • Guo, Kan, Dongqing Zou, and Xiaowu Chen. 2015. “3d Mesh Labeling via Deep Convolutional Neural Networks.” ACM Transactions on Graphics (TOG) 35 (1): 1–12.
  • Gutiérrez-Becker, Benjamı́n, and Christian Wachinger. 2018. “Deep Multi-Structural Shape Analysis: Application to Neuroanatomy.” In International Conference on Medical Image Computing and Computer-Assisted Intervention, 523–31. Springer.
  • Huang, Shikun, Binbin Zhang, Wen Shen, and Zhihua Wei. 2019. “A CLAIM Approach to Understanding the PointNet.” In Proceedings of the 2019 2nd International Conference on Algorithms, Computing and Artificial Intelligence, 97–103.
  • Jiang, Mingyang, Yiran Wu, and Cewu Lu. 2018. “PointSIFT: A Sift-Like Network Module for 3d Point Cloud Semantic Segmentation.” arXiv Preprint arXiv:1807.00652.
  • Kazhdan, Michael, Thomas Funkhouser, and Szymon Rusinkiewicz. 2003. “Rotation Invariant Spherical Harmonic Representation of 3d Shape Descriptors.” In Symposium on Geometry Processing, 6:156–64.
  • Li, Yangyan, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. 2018. “Pointcnn: Convolution on x-Transformed Points.” Advances in Neural Information Processing Systems 31.
  • Liang, Ming, Bin Yang, Yun Chen, Rui Hu, and Raquel Urtasun. 2019. “Multi-Task Multi-Sensor Fusion for 3d Object Detection.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7345–53.
  • Maturana, Daniel, and Sebastian Scherer. 2015. “Voxnet: A 3d Convolutional Neural Network for Real-Time Object Recognition.” In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 922–28. IEEE.
  • Mrowca, Damian, Chengxu Zhuang, Elias Wang, Nick Haber, Li Fei-Fei, Josh Tenenbaum, and Daniel L Yamins. 2018. “Flexible Neural Representation for Physics Prediction.” Advances in Neural Information Processing Systems 31.
  • Qi, Charles R, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. 2018. “Frustum Pointnets for 3d Object Detection from Rgb-d Data.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 918–27.
  • Qi, Charles R, Hao Su, Kaichun Mo, and Leonidas J Guibas. 2017. “Pointnet: Deep Learning on Point Sets for 3d Classification and Segmentation.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 652–60.
  • Qi, Charles R, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas J Guibas. 2016. “Volumetric and Multi-View Cnns for Object Classification on 3d Data.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5648–56.
  • Qi, Charles Ruizhongtai, Li Yi, Hao Su, and Leonidas J Guibas. 2017. “Pointnet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space.” Advances in Neural Information Processing Systems 30.
  • Su, Hang, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. 2015. “Multi-View Convolutional Neural Networks for 3d Shape Recognition.” In Proceedings of the IEEE International Conference on Computer Vision, 945–53.
  • Thiery, Alexandre H, Fabian Braeu, Tin A Tun, Tin Aung, and Michael JA Girard. 2022. “Medical Application of Geometric Deep Learning for the Diagnosis of Glaucoma.” arXiv Preprint arXiv:2204.07004.
  • Triess, Larissa T, Mariella Dreissig, Christoph B Rist, and J Marius Zöllner. 2021. “A Survey on Deep Domain Adaptation for Lidar Perception.” In 2021 IEEE Intelligent Vehicles Symposium Workshops (IV Workshops), 350–57. IEEE.
  • Vinyals, Oriol, Samy Bengio, and Manjunath Kudlur. 2015. “Order Matters: Sequence to Sequence for Sets.” arXiv Preprint arXiv:1511.06391.
  • Wang, Weiyue, Ronald Yu, Qiangui Huang, and Ulrich Neumann. 2018. “Sgpn: Similarity Group Proposal Network for 3d Point Cloud Instance Segmentation.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2569–78.
  • Wu, Zhirong, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 2015. “3d Shapenets: A Deep Representation for Volumetric Shapes.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1912–20.
  • Wu, Zizhao, Ruyang Shou, Yunhai Wang, and Xinguo Liu. 2014. “Interactive Shape Co-Segmentation via Label Propagation.” Computers & Graphics 38: 248–54.
  • Yi, Li, Vladimir G Kim, Duygu Ceylan, I-Chao Shen, Mengyan Yan, Hao Su, Cewu Lu, Qixing Huang, Alla Sheffer, and Leonidas Guibas. 2016. “A Scalable Active Framework for Region Annotation in 3d Shape Collections.” ACM Transactions on Graphics (ToG) 35 (6): 1–12.
  • Yi, Li, Hao Su, Xingwen Guo, and Leonidas J Guibas. 2017. “Syncspeccnn: Synchronized Spectral Cnn for 3d Shape Segmentation.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2282–90.
  • Yu, Lequan, Xianzhi Li, Chi-Wing Fu, Daniel Cohen-Or, and Pheng-Ann Heng. 2018. “Pu-Net: Point Cloud Upsampling Network.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2790–99.
  • Zhang, Binbin, Shikun Huang, Wen Shen, and Zhihua Wei. 2019. “Explaining the PointNet: What Has Been Learned Inside the PointNet?” In CVPR Workshops, 71–74.
  • Zhang, Weichen, Wanli Ouyang, Wen Li, and Dong Xu. 2018. “Collaborative and Adversarial Network for Unsupervised Domain Adaptation.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3801–9.