Notes on the 2020 한국인공지능학회 동계강좌 – 5. Dr. 이주호 (AITrics), Set-input Neural Networks and Amortized Clustering

This entry is part 5 of 9 in the series 2020 한국인공지능학회 동계강좌

 

I signed up for the 2020 한국인공지능학회 동계강좌 and attended the three-day program from January 8 to 10, 2020. Nine speakers gave lectures in total.

I originally planned to cover everything in a single post, but since each topic contains quite a lot of material, I decided to write one post per lecture as a series. This fifth post covers the lecture “Set-input Neural Networks and Amortized Clustering” by Dr. 이주호 of AITrics.

  1. Introduction
    1. Common assumption
      • Typical machine learning algorithms act on fixed-dimensional data
      • How about RNNs?
        • They handle variable-length inputs, but the order is fixed (order matters)
        • An RNN is essentially a repeated application of a neural network to fixed-length inputs
    2. Challenging tasks
      • Point-cloud classification
      • Multiple instance learning
      • Few-shot image classification
    3. Set-input problems
      • Inputs
        • variable length
        • order does not matter
        • In other words, inputs are sets X = \{ x_1, ..., x_n \}
      • Today’s topic
        • y = f(X;\theta)
        • \theta^* = \arg\min_{\theta} E_{p(X,y)}[L(f(X;\theta),y)]
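      • A toy illustration (my own example, not from the lecture) of what a batch of set inputs looks like in practice: set sizes vary, so one common trick is to pad to the largest set and carry a mask, and any set-input model f(X;\theta) should ignore both the padding and the ordering of the real elements

        import torch

        # Three "sets" of 3-d points with different sizes.
        sets = [torch.randn(5, 3), torch.randn(2, 3), torch.randn(9, 3)]
        n_max = max(s.shape[0] for s in sets)

        # Pad to the largest set and keep a boolean mask of the real elements.
        X = torch.zeros(len(sets), n_max, 3)
        mask = torch.zeros(len(sets), n_max, dtype=torch.bool)
        for i, s in enumerate(sets):
            X[i, :s.shape[0]] = s
            mask[i, :s.shape[0]] = True

        # A set-input model f(X; theta) should use the mask so that neither the
        # padded slots nor the order of the real elements affects the output.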
  2. DeepSets
    1. Permutation invariance and equivariance
      • A function f: 2^{\mathcal{X}} \rightarrow \mathcal{Y} is permutation invariant if, for any permutation \pi,
        f(x_1, ..., x_n) = f(x_{\pi(1)}, ..., x_{\pi(n)})
      • A function f: \mathcal{X}^n \rightarrow \mathcal{Y}^n, written elementwise as f(x_1, ..., x_n) = [f_1(x_1, ..., x_n), ..., f_n(x_1, ..., x_n)], is permutation equivariant if
        f(x_{\pi(1)}, ..., x_{\pi(n)}) = [f_{\pi(1)}(x_1, ..., x_n), ..., f_{\pi(n)}(x_1, ..., x_n)]
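      • A quick sanity check of the two definitions (my own PyTorch demo, not from the slides): sum pooling is permutation invariant, while an elementwise map is permutation equivariant

        import torch

        x = torch.randn(5, 3)          # a set of 5 elements, each 3-dimensional
        perm = torch.randperm(5)

        # Permutation INVARIANT: sum pooling gives the same output for any ordering.
        f_inv = lambda s: s.sum(dim=0)
        print(torch.allclose(f_inv(x), f_inv(x[perm])))   # True

        # Permutation EQUIVARIANT: an elementwise map permutes its outputs
        # exactly as the inputs are permuted.
        g = torch.nn.Linear(3, 4)
        print(torch.allclose(g(x)[perm], g(x[perm])))     # True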
    2. DeepSets
      • Zaheer et al, Deep Sets, NIPS 2017
      • Wagstaff et al, On the Limitations of Representing Functions on Sets, arXiv 2019
      • \rho and \phi : any feedforward neural networks
        f(X) = \rho(\sum_{x \in X} \phi(x))
      • Are DeepSets expressive enough?
        • Let f: 2^{\mathcal{X}} \rightarrow \mathcal{Y} be a function on sets drawn from a countable domain \mathcal{X}. Then f is permutation invariant if and only if it is sum-decomposable [Zaheer et al, 2017]
        • If \mathcal{X} is uncountable, there exist permutation-invariant functions 2^{\mathcal{X}} \rightarrow R that are not sum-decomposable [Wagstaff et al, 2019]
        • Let f: R^{\leq n} \rightarrow R be a continuous function. Then f is permutation invariant if and only if it is sum-decomposable via a latent space of dimension at least n, i.e. with \phi(\cdot) \in R^n [Wagstaff et al, 2019]
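      • To make the sum-decomposition f(X) = \rho(\sum_{x \in X} \phi(x)) concrete, here is a minimal PyTorch sketch of a DeepSets-style model (the layer sizes and the classification head are my own assumptions, not from the lecture)

        import torch
        import torch.nn as nn

        class DeepSets(nn.Module):
            """f(X) = rho(sum_{x in X} phi(x)); phi and rho are plain MLPs."""
            def __init__(self, in_dim, hid_dim, out_dim):
                super().__init__()
                self.phi = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU(),
                                         nn.Linear(hid_dim, hid_dim))
                self.rho = nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.ReLU(),
                                         nn.Linear(hid_dim, out_dim))

            def forward(self, X):                # X: (batch, n, in_dim); n may vary across batches
                pooled = self.phi(X).sum(dim=1)  # sum over set elements -> permutation invariant
                return self.rho(pooled)

        model = DeepSets(in_dim=3, hid_dim=64, out_dim=10)  # e.g. point-cloud classification logits
        X = torch.randn(8, 100, 3)                          # 8 point clouds with 100 points each
        print(model(X).shape)                               # torch.Size([8, 10])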
    3. Examples for DeepSet-like architectures
      1. Prototypical networks
        • Snell et al, Prototypical Networks for Few-shot Learning, NIPS 2017
      2. Neural statistician
        • Edwards and Storkey, Towards a Neural Statistician, ICLR 2017
      3. Neural process
        • Garnelo et al, Neural Processes, ICML 2018
          • Can we learn everything from data, without assuming Gaussianity?
      4. PointFlow
        • Yang et al, PointFlow: 3D Point Cloud Generation with Continuous Normalizing Flows, ICCV 2019
    4. Limitations of DeepSets
      1. Sum-decomposition does not take into account interactions between the elements of a set
      2. One can easily think of problems where modeling interactions between elements is crucial (e.g. clustering)
  3. Amortized clustering
    1. Clustering methods
      • K-means clustering, spectral clustering, hierarchical clustering
      • Soft K-means, Mixture of Gaussians
      • Randomly initialize and iteratively refine cluster assignments
      • Clustering with mixture of Gaussians
        • Expectation-maximization (EM) algorithm to maximize the log-likelihood
      • What if we have a batch of datasets to cluster?
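      • For reference, a minimal EM sketch for an isotropic mixture of Gaussians (my own simplification of the textbook algorithm); it has to be re-run from scratch for every new dataset, which is exactly what amortized clustering tries to avoid

        import math
        import torch

        def em_mog(X, K, num_iters=50, eps=1e-6):
            """Plain EM for an isotropic Gaussian mixture; X: (n, d)."""
            n, d = X.shape
            pi = torch.full((K,), 1.0 / K)                 # mixing weights
            mu = X[torch.randperm(n)[:K]].clone()          # random init of cluster means
            var = torch.ones(K)                            # per-cluster isotropic variance
            for _ in range(num_iters):
                # E-step: responsibilities r[i, k] ∝ pi_k N(x_i; mu_k, var_k I)
                d2 = torch.cdist(X, mu).pow(2)             # squared distances, shape (n, K)
                log_r = torch.log(pi) - 0.5 * (d * torch.log(2 * math.pi * var) + d2 / var)
                r = torch.softmax(log_r, dim=1)
                # M-step: re-estimate parameters from the responsibilities
                Nk = r.sum(dim=0) + eps
                pi = Nk / n
                mu = (r.t() @ X) / Nk[:, None]
                var = (r * torch.cdist(X, mu).pow(2)).sum(dim=0) / (d * Nk) + eps
            return pi, mu, var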
    2. Clustering as a set-input problem
      • Input : X = \{ x_1, ..., x_n \}
      • Output : f(X) = (\pi, \mu, \sigma )
    3. Amortized inference in deep learning context
      • Express some non-trivial inference procedures as neural network evaluations. Knowledge from previous inferences is distilled into the parameters of the neural networks
      • Old-school variational inference
        • q(z_i|x_i) is optimized separately for each x_i, i.e. the variational parameters (\mu_i, \sigma_i) are per data point
          \prod^n_{i=1} q(z_i | x_i; \phi_i) = \prod^n_{i=1} N(z_i;\mu_i, \sigma^2_i)
      • Amortized variational inference
        • q(z_i | x_i ; \phi) = N(z_i; \mu_{\phi}(x_i), \sigma^2_{\phi}(x_i))
        • the inference procedure (approximating p(z_i | x_i;\theta) with q(z_i | x_i;\phi)) is expressed with the neural network evaluations ( \mu_{\phi}(\cdot), \sigma_{\phi}(\cdot) )
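      • A minimal sketch of the amortized case (a VAE-style encoder; the architecture below is my own assumption): a single network shared across all data points outputs \mu_{\phi}(x) and \sigma_{\phi}(x), so inference for a new x is just a forward pass instead of a per-point optimization

        import torch
        import torch.nn as nn

        class AmortizedEncoder(nn.Module):
            """q(z|x; phi) = N(z; mu_phi(x), sigma_phi(x)^2), one shared network for all x."""
            def __init__(self, x_dim, z_dim, hid_dim=64):
                super().__init__()
                self.body = nn.Sequential(nn.Linear(x_dim, hid_dim), nn.ReLU())
                self.mu = nn.Linear(hid_dim, z_dim)
                self.log_sigma = nn.Linear(hid_dim, z_dim)

            def forward(self, x):
                h = self.body(x)
                return self.mu(h), self.log_sigma(h).exp()

        enc = AmortizedEncoder(x_dim=5, z_dim=2)
        mu, sigma = enc(torch.randn(10, 5))   # 10 data points -> 10 approximate posteriors in one pass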
    4. Amortized clustering
      • Reuse previous inference (clustering) results to accelerate the inference (clustering) of a new dataset
    5. Solving amortized clustering for mixture of Gaussians
      • Build a permutation-invariant f(\cdot;\phi), e.g. with a DeepSets architecture
      • DeepSets does not work well here
        • Interactions between elements are crucial for clustering
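      • As a concrete picture of the setup (my own sketch, not the lecture's code), a DeepSets-style network mapping a whole dataset X to mixture parameters (\pi, \mu, \sigma) for a fixed number of clusters K; this is the kind of baseline that underperforms because the pooling ignores interactions between elements

        import torch
        import torch.nn as nn

        class DeepSetsMoG(nn.Module):
            """Amortized clustering baseline: f(X) = (pi, mu, sigma) for K Gaussian clusters."""
            def __init__(self, x_dim, K, hid_dim=128):
                super().__init__()
                self.K, self.x_dim = K, x_dim
                self.phi = nn.Sequential(nn.Linear(x_dim, hid_dim), nn.ReLU(),
                                         nn.Linear(hid_dim, hid_dim))
                self.rho = nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.ReLU(),
                                         nn.Linear(hid_dim, K * (1 + 2 * x_dim)))

            def forward(self, X):                              # X: (batch, n, x_dim)
                out = self.rho(self.phi(X).mean(dim=1))        # permutation-invariant pooling
                logits, mu, log_sigma = out.split(
                    [self.K, self.K * self.x_dim, self.K * self.x_dim], dim=-1)
                pi = logits.softmax(dim=-1)                           # mixing weights
                mu = mu.view(-1, self.K, self.x_dim)                  # cluster means
                sigma = log_sigma.view(-1, self.K, self.x_dim).exp()  # cluster scales
                return pi, mu, sigma

        # Training would maximize the mixture-of-Gaussians log-likelihood of each X
        # under its predicted (pi, mu, sigma), averaged over a batch of datasets.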
  4. Set Transformer
    1. Attention mechanism
      • Vaswani et al, Attention Is All You Need, NIPS 2017
      • Dot-product attention mechanism
        • An easy and efficient way to compute interactions between two sets X and Y
          O = Att(X,Y) = softmax(\dfrac{QK^T}{\sqrt{d}})V, where the queries Q are computed from X and the keys/values K, V are computed from Y
      • Multihead QKV attention
        • Concatenate the outputs of multiple attention heads with different projection weights to get richer, more robust representations
      • Self-attention
        • A set attending to itself (Q, K, V all come from the same set)
        • Efficiently encodes pairwise interactions between the elements of a set
        • Self-attention is permutation equivariant
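      • A quick check (my own demo, not from the slides) that self-attention is permutation equivariant, using PyTorch's built-in multihead attention

        import torch
        import torch.nn as nn

        attn = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
        attn.eval()                               # make the check deterministic

        X = torch.randn(1, 7, 16)                 # a set of 7 elements
        perm = torch.randperm(7)
        with torch.no_grad():
            out, _ = attn(X, X, X)                # self-attention: Q = K = V = X
            out_p, _ = attn(X[:, perm], X[:, perm], X[:, perm])
        print(torch.allclose(out[:, perm], out_p, atol=1e-6))   # True: outputs permute with inputs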
    2. Set Transformer
      • Lee et al, Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks, ICML 2019
      • Permutation invariant neural network based on self-attention
      • Composed of a variant of the transformer layer [Vaswani et al., 2017] that performs a permutation equivariant operation on sets
      • Building blocks
        • Multihead attention block (MAB)
        • Self-attention block (SAB)
        • Induced self-attention block (ISAB) : reduces time complexity from O(n^2) to O(nm) using m inducing points
        • Encoder
          1. Roughly corresponds to \phi(x) in DeepSets; transforms the elements of a set into features in a latent space
          2. Composed of a stack of ISABs
          3. Efficient computation due to inducing points
        • Decoder
          1. Pooling by multihead attention (PMA) : instead of a simple sum/max/min aggregation, use multihead attention with a learnable seed vector to aggregate features into a single vector
          2. Set-input problems often require producing multiple dependent outputs (e.g. amortized clustering); in such cases, self-attention is applied among the pooled outputs to model their dependencies
      • Properties
        • A Set Transformer is at least as expressive as DeepSets, since it includes DeepSets as a special case
        • Nothing further is known about universality (but it can represent any set-input function that DeepSets can represent); a rough sketch of the building blocks follows below
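      • A simplified PyTorch sketch of the building blocks (my own rendering of the block definitions; layer sizes, the number of inducing points, and the omission of the initial linear embedding are my simplifications)

        import torch
        import torch.nn as nn

        class MAB(nn.Module):
            """Multihead Attention Block: X attends to Y, followed by a residual feed-forward layer."""
            def __init__(self, dim, num_heads=4):
                super().__init__()
                self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
                self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
                self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

            def forward(self, X, Y):
                H = self.ln1(X + self.attn(X, Y, Y)[0])
                return self.ln2(H + self.ff(H))

        class ISAB(nn.Module):
            """Induced Self-Attention Block: O(n*m) cost via m learnable inducing points."""
            def __init__(self, dim, num_inds=16, num_heads=4):
                super().__init__()
                self.I = nn.Parameter(torch.randn(1, num_inds, dim))
                self.mab1, self.mab2 = MAB(dim, num_heads), MAB(dim, num_heads)

            def forward(self, X):
                H = self.mab1(self.I.expand(X.size(0), -1, -1), X)  # inducing points attend to X
                return self.mab2(X, H)                              # X attends back to the summary

        class PMA(nn.Module):
            """Pooling by Multihead Attention: learnable seed vectors aggregate the set."""
            def __init__(self, dim, num_seeds=1, num_heads=4):
                super().__init__()
                self.S = nn.Parameter(torch.randn(1, num_seeds, dim))
                self.mab = MAB(dim, num_heads)

            def forward(self, Z):
                return self.mab(self.S.expand(Z.size(0), -1, -1), Z)

        # Encoder (stack of ISABs) + decoder (PMA): a Set-Transformer-style model.
        enc = nn.Sequential(ISAB(32), ISAB(32))
        dec = PMA(32, num_seeds=1)
        X = torch.randn(4, 100, 32)          # 4 sets of 100 elements each
        print(dec(enc(X)).shape)             # torch.Size([4, 1, 32])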
    3. Amortized clustering revisited
      • The Set Transformer solves amortized clustering accurately
    4. Applications
      • Unique character counting
      • Anomaly Detection
    5. More complex clustering problems
      • Lee et al, Deep Amortized Clustering, arXiv 2019
        • Handling variable number of clusters
        • Handling non-trivial cluster shapes

 

