2020 Korean Artificial Intelligence Association Winter Course Notes – 9. Prof. Seung-won Hwang (Yonsei University), Knowledge in Neural NLP

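I signed up for the 2020 Korean Artificial Intelligence Association winter course and attended for three days, January 8 to 10, 2020. Nine speakers presented in total; the program schedule is as follows.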

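I had planned to write everything up as a single post, but each topic turned out to have a lot of material, so I am writing it as a series with one lecture per post. This ninth and final post covers the talk “Knowledge in Neural NLP” by Prof. Seung-won Hwang of Yonsei University.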


  1. Basics of NLP
    1. Neural NLP in one hour (This section is based on the lecture below)
      • http://videolectures.net/DLRLsummerschool2018_neubig_language_understanding/
      • Main categories of NLP tasks
        • Language modeling : P(text)
          • “Does this sound like good English/French/…?”
        • Text classification : P(label | text)
          • “Is this review a positive review?”
          • “What topic does this article belong to?”
        • Sequence transduction : P(text | text)
          • “How do you say this in Japanese?”
          • “How do you respond to this comment?”
        • Language Analysis : P(labels/tree/graph | text)
          • “Is this word a person, place or thing?”
          • “What is the syntactic structure of this sentence?”
          • “What is the latent meaning of this sentence?”
      • Language Model
        • Phenomena to handle
          • Morphology / Syntax / Semantics / Discourse / Pragmatics / Multilinguality, …
          • Neural networks give us a flexible tool to handle these
      • Sentence classification
        • Sentiment Analysis : very good / good / neutral / bad / very bad
          • I hate this movie => “very bad”
          • I love this movie => “very good”
        • A first try : Bag of Words (BoW)
          • Each word has its own vector of 5 scores, one per label [very good, good, neutral, bad, very bad]
          • But BoW cannot capture combination features, because the word scores are simply summed
            • I don’t love this movie.
          • Continuous Bag of Words (CBoW)
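The BoW scoring idea above can be sketched in a few lines of Python. This is only a minimal illustration, not the lecture's implementation: the score vectors in `WORD_SCORES` are made-up values (a real model would learn them), and `bow_predict` is a hypothetical helper name.

```python
LABELS = ["very good", "good", "neutral", "bad", "very bad"]

# Hypothetical, hand-picked score vectors (one score per label).
# In a real BoW classifier these values are learned from data.
WORD_SCORES = {
    "love": [2.0, 1.0, 0.0, -1.0, -2.0],
    "hate": [-2.0, -1.0, 0.0, 1.0, 2.0],
    "don't": [0.0, 0.0, 0.0, 0.5, 0.0],
}

def bow_predict(sentence):
    """Sum the per-word score vectors and return the highest-scoring label."""
    totals = [0.0] * len(LABELS)
    for word in sentence.lower().split():
        # Words without an entry contribute nothing to the totals.
        for k, score in enumerate(WORD_SCORES.get(word, [0.0] * len(LABELS))):
            totals[k] += score
    best = max(range(len(LABELS)), key=lambda k: totals[k])
    return LABELS[best]
```

With these toy scores, "I don't love this movie" is still labeled "very good": the scores for "don't" and "love" are added independently, so the negation never interacts with the verb.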
        • With n-grams
          • Allow us to capture combination features in a simple way
          • ex) “don’t love”, “not the best”
          • Problem
            • Leads to sparsity : many of the n-grams will never be seen in a training corpus
            • No sharing between similar words / n-grams
          • books.google.com/ngrams
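As a rough sketch of how n-gram features expose combinations such as "don't love", the snippet below extracts all unigrams and bigrams from a whitespace-tokenized sentence. The function name `ngram_features` and the whitespace tokenization are assumptions made for illustration.

```python
def ngram_features(sentence, n_max=2):
    """Return all n-grams of length 1..n_max as space-joined strings."""
    tokens = sentence.lower().split()
    feats = []
    for n in range(1, n_max + 1):
        # Slide a window of n tokens across the sentence.
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats
```

The sparsity problem follows directly: with a vocabulary of V words there are up to V² possible bigrams, and most of them never occur in any given training corpus.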
        • Other approaches
          • Time Delay Neural Networks
          • CNN
            • Generally 1D convolution \approx Time Delay Neural Network (TDNN)
            • CNNs are great short-distance feature extractors
            • But they don’t have a holistic view of the sentence
        • RNN
          • Can remember long-distance dependencies
          • Weakness
            • Indirect passing of information => Made better by LSTM/GRU but not perfect
            • Can be slow
        • Count-based Model
          • Count up the frequency and divide
            P_{ML}(x_i|x_{i-n+1}, ..., x_{i-1}) := \dfrac{c(x_{i-n+1}, ..., x_i)}{c(x_{i-n+1}, ..., x_{i-1})}
          • Add smoothing, to deal with zero counts
            P(x_i|x_{i-n+1}, ..., x_{i-1}) := \lambda P_{ML}(x_i | x_{i-n+1}, ..., x_{i-1}) + (1-\lambda)P(x_i|x_{i-n+2}, ..., x_{i-1})
          • Problems
            • Cannot share strength among similar words
            • Cannot condition on context with intervening words
            • Cannot handle long-distance dependencies
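The count-and-divide estimate and its interpolated version can be sketched for the bigram case (n = 2) as follows. This is an illustrative toy under stated assumptions: the function names, the `<s>`/`</s>` boundary tokens, the whitespace tokenization, and the default `lam=0.8` are all choices made here, not taken from the lecture.

```python
from collections import Counter

def train_counts(corpus):
    """Count unigrams and bigrams over sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def p_ml(word, prev, unigrams, bigrams):
    """Maximum-likelihood bigram estimate: c(prev, word) / c(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

def p_interp(word, prev, unigrams, bigrams, lam=0.8):
    """Interpolate the bigram estimate with the lower-order unigram estimate."""
    p_uni = unigrams[word] / sum(unigrams.values())
    return lam * p_ml(word, prev, unigrams, bigrams) + (1 - lam) * p_uni
```

A bigram that never occurs in training gets probability zero under `p_ml`, while `p_interp` still assigns it some mass through the unigram term.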
        • Featurized Models
          • Calculate features of the context
          • Based on the features, calculate probabilities
        • Linear Models can’t learn feature combinations
        • Neural Language Models
          • Strength
            • Similar output words get similar rows in the softmax matrix
            • Similar contexts get similar hidden states
            • Word embeddings : Similar input words get similar vectors
      • Sequence Transduction
        • Conditioned Language Models
          • Input X / Output Y (Text) / Task
            • Structured Data / NL Description / NL Generation
            • English / Japanese / Translation
            • Document / Short Description / Summarization
            • Utterance / Response / Response Generation
            • Image / Text / Image Captioning
            • Speech / Transcript / Speech Recognition
            • Dialog (multi-turn)
        • Attention
          • Encode each word in the sentence into a vector
          • When decoding, perform a linear combination of these vectors, weighted by “attention weights”
          • Use this combination in picking the next word
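The three attention steps above can be sketched in plain Python. The notes do not say which scoring function is used, so the dot product between the decoder query and each encoder vector is an assumption here, and the names `softmax` and `attend` are illustrative.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_states):
    """Return (attention weights, context vector) for one decoding step.

    query: decoder state vector.
    encoder_states: one vector per source word, same dimension as query.
    """
    # Score each encoded word against the query (dot product: an assumption).
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context = linear combination of encoder vectors, weighted by attention.
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context
```

The returned context vector is what the decoder would combine with its own state when picking the next word.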
    2. Learn More
      • Stanford CS224n, [Link]
      • CMU CS11-747, [Link]
  2. Knowledge in NLP
    1. NLP examples
      • Q & A
      • Information Extraction
      • Dialogue : Smart Reply, but very limited
      • Sentiment Analysis : both a research area in its own right and an input to prediction tasks such as stock price prediction.
    2. Entity2Topic Module
      • By using Entity Linking, one can insert additional information to improve the performance of abstractive summarization models
      • Tips before attention
        • Remove ambiguity on information
        • Eliminate unnecessary information
        • Emphasize more important information
    3. Sentence Classification
      • Problem
        • Context sparsity : Features should be extracted from a single sentence
      • Domain-dependent solutions : these are not easy.
        • Look at neighboring sentences
        • Refer to other reviews
        • Seek latent topics
      • Translation as CS contexts
        • Always available
        • Ambiguity Resolution
        • Extensible Context
        • Better Classifying in WordVec
    4. Beyond BERT
      • BERT is powerful, but its self-training has limitations
        • Human reporting bias (trivial things tend not to be reported)
        • Weak against adversarial examples
      • Considerations
        • Use additional information on Entities
        • Knowledge Encoding
        • Knowledge Distillation
        • Augmentation
        • Paraphrase
        • Supervision of Attention
