
Deep Learning for Natural Language Processing (CCF ADL, 2016-05-29)



Page 1

Deep Learning for Natural Language Processing

Xipeng Qiu (xpqiu@fudan.edu.cn)
http://nlp.fudan.edu.cn
Fudan University
2016/5/29, CCF ADL, Beijing

Page 2

Convolutional Neural Network

Recurrent Neural Network

Recursive Neural Network

Challenges & Open Problems

Page 3

Convolutional Neural Network

Recurrent Neural Network

Recursive Neural Network

Challenges & Open Problems

Page 4

Begin with AI

Human: Memory, Computation

Computer: Learning, Thinking, Creativity

Page 5

Turing Test

Page 6

Artificial Intelligence

Definition from Wikipedia

Artificial intelligence (AI) is the intelligence exhibited by machines.

Colloquially, the term artificial intelligence is likely to be applied when a machine uses cutting-edge techniques to competently perform or mimic cognitive functions that we intuitively associate with human minds, such as learning and problem solving.

Page 7

Challenge: Semantic Gap

Page 8

Challenge: Semantic Gap

Page 9

Challenge: Semantic Gap

Figure: Guernica (Picasso)

Page 10

Convolutional Neural Network

Recurrent Neural Network

Recursive Neural Network

Challenges & Open Problems

Page 11

Machine Learning

Model

Learning Algorithm

Training Data: (x, y)

Page 12

Basic Concepts of Machine Learning

Input Data: (x_i, y_i), 1 ≤ i ≤ m

Model:

Linear Model: y = f(x) = wᵀx + b

Generalized Linear Model: y = f(x) = wᵀφ(x) + b

Non-linear Model: Neural Network

Objective Function: Q(θ) + λ‖θ‖²

Page 15

If y_i is the distribution of gold labels, then Eq. (6),

J(θ) = −Σ_{i=1}^N y_iᵀ log ŷ_i,

is the cross-entropy loss function.
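As a concrete reading of this loss, here is a minimal NumPy sketch (the function name and the eps smoothing term are our additions, not from the slides):

```python
import numpy as np

def cross_entropy(y_gold, y_hat, eps=1e-12):
    """Cross-entropy between a gold label distribution y_gold and a
    predicted distribution y_hat (both arrays of shape [C])."""
    return -np.sum(y_gold * np.log(y_hat + eps))

print(cross_entropy(np.array([0., 1., 0.]), np.array([0.2, 0.7, 0.1])))
```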

Page 18

θ_{t+1} = θ_t − λ Σ_{i=1}^N ∂R(θ_t; x^(i), y^(i)) / ∂θ

λ is also called the Learning Rate in ML.

Page 19

Stochastic Gradient Descent (SGD)
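SGD replaces the full-batch sum in the update above with the gradient of a single randomly drawn sample. A minimal sketch under that reading (all names are ours):

```python
import numpy as np

def sgd(theta, data, grad_fn, lr=0.1, steps=1000, seed=0):
    """Stochastic gradient descent: each step uses the gradient of one
    randomly drawn training sample instead of the whole training set."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        x, y = data[rng.integers(len(data))]
        theta = theta - lr * grad_fn(theta, x, y)
    return theta
```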

Page 21

Figure: Binary Linear Classification

Page 22

Logistic Regression

How to learn the parameter w: Perceptron, Logistic Regression, etc.

The posterior probability of y = 1 is

P(y = 1|x) = σ(wᵀx),

where σ(·) is the logistic function σ(z) = 1 / (1 + exp(−z)).

The posterior probability of y = 0 is P(y = 0|x) = 1 − P(y = 1|x)

Page 23

The cross-entropy loss is

J(w) = −Σ_{i=1}^N [ y^(i) log σ(wᵀx^(i)) + (1 − y^(i)) log(1 − σ(wᵀx^(i))) ]   (16)

The gradient of J(w) is

∂J(w)/∂w = −Σ_{i=1}^N x^(i) (y^(i) − σ(wᵀx^(i)))
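A minimal NumPy sketch of this loss and gradient, assuming the standard reconstruction of Eq. (16) above (names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lr_loss_and_grad(w, X, y):
    """Cross-entropy loss (Eq. 16) and gradient for binary logistic
    regression. X: [N, d] inputs, y: [N] labels in {0, 1}."""
    p = sigmoid(X @ w)                    # P(y = 1 | x) for each sample
    loss = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    grad = -X.T @ (y - p)                 # dJ/dw = -sum_i x_i (y_i - p_i)
    return loss, grad
```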

Page 25

The posterior probability of class c is

P(y = c|x) = exp(w_cᵀx) / Σ_{i=1}^C exp(w_iᵀx)

To represent class c by a one-hot vector:

y = [I(1 = c), I(2 = c), …, I(C = c)]ᵀ,

where I(·) is the indicator function.

Page 26

z = Wᵀx is the input of the softmax function.

Page 27

Softmax Regression

Given the training set (x^(i), y^(i)), 1 ≤ i ≤ N, the cross-entropy loss is

J(W) = −Σ_{i=1}^N (y^(i))ᵀ log ŷ^(i)

The gradient of J(W) is

∂J(W)/∂W = −Σ_{i=1}^N x^(i) (y^(i) − ŷ^(i))ᵀ
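A NumPy sketch matching these formulas (the max-subtraction is a standard numerical-stability trick, not from the slides):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_loss_and_grad(W, X, Y):
    """Cross-entropy loss and gradient for softmax regression.
    X: [N, d], Y: [N, C] one-hot gold labels, W: [d, C]."""
    Y_hat = softmax(X @ W)                   # predicted distributions
    loss = -np.sum(Y * np.log(Y_hat + 1e-12))
    grad = -X.T @ (Y - Y_hat)                # dJ/dW = -sum_i x_i (y_i - y_hat_i)^T
    return loss, grad
```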

Page 28

The ideal pipeline of NLP

Page 29

But in practice: End-to-End

Model

I like this movie.

I dislike this movie.

Page 30

Feature Extraction

Bag-of-Words
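A tiny sketch of bag-of-words feature extraction, using the example sentences from the previous slide (the toy vocabulary is ours):

```python
from collections import Counter

def bag_of_words(text, vocab):
    """Map a sentence to a |V|-dimensional vector of word counts."""
    counts = Counter(text.lower().replace(".", "").split())
    return [counts[w] for w in vocab]

vocab = ["i", "like", "dislike", "this", "movie"]
print(bag_of_words("I like this movie.", vocab))     # [1, 1, 0, 1, 1]
print(bag_of_words("I dislike this movie.", vocab))  # [1, 0, 1, 1, 1]
```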

Page 31

Text Classification

Page 32

Convolutional Neural Network

Recurrent Neural Network

Recursive Neural Network

Challenges & Open Problems

Page 33

Artificial Neural Network

Artificial neural networks1 (ANNs) are a family of models inspired by biological neural networks (the central nervous systems of animals, in particular the brain).

Artificial neural networks are generally presented as systems of interconnected neurons which exchange messages between each other.

1 https://en.wikipedia.org/wiki/Artificial_neural_network

Page 36

2 X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In: International Conference on Artificial Intelligence and Statistics. 2011, pp. 315–323.

3 V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010, pp. 807–814.

4 C. Dugas et al. Incorporating second-order functional knowledge for better option pricing. In: Advances in Neural Information Processing Systems (2001).

Page 37

Figure: activation functions: (b) tanh, (c) rectifier, (d) softplus

Page 38

Types of Artificial Neural Network5

Feedforward neural network, also called Multilayer Perceptron (MLP)

Recurrent neural network

5 https://en.wikipedia.org/wiki/Types_of_artificial_neural_networks

Page 39

Basic Concepts of Deep Learning

Model: Artificial neural networks that consist of multiple hidden non-linear layers

Function: Non-linear function y = σ(Σ_i w_i x_i + b)

Page 40

Feedforward Neural Network

In a feedforward neural network, the information moves in only one direction, forward: from the input nodes, data goes through the hidden nodes (if any) and to the output nodes.

There are no cycles or loops in the network

Page 41

Feedforward Computing

Definitions:

n_l: number of neurons in the l-th layer;

f_l(·): activation function of the l-th layer;

W^(l) ∈ R^{n_l × n_{l−1}}: weight matrix between the (l−1)-th layer and the l-th layer;

b^(l) ∈ R^{n_l}: bias vector between the (l−1)-th layer and the l-th layer.
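These definitions give the usual layer-by-layer recursion a^(l) = f_l(W^(l) a^(l−1) + b^(l)). A minimal sketch with random parameters, where the shapes follow W^(l) ∈ R^{n_l × n_{l−1}} (names and sizes are ours):

```python
import numpy as np

def feedforward(x, weights, biases, f=np.tanh):
    """Compute a^(l) = f(W^(l) a^(l-1) + b^(l)) layer by layer."""
    a = x
    for W, b in zip(weights, biases):
        a = f(W @ a + b)
    return a

rng = np.random.default_rng(0)
# a 3 -> 4 -> 2 network: W^(1) is 4x3, W^(2) is 2x4
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
print(feedforward(np.ones(3), weights, biases))
```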

Page 43

Combining feedforward network and Machine Learning

Given training samples (x^(i), y^(i)), 1 ≤ i ≤ N, and a feedforward network f(x|W, b), the objective function is

J(W, b) = Σ_{i=1}^N L(y^(i), f(x^(i)|W, b)),

where L(·, ·) is a loss function such as the cross-entropy above.

Page 46

∂z^(l)_i / ∂w^(l)_{ij} = a^(l−1)_j

Page 50

Figure: derivatives of the activation functions: (e) logistic, (f) tanh

Page 52

Tricks and Skills6 7

Use ReLU non-linearities

Use cross-entropy loss for classification

SGD + mini-batch

Shuffle the training samples (←− very important)

Early stopping

Normalize the input variables (zero mean, unit variance)

Schedule to decrease the learning rate

Use a bit of L1 or L2 regularization on the weights (or a combination)

Use dropout for regularization

Data augmentation

(A training-loop sketch combining several of these tricks follows the references below.)

6 G. B. Orr and K.-R. Müller. Neural Networks: Tricks of the Trade. Springer, 2003.

7 Geoff Hinton, Yoshua Bengio & Yann LeCun. Deep Learning. NIPS 2015 Tutorial.
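The sketch below strings several of these tricks together: per-epoch shuffling, mini-batch SGD, L2 regularization, a decaying learning rate, and early stopping on a held-out split. It is an illustration under our own naming, not code from the tutorial; loss_grad can be any function returning (loss, gradient), such as the logistic-regression one sketched earlier.

```python
import numpy as np

def train(X, y, loss_grad, lr=0.1, batch=32, epochs=100,
          l2=1e-4, decay=0.95, patience=5, seed=0):
    """Mini-batch SGD with shuffling, L2 regularization,
    a learning-rate schedule, and early stopping."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n_val = max(1, len(X) // 10)              # hold out 10% for validation
    X_val, y_val, X_tr, y_tr = X[:n_val], y[:n_val], X[n_val:], y[n_val:]
    best, wait = np.inf, 0
    for _ in range(epochs):
        idx = rng.permutation(len(X_tr))      # shuffle every epoch
        for s in range(0, len(idx), batch):
            b = idx[s:s + batch]
            _, g = loss_grad(w, X_tr[b], y_tr[b])
            w -= lr * (g / len(b) + l2 * w)   # L2 penalty on the weights
        lr *= decay                           # decrease the learning rate
        val_loss, _ = loss_grad(w, X_val, y_val)
        if val_loss < best - 1e-6:
            best, wait = val_loss, 0
        else:
            wait += 1
            if wait >= patience:              # early stopping
                break
    return w
```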

Page 53

Convolutional Neural Network

Recurrent Neural Network

Recursive Neural Network

Challenges & Open Problems

Page 54

Neural Models for Representation Learning: General Architecture

General Neural Architectures for NLP

How to use neural networks for NLP tasks?

Distributed Representation

Page 56

General Neural Architectures for NLP8

1 represent the words/features with dense vectors (embeddings) by a lookup table;

2 concatenate the vectors;

3 multi-layer neural networks, for classification / matching / ranking (see the sketch after the reference below)

8 R. Collobert et al. Natural language processing (almost) from scratch. In: The Journal of Machine Learning Research (2011).
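A minimal sketch of steps 1–3 for a window of words (all sizes and names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, win, h, C = 10000, 50, 3, 100, 5   # vocab, emb. dim, window, hidden, classes
E = rng.normal(scale=0.1, size=(V, d))   # 1) lookup table of embeddings
W1 = rng.normal(scale=0.1, size=(h, win * d)); b1 = np.zeros(h)
W2 = rng.normal(scale=0.1, size=(C, h));       b2 = np.zeros(C)

def score_window(word_ids):
    x = E[word_ids].reshape(-1)          # 2) concatenate the vectors
    hid = np.tanh(W1 @ x + b1)           # 3) multi-layer network ...
    return W2 @ hid + b2                 # ... producing class scores

print(score_window(np.array([12, 7, 431])))
```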

Page 57

Difference from the traditional methods

Features: One-hot Representation (traditional methods) vs. Distributed Representation (neural models)

Page 58

The key point is

how to encode the word, phrase, sentence, paragraph, or even document into the distributed representation?

Representation Learning

Page 59

Representation Learning for NLP

Hierarchical Models: two-level CNN

Sequence Models: LSTM, Paragraph Vector

Page 60

Let's start with Language Model

A statistical language model is a probability distribution over sequences of words:

P(w_1, …, w_T) = Π_{i=1}^T P(w_i | w_1, …, w_{i−1})
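For instance, a count-based bigram model truncates each conditional to one word of history; a toy sketch (the corpus and names are ours):

```python
from collections import Counter

corpus = "i like this movie . i dislike this movie .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_next(w, h):
    """MLE estimate of P(w | h) under a bigram approximation."""
    return bigrams[(h, w)] / unigrams[h]

# P(w_1..w_T) ~= P(w_1) * prod_i P(w_i | w_{i-1})
sent = "i like this movie".split()
p = unigrams[sent[0]] / len(corpus)
for h, w in zip(sent, sent[1:]):
    p *= p_next(w, h)
print(p)  # 0.1
```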

Page 61

Neural Probabilistic Language Model9

turn unsupervised learning into supervised learning;

avoid the data sparsity of the n-gram model;

project each word into a low-dimensional space

9 Y. Bengio, R. Ducharme, and P. Vincent. A neural probabilistic language model. In: Journal of Machine Learning Research (2003).

Page 62

Problem of Very Large Vocabulary

Softmax output:

P_θ(w|h) = exp(s_θ(w, h)) / Σ_{w′} exp(s_θ(w′, h))

Unfortunately, both evaluating P_θ(w|h) and computing the likelihood gradient require normalizing over the entire vocabulary.

Hierarchical Softmax: a tree-structured vocabulary10

Negative Sampling11, noise-contrastive estimation (NCE)12

10 A. Mnih and G. Hinton. A scalable hierarchical distributed language model. In: Advances in Neural Information Processing Systems (2009); F. Morin and Y. Bengio. Hierarchical probabilistic neural network language model. In: AISTATS. Vol. 5. Citeseer, 2005, pp. 246–252.

11 T. Mikolov et al. Efficient estimation of word representations in vector space. In: arXiv preprint arXiv:1301.3781 (2013).

12 A. Mnih and K. Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In: Advances in Neural Information Processing Systems. 2013, pp. 2265–2273.

Page 63

Linguistic Regularities of Word Embeddings13

13 T. Mikolov, W.-t. Yih, and G. Zweig. Linguistic regularities in continuous space word representations. In: HLT-NAACL. 2013, pp. 746–751.

Page 64

Skip-Gram Model14

14 Mikolov et al., Efficient estimation of word representations in vector space.

Page 65

Skip-Gram Model

Given a pair of words (w, c), the probability that the word c is observed in the context of the target word w is given by

Pr(D = 1|w, c) = 1 / (1 + exp(−wᵀc)),

where w and c are the embedding vectors of w and c respectively.

The probability of not observing word c in the context of w is given by

Pr(D = 0|w, c) = exp(−wᵀc) / (1 + exp(−wᵀc)).

Page 66

Skip-Gram Model with Negative Sampling

Given a training set D of observed (word, context) pairs and a set D′ of negative (randomly sampled) pairs, the word embeddings are learned by maximizing the following objective function:

J = Σ_{(w,c)∈D} log σ(wᵀc) + Σ_{(w,c)∈D′} log σ(−wᵀc)
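A sketch of this objective for a single target word with k sampled negatives (assuming the standard reconstruction above; all names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgns_objective(w, c_pos, c_negs):
    """log sigma(w.c) for the observed (w, c) pair in D, plus
    log sigma(-w.c') for each negative pair in D'."""
    obj = np.log(sigmoid(w @ c_pos))
    obj += sum(np.log(sigmoid(-w @ c)) for c in c_negs)
    return obj

rng = np.random.default_rng(0)
w, c_pos = rng.normal(size=50), rng.normal(size=50)
c_negs = rng.normal(size=(5, 50))    # k = 5 negative samples
print(sgns_objective(w, c_pos, c_negs))
```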

Page 67

Convolutional Neural Network

Recurrent Neural Network

Recursive Neural Network

Challenges & Open Problems

Page 68

Convolutional Neural Network

A convolutional neural network (CNN, or ConvNet) is a type of feed-forward artificial neural network.

Page 69

Convolution: y_t = Σ_{k=1}^m w_k · x_{t−k+1}

Page 70

One-dimensional convolution

15 Figure from: http://cs231n.github.io/convolutional-networks/

Page 71

Net input of a convolutional layer: z^(l)_t = Σ_{u=1}^m w^(l)_u a^(l−1)_{t+u−1} + b^(l)

Page 72

Convolutional Layer

⊗ is the convolution operation

w^(l) is shared by all the neurons of the l-th layer

Just m + 1 parameters are needed, and n^(l+1) = n^(l) − m + 1
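A narrow 1-D convolution sketch matching these counts: a single shared filter of size m (plus a bias) maps n inputs to n − m + 1 outputs (names are ours):

```python
import numpy as np

def conv1d(x, w, b=0.0, f=np.tanh):
    """Slide a shared filter w (size m) over x: m + 1 parameters,
    output length n - m + 1."""
    m = len(w)
    z = np.array([w @ x[t:t + m] for t in range(len(x) - m + 1)])
    return f(z + b)

x = np.arange(8, dtype=float)    # n = 8
w = np.array([0.5, -0.2, 0.1])   # m = 3  ->  6 outputs
print(conv1d(x, w))
```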

Page 73

Fully Connected Layer vs. Convolutional Layer

(a) Fully Connected Layer

(b) Convolutional Layer

Page 74

Pooling Layer

It is common to periodically insert a pooling layer in between successive convolutional layers:

progressively reduce the spatial size of the representation

reduce the amount of parameters and computation in the network

avoid overfitting

For a feature map X^(l), we divide it into several (non-)overlapping regions R_k, k = 1, …, K, and apply a pooling function down(·) to each region.

Page 75

Pooling Layer

X^(l+1) = f(w^(l+1) · down(X^(l)) + b^(l+1))

Two choices of down(·): Maximum Pooling and Average Pooling

pool_max(R_k) = max_{i∈R_k} x_i

pool_avg(R_k) = (1/|R_k|) Σ_{i∈R_k} x_i
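A sketch of both pooling choices over non-overlapping regions of size k (names are ours):

```python
import numpy as np

def pool(x, k, mode="max"):
    """Split x into non-overlapping length-k regions R_k and keep
    one value per region (max or average)."""
    regions = x[: len(x) // k * k].reshape(-1, k)
    return regions.max(axis=1) if mode == "max" else regions.mean(axis=1)

x = np.array([1., 3., 2., 8., 4., 6.])
print(pool(x, 2, "max"))   # [3. 8. 6.]
print(pool(x, 2, "avg"))   # [2. 5. 5.]
```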

Page 76

Pooling Layer

15 Figure from: http://cs231n.github.io/convolutional-networks/

Page 77

Large Scale Visual Recognition Challenge

2010-2015

Page 78

DeepMind's AlphaGo

15 http://cs231n.stanford.edu/

Page 80

CNN for Sentence Modeling

Input: a sentence of length n

After the lookup layer, X = [x_1, x_2, …, x_n] ∈ R^{d×n}

Page 81

CNN for Sentence Modeling16

Key steps:

convolution (vector concatenation): z_{t:t+m−1} = x_t ⊕ x_{t+1} ⊕ … ⊕ x_{t+m−1} ∈ R^{dm}

matrix-vector operation: x^l_t = f(W^l z_{t:t+m−1} + b^l)

pooling (max over time): x^l_i = max_t x^{l−1}_{i,t}

16 Collobert et al., Natural language processing (almost) from scratch.

Page 82

CNN for Sentence Modeling17

multiple filters / multiple channels

pooling (max over time)

Page 83

Dynamic CNN for Sentence Modeling18

Key steps:

one-dimensional convolution: x^l_{i,t} = f(w^l_i x_{i,t:t+m−1} + b^l_i)

k-max pooling (max over time), with a layer-dependent k: k_l = max(k_top, ⌈(L−l)/L · n⌉)

(optional) folding: sums every two rows in a feature map component-wise

multiple filters / multiple channels

18 N. Kalchbrenner, E. Grefenstette, and P. Blunsom. A convolutional neural network for modelling sentences. In: Proceedings of ACL. 2014.
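A sketch of (dynamic) k-max pooling as described above; the order-preserving selection and the layer-dependent k follow Kalchbrenner et al., with our own function names:

```python
import numpy as np

def k_max_pooling(x, k):
    """Keep the k largest values of x, preserving their original order."""
    idx = np.sort(np.argsort(x)[-k:])
    return x[idx]

def dynamic_k(l, L, n, k_top=3):
    """Layer-dependent k: k_l = max(k_top, ceil((L - l) / L * n))."""
    return max(k_top, int(np.ceil((L - l) / L * n)))

x = np.array([0.2, 0.9, 0.1, 0.7, 0.5])
print(k_max_pooling(x, k=3))      # [0.9 0.7 0.5]
print(dynamic_k(l=1, L=3, n=18))  # 12
```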

Page 84

CNN for Sentence Modeling19

Page 85

Convolutional Neural Network

Recurrent Neural Network

Recursive Neural Network

Challenges & Open Problems

Page 86

Recurrent Neural Network (RNN)

h_t = 0 if t = 0; h_t = f(h_{t−1}, x_t) otherwise   (67)
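The slides leave f abstract here; a common parameterization is f(h_{t−1}, x_t) = tanh(W_h h_{t−1} + W_x x_t + b). A minimal sketch under that assumption (names are ours):

```python
import numpy as np

def rnn_last_state(xs, W_h, W_x, b, f=np.tanh):
    """h_0 = 0; h_t = f(W_h h_{t-1} + W_x x_t + b) for t >= 1."""
    h = np.zeros(W_h.shape[0])
    for x in xs:
        h = f(W_h @ h + W_x @ x + b)
    return h

rng = np.random.default_rng(0)
W_h, W_x, b = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4)
xs = rng.normal(size=(5, 3))   # a sequence of five 3-d inputs
print(rnn_last_state(xs, W_h, W_x, b))
```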

