Deep Learning for Natural Language Processing

Xipeng Qiu
xpqiu@fudan.edu.cn
http://nlp.fudan.edu.cn
Fudan University
2016/5/29, CCF ADL, Beijing
Outline

Convolutional Neural Network
Recurrent Neural Network
Recursive Neural Network
Challenges & Open Problems
Begin with AI

Human: Memory, Computation
Computer: Learning, Thinking, Creativity
Turing Test
Artificial Intelligence

Definition from Wikipedia: Artificial intelligence (AI) is the intelligence exhibited by machines. Colloquially, the term artificial intelligence is likely to be applied when a machine uses cutting-edge techniques to competently perform or mimic cognitive functions that we intuitively associate with human minds, such as learning and problem solving.
Challenge: Semantic Gap

Figure: Guernica (Picasso)
Machine Learning

Model
Learning Algorithm
Training Data: (x, y)
Basic Concepts of Machine Learning

Input Data: (x_i, y_i), 1 ≤ i ≤ m
Model:
  Linear Model: y = f(x) = w^T x + b
  Generalized Linear Model: y = f(x) = w^T φ(x) + b
  Non-linear Model: Neural Network
Objective Function: Q(θ) + λ‖θ‖²
y_i is the distribution of gold labels; thus, Eq. (6) is the cross-entropy loss function.
The parameters θ are updated by gradient descent:

θ_{t+1} = θ_t − λ Σ_{i=1}^{N} ∂R(θ_t; x^(i), y^(i)) / ∂θ

λ is also called the learning rate in ML.
Stochastic Gradient Descent (SGD)
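The update rule above fits in a few lines of code. Below is a minimal, illustrative NumPy sketch (not from the slides) that fits a linear model y = w^T x + b by plain gradient descent on a squared loss over synthetic data; `lr` plays the role of the learning rate λ. Stochastic gradient descent would estimate the gradient from a single sample or a small mini-batch at each step instead of the full training set.

```python
import numpy as np

# Synthetic data for a linear model y = w^T x + b (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w, true_b = np.array([1.0, -2.0, 0.5]), 0.3
y = X @ true_w + true_b + 0.01 * rng.normal(size=100)

w, b = np.zeros(3), 0.0
lr = 0.1  # the learning rate (lambda in the slides)

for step in range(200):
    y_hat = X @ w + b
    err = y_hat - y
    # Gradients of the mean squared error w.r.t. w and b.
    grad_w = X.T @ err / len(y)
    grad_b = err.mean()
    # Gradient descent update: theta_{t+1} = theta_t - lr * gradient.
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # close to true_w, true_b
```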
Figure: Binary Linear Classification
Logistic Regression

How to learn the parameter w: Perceptron, Logistic Regression, etc.

The posterior probability of y = 1 is

P(y = 1|x) = σ(w^T x),

where σ(·) is the logistic function. The posterior probability of y = 0 is P(y = 0|x) = 1 − P(y = 1|x).
Trang 23) log 1 − σ(w T x (i )
)
(16) The gradient of J (w) is
∂J ( w)
∂w =
N X
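As an illustration of this loss and gradient, here is a small NumPy sketch (my own, assuming labels y ∈ {0, 1}, synthetic data, and no bias term) that trains a logistic regression classifier with gradient descent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
# Two Gaussian clouds as a toy binary classification problem.
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

w = np.zeros(2)
lr = 0.5
for _ in range(500):
    p = sigmoid(X @ w)          # P(y = 1 | x) = sigma(w^T x)
    grad = -(X.T @ (y - p))     # dJ/dw = -sum_i x^(i) (y^(i) - sigma(w^T x^(i)))
    w -= lr * grad / len(y)     # divide by N only to keep the step size stable

accuracy = ((sigmoid(X @ w) > 0.5) == y).mean()
print(w, accuracy)
```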
Softmax Regression

P(y = c|x) = exp(w_c^T x) / Σ_{i=1}^{C} exp(w_i^T x)

To represent class c by a one-hot vector:

y = [ I(1 = c), I(2 = c), · · · , I(C = c) ]^T,

where I(·) is the indicator function.

z = W^T x is the input of the softmax function, and the output is ŷ = softmax(z).
Given training set (x^(i), y^(i)), 1 ≤ i ≤ N, the cross-entropy loss is

J(W) = − Σ_{i=1}^{N} (y^(i))^T log ŷ^(i).

The gradient of J(W) is

∂J(W)/∂W = − Σ_{i=1}^{N} x^(i) (y^(i) − ŷ^(i))^T.
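A minimal NumPy sketch of softmax regression with this cross-entropy loss and gradient (my own illustration; the data are synthetic, the labels are encoded as one-hot vectors, and the bias term is omitted).

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # subtract the max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
N, d, C = 150, 4, 3
X = rng.normal(size=(N, d))
W_true = rng.normal(size=(d, C))
labels = (X @ W_true).argmax(axis=1)       # a learnable toy labeling
Y = np.eye(C)[labels]                      # one-hot targets y^(i)

W = np.zeros((d, C))                       # z = W^T x, computed row-wise as X @ W
lr = 0.5
for _ in range(300):
    Y_hat = softmax(X @ W)                         # y_hat^(i) = softmax(z^(i))
    loss = -np.sum(Y * np.log(Y_hat + 1e-12))      # J(W) = -sum_i (y^(i))^T log y_hat^(i)
    grad = -X.T @ (Y - Y_hat)                      # dJ/dW = -sum_i x^(i) (y^(i) - y_hat^(i))^T
    W -= lr * grad / N                             # divide by N only for step-size stability

print(loss, (softmax(X @ W).argmax(axis=1) == labels).mean())
```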
The ideal pipeline of NLP
But in practice: End-to-End
Model
I like this movie.
I dislike this movie.
Feature Extraction

Bag-of-Words
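A bag-of-words feature vector simply counts how often each vocabulary word occurs in a text, discarding word order. A minimal sketch (the corpus and vocabulary are made up for illustration):

```python
import numpy as np

# Toy corpus; in practice the vocabulary is built from the training data.
docs = ["I like this movie .", "I dislike this movie ."]
vocab = sorted({w for d in docs for w in d.lower().split()})
index = {w: i for i, w in enumerate(vocab)}

def bag_of_words(text):
    # One count per vocabulary entry; word order is discarded.
    v = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in index:
            v[index[w]] += 1
    return v

for d in docs:
    print(d, "->", bag_of_words(d))
```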
Text Classification
Artificial Neural Network

Artificial neural networks¹ (ANNs) are a family of models inspired by biological neural networks (the central nervous systems of animals, in particular the brain).

Artificial neural networks are generally presented as systems of interconnected neurons which exchange messages between each other.

1 https://en.wikipedia.org/wiki/Artificial_neural_network
Figure: activation functions — tanh, rectifier, softplus

2 X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In: International Conference on Artificial Intelligence and Statistics. 2011, pp. 315–323.
3 V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010, pp. 807–814.
4 C. Dugas et al. Incorporating second-order functional knowledge for better option pricing. In: Advances in Neural Information Processing Systems (2001).
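For reference, the activation functions plotted in the figure can be written in a few lines of NumPy (a standalone sketch, not from the slides):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))   # sigma(x); saturates at 0 and 1

def tanh(x):
    return np.tanh(x)                 # saturates at -1 and 1

def rectifier(x):
    return np.maximum(0.0, x)         # ReLU: max(0, x)

def softplus(x):
    return np.log1p(np.exp(x))        # smooth approximation of the rectifier

x = np.linspace(-5, 5, 5)
print(logistic(x), tanh(x), rectifier(x), softplus(x), sep="\n")
```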
Types of Artificial Neural Network⁵

Feedforward neural network, also called Multilayer Perceptron (MLP)
Recurrent neural network

5 https://en.wikipedia.org/wiki/Types_of_artificial_neural_networks
Basic Concepts of Deep Learning

Model: Artificial neural networks that consist of multiple hidden non-linear layers
Function: Non-linear function y = σ(Σ_i w_i x_i + b)
Feedforward Neural Network

In a feedforward neural network, the information moves in only one direction, forward: from the input nodes, data goes through the hidden nodes (if any) and to the output nodes. There are no cycles or loops in the network.
Feedforward Computing

Definitions:
n_l: number of neurons in the l-th layer;
f_l(·): activation function of the l-th layer;
W^(l) ∈ R^{n_l × n_{l−1}}: weight matrix between the (l−1)-th layer and the l-th layer;
b^(l) ∈ R^{n_l}: bias vector between the (l−1)-th layer and the l-th layer.
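Using these definitions, a forward pass can be sketched as below. This assumes the standard recursion a^(l) = f_l(W^(l) a^(l−1) + b^(l)) with a^(0) = x, which matches the shapes given above (my own minimal illustration with random weights).

```python
import numpy as np

def forward(x, weights, biases, activations):
    """Compute the network output for input x.

    weights[l]  : W^(l+1), shape (n_{l+1}, n_l)
    biases[l]   : b^(l+1), shape (n_{l+1},)
    activations : list of elementwise activation functions f_l
    """
    a = x                      # a^(0) = input x
    for W, b, f in zip(weights, biases, activations):
        z = W @ a + b          # z^(l) = W^(l) a^(l-1) + b^(l)
        a = f(z)               # a^(l) = f_l(z^(l))
    return a

rng = np.random.default_rng(3)
relu = lambda z: np.maximum(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# A 4-3-1 network: one hidden ReLU layer, one sigmoid output unit.
Ws = [rng.normal(size=(3, 4)), rng.normal(size=(1, 3))]
bs = [np.zeros(3), np.zeros(1)]
print(forward(rng.normal(size=4), Ws, bs, [relu, sigmoid]))
```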
Combining feedforward network and Machine Learning

Given training samples (x^(i), y^(i)), 1 ≤ i ≤ N, and a feedforward network f(x|W, b), the objective function is

J(W, b) = Σ_{i=1}^{N} L(y^(i), f(x^(i)|W, b)),

where L(·, ·) is a loss function (e.g., the cross-entropy loss).
Figure: (e) logistic; (f) tanh
Tricks and Skills⁶ ⁷

Use ReLU non-linearities
Use cross-entropy loss for classification
SGD + mini-batch
Shuffle the training samples (←− very important)
Early stopping
Normalize the input variables (zero mean, unit variance)
Schedule to decrease the learning rate
Use a bit of L1 or L2 regularization on the weights (or a combination)
Use dropout for regularization
Data augmentation

6 G. B. Orr and K.-R. Müller. Neural Networks: Tricks of the Trade. Springer, 2003.
7 Geoff Hinton, Yoshua Bengio & Yann LeCun. Deep Learning. NIPS 2015 Tutorial.
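Several of these tricks (normalizing the inputs, shuffling, mini-batch SGD, a decaying learning rate, and a bit of L2 regularization) fit into a short training loop. The sketch below is illustrative only, using a toy logistic-regression model on synthetic data; none of the constants come from the slides.

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1, 1, (100, 5)), rng.normal(1, 1, (100, 5))])
y = np.concatenate([np.zeros(100), np.ones(100)])

# Normalize the input variables: zero mean, unit variance.
X = (X - X.mean(axis=0)) / X.std(axis=0)

w = np.zeros(5)
base_lr, batch_size = 0.5, 16
for epoch in range(20):
    lr = base_lr / (1.0 + 0.1 * epoch)        # schedule: decrease the learning rate
    perm = rng.permutation(len(y))            # shuffle the training samples each epoch
    for start in range(0, len(y), batch_size):
        idx = perm[start:start + batch_size]  # one mini-batch
        p = 1.0 / (1.0 + np.exp(-X[idx] @ w))
        grad = -(X[idx].T @ (y[idx] - p)) / len(idx)
        w -= lr * (grad + 1e-3 * w)           # gradient step with a bit of L2 regularization

print(((1.0 / (1.0 + np.exp(-X @ w)) > 0.5) == y).mean())
```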
Neural Models for Representation Learning: General Architecture

General Neural Architectures for NLP

How to use neural networks for NLP tasks?

Distributed Representation
General Neural Architectures for NLP⁸

1 represent the words/features with dense vectors (embeddings) by a lookup table;
2 concatenate the vectors;
3 multi-layer neural networks for classification / matching / ranking.

8 R. Collobert et al. Natural language processing (almost) from scratch. In: The Journal of Machine Learning Research (2011).
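A minimal sketch of this three-step architecture (embedding lookup, concatenation, MLP on top) in NumPy. The vocabulary, window size, and dimensions are made up for illustration, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(5)
vocab = {"I": 0, "like": 1, "this": 2, "movie": 3, ".": 4}
emb_dim, hidden, n_classes = 8, 16, 2

E = rng.normal(size=(len(vocab), emb_dim))    # lookup table of word embeddings
W1 = rng.normal(size=(hidden, 3 * emb_dim))   # a window of 3 words feeds the hidden layer
b1 = np.zeros(hidden)
W2 = rng.normal(size=(n_classes, hidden))
b2 = np.zeros(n_classes)

def score(window):
    # 1. represent the words with dense vectors via the lookup table
    vecs = [E[vocab[w]] for w in window]
    # 2. concatenate the vectors
    x = np.concatenate(vecs)
    # 3. multi-layer neural network on top (here: one hidden tanh layer + softmax)
    h = np.tanh(W1 @ x + b1)
    z = W2 @ h + b2
    return np.exp(z) / np.exp(z).sum()

print(score(["I", "like", "this"]))
```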
Difference from the traditional methods

Features: One-hot Representation vs. Distributed Representation
The key point is:

how to encode a word, phrase, sentence, paragraph, or even document into a distributed representation?

Representation Learning
Representation Learning for NLP

Hierarchical Models: two-level CNN
Sequence Models: LSTM, Paragraph Vector
Let's start with the Language Model

A statistical language model is a probability distribution over sequences of words:

P(w_1, w_2, · · · , w_T) = Π_{i=1}^{T} P(w_i | w_1, · · · , w_{i−1})
Neural Probabilistic Language Model⁹

turn unsupervised learning into supervised learning;
avoid the data sparsity of the n-gram model;
project each word into a low-dimensional space.

9 Y. Bengio, R. Ducharme, and P. Vincent. A neural probabilistic language model. In: Journal of Machine Learning Research (2003).
Problem of Very Large Vocabulary

Softmax output:

P_θ^h(w) = exp(s_θ(w, h)) / Σ_{w′} exp(s_θ(w′, h))

Unfortunately, both evaluating P_θ^h(w) and computing the likelihood gradient require normalizing over the entire vocabulary.

Hierarchical Softmax: a tree-structured vocabulary¹⁰
Negative Sampling¹¹, noise-contrastive estimation (NCE)¹²

10 A. Mnih and G. Hinton. A scalable hierarchical distributed language model. In: Advances in Neural Information Processing Systems (2009); F. Morin and Y. Bengio. Hierarchical probabilistic neural network language model. In: AISTATS. Vol. 5. 2005, pp. 246–252.
11 T. Mikolov et al. Efficient estimation of word representations in vector space. In: arXiv preprint arXiv:1301.3781 (2013).
12 A. Mnih and K. Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In: Advances in Neural Information Processing Systems. 2013, pp. 2265–2273.
Linguistic Regularities of Word Embeddings¹³

13 T. Mikolov, W.-t. Yih, and G. Zweig. Linguistic regularities in continuous space word representations. In: HLT-NAACL. 2013, pp. 746–751.
Skip-Gram Model¹⁴

14 Mikolov et al. Efficient estimation of word representations in vector space.
Skip-Gram Model

Given a pair of words (w, c), the probability that the word c is observed in the context of the target word w is given by

Pr(D = 1|w, c) = 1 / (1 + exp(−w^T c)),

where w and c are the embedding vectors of w and c respectively. The probability of not observing word c in the context of w is given by

Pr(D = 0|w, c) = 1 − Pr(D = 1|w, c) = 1 / (1 + exp(w^T c)).
Skip-Gram Model with Negative Sampling

Given a training set D, the word embeddings are learned by maximizing the following objective function:

J(θ) = Σ_{(w,c)∈D} log Pr(D = 1|w, c) + Σ_{(w,c)∈D′} log Pr(D = 0|w, c),

where D′ is a set of randomly sampled negative (word, context) pairs.
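As an illustration of this objective, the sketch below evaluates it for a toy set of observed pairs D and sampled negative pairs D′, and takes one gradient step on the embeddings (my own minimal example; a real implementation would sample negatives from a unigram distribution and iterate over a large corpus).

```python
import numpy as np

rng = np.random.default_rng(6)
V, dim = 10, 4                       # vocabulary size and embedding dimension
W = 0.1 * rng.normal(size=(V, dim))  # target-word embeddings w
C = 0.1 * rng.normal(size=(V, dim))  # context-word embeddings c

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

pos = [(0, 1), (0, 2), (3, 4)]                      # observed (w, c) pairs: D
neg = [(w, int(rng.integers(V))) for w, _ in pos]   # randomly sampled pairs: D'

# Objective: sum_{(w,c) in D} log Pr(D=1|w,c) + sum_{(w,c) in D'} log Pr(D=0|w,c)
obj = sum(np.log(sigmoid(W[w] @ C[c])) for w, c in pos) \
    + sum(np.log(sigmoid(-W[w] @ C[c])) for w, c in neg)
print("objective:", obj)

# One gradient-ascent step on the positive pairs (gradient of log sigma(w^T c)).
lr = 0.1
for w, c in pos:
    g = 1.0 - sigmoid(W[w] @ C[c])
    dw, dc = g * C[c], g * W[w]
    W[w] += lr * dw
    C[c] += lr * dc
```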
Convolutional Neural Network

A convolutional neural network (CNN, or ConvNet) is a type of feed-forward artificial neural network.
One-dimensional convolution

Figure from: http://cs231n.github.io/convolutional-networks/
Convolutional Layer
⊗ is the convolution operation
w^(l) is shared by all the neurons of the l-th layer
Only m + 1 parameters are needed, and n^(l+1) = n^(l) − m + 1
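A narrow one-dimensional convolution with a shared filter of size m can be sketched as follows (illustrative NumPy code; note the output length n − m + 1 and the m + 1 shared parameters, matching the counts above).

```python
import numpy as np

def conv1d(x, w, b):
    """Narrow 1-D convolution: the same filter w (and bias b) is applied
    at every position, so all outputs share only m + 1 parameters."""
    n, m = len(x), len(w)
    return np.array([w @ x[t:t + m] + b for t in range(n - m + 1)])

x = np.arange(8, dtype=float)    # n = 8 inputs
w = np.array([1.0, 0.0, -1.0])   # m = 3 filter weights
b = 0.5
y = conv1d(x, w, b)
print(y, y.shape)                # output length n - m + 1 = 6
```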
Fully Connected Layer vs. Convolutional Layer

(a) Fully Connected Layer
(b) Convolutional Layer
Pooling Layer

It is common to periodically insert a pooling layer in between successive convolutional layers:
progressively reduce the spatial size of the representation;
reduce the amount of parameters and computation in the network;
avoid overfitting.

For a feature map X^(l), we divide it into several (non-)overlapping regions R_k, k = 1, · · · , K. A pooling function down(·) is applied to each region.
X^(l+1) = f(w^(l+1) · down(X^(l)) + b^(l+1))

Two choices of down(·): Maximum Pooling and Average Pooling

pool_max(R_k) = max_{i∈R_k} a_i
pool_avg(R_k) = (1/|R_k|) Σ_{i∈R_k} a_i
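The two pooling functions can be sketched directly (a small NumPy illustration; here the feature map is one-dimensional and split into non-overlapping regions of equal size).

```python
import numpy as np

def pool(x, region, mode="max"):
    """Down-sample x by taking the max or the average over non-overlapping regions."""
    regions = x.reshape(-1, region)    # the regions R_k (assumes len(x) % region == 0)
    if mode == "max":
        return regions.max(axis=1)     # pool_max(R_k)
    return regions.mean(axis=1)        # pool_avg(R_k)

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.0, 7.0, 6.0])
print(pool(x, 2, "max"))   # [3. 5. 4. 7.]
print(pool(x, 2, "avg"))   # [2.  3.5 2.  6.5]
```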
Figure from: http://cs231n.github.io/convolutional-networks/
Large Scale Visual Recognition Challenge, 2010–2015
DeepMind's AlphaGo

http://cs231n.stanford.edu/
CNN for Sentence Modeling

Input: a sentence of length n. After the lookup layer, X = [x_1, x_2, · · · , x_n] ∈ R^{d×n}.
CNN for Sentence Modeling¹⁶

Key steps:
Convolution: z_{t:t+m−1} = x_t ⊕ x_{t+1} ⊕ · · · ⊕ x_{t+m−1} ∈ R^{dm}
Matrix-vector operation: x^l_t = f(W^l z_{t:t+m−1} + b^l)
Pooling (max over time): x^l_i = max_t x^{l−1}_{i,t}

16 Collobert et al. Natural language processing (almost) from scratch.
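Putting these key steps together, a minimal sketch of one convolutional layer with max-over-time pooling over a sentence (random embeddings and weights, made-up dimensions; purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
d, n, m, n_filters = 5, 9, 3, 4    # embedding dim, sentence length, window size, filters
X = rng.normal(size=(d, n))        # X = [x_1, ..., x_n] after the lookup layer

W = rng.normal(size=(n_filters, d * m))
b = np.zeros(n_filters)

# Convolution: z_{t:t+m-1} = x_t (+) ... (+) x_{t+m-1}, then x^l_t = f(W z + b).
feature_map = np.stack([
    np.tanh(W @ X[:, t:t + m].reshape(-1, order="F") + b)   # column-major flatten = concatenation
    for t in range(n - m + 1)
], axis=1)                          # shape: (n_filters, n - m + 1)

# Max-over-time pooling: one value per filter, independent of the sentence length.
sentence_vec = feature_map.max(axis=1)
print(sentence_vec.shape)           # (4,)
```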
CNN for Sentence Modeling¹⁷

multiple filters / multiple channels
pooling (max over time)
Dynamic CNN for Sentence Modeling¹⁸

Key steps:
One-dimensional convolution: x^l_{i,t} = f(w^l_i x_{i,t:t+m−1} + b^l_i)
k-max pooling (max over time): k_l = max(k_top, ⌈(L−l)/L · n⌉)
(optional) Folding: sums every two rows in a feature map component-wise
Multiple filters / multiple channels

18 N. Kalchbrenner, E. Grefenstette, and P. Blunsom. A convolutional neural network for modelling sentences. In: Proceedings of ACL. 2014.
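k-max pooling keeps the k largest values of each row of a feature map while preserving their original order. A small sketch (my own, with a made-up feature map):

```python
import numpy as np

def k_max_pooling(feature_map, k):
    """Keep, for each row, the k largest values in their original left-to-right order."""
    idx = np.argsort(feature_map, axis=1)[:, -k:]   # indices of the k largest per row
    idx.sort(axis=1)                                # restore the original order
    return np.take_along_axis(feature_map, idx, axis=1)

F = np.array([[0.1, 0.9, 0.3, 0.7, 0.2],
              [0.5, 0.4, 0.8, 0.1, 0.6]])
print(k_max_pooling(F, 3))
# [[0.9 0.3 0.7]
#  [0.5 0.8 0.6]]
```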
CNN for Sentence Modeling¹⁹
Recurrent Neural Network (RNN)

h_t = 0, if t = 0;
h_t = f(h_{t−1}, x_t), otherwise.    (67)
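A minimal sketch of the recurrence h_t = f(h_{t−1}, x_t), with f taken to be the common form tanh(W h_{t−1} + U x_t + b) (an assumption for illustration; the slides leave f abstract at this point) and h_0 = 0.

```python
import numpy as np

rng = np.random.default_rng(8)
input_dim, hidden_dim, T = 3, 4, 5

U = rng.normal(size=(hidden_dim, input_dim))    # input-to-hidden weights
W = rng.normal(size=(hidden_dim, hidden_dim))   # hidden-to-hidden weights
b = np.zeros(hidden_dim)

xs = rng.normal(size=(T, input_dim))            # an input sequence x_1 ... x_T
h = np.zeros(hidden_dim)                        # h_0 = 0

for t in range(T):
    # h_t = f(h_{t-1}, x_t); here f is tanh(W h_{t-1} + U x_t + b).
    h = np.tanh(W @ h + U @ xs[t] + b)

print(h)                                        # final hidden state h_T
```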