Deep Learning for Natural Language Processing

Xipeng Qiu
xpqiu@fudan.edu.cn
http://nlp.fudan.edu.cn
Fudan University
2016/5/29, CCF ADL, Beijing
Outline

Convolutional Neural Network
Recurrent Neural Network
Recursive Neural Network
Challenges & Open Problems
Begin with AI

Human: Memory, Computation
Computer: Learning, Thinking, Creativity
Turing Test
Artificial Intelligence

Definition from Wikipedia: Artificial intelligence (AI) is the intelligence exhibited by machines. Colloquially, the term artificial intelligence is likely to be applied when a machine uses cutting-edge techniques to competently perform or mimic cognitive functions that we intuitively associate with human minds, such as learning and problem solving.
Challenge: Semantic Gap

Figure: Guernica (Picasso)
Machine Learning

Model
Learning Algorithm
Training Data: (x, y)
Basic Concepts of Machine Learning

Input Data: (x_i, y_i), 1 ≤ i ≤ m
Model:
  Linear Model: y = f(x) = w^T x + b
  Generalized Linear Model: y = f(x) = w^T φ(x) + b
  Non-linear Model: Neural Network
Objective Function: Q(θ) + λ‖θ‖²
y_i is the distribution of gold labels; thus, Eq. (6) is the cross-entropy loss function.
The parameters θ are updated by gradient descent:

θ_{t+1} = θ_t − λ Σ_{i=1}^{N} ∂R(θ_t; x^(i), y^(i)) / ∂θ

λ is also called the learning rate in ML.
Stochastic Gradient Descent (SGD)
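The update rule above fits in a few lines of code. Below is a minimal, illustrative NumPy sketch (not from the slides) that fits a linear model y = w^T x + b by plain gradient descent on a squared loss over synthetic data; `lr` plays the role of the learning rate λ. Stochastic gradient descent would estimate the gradient from a single sample or a small mini-batch at each step instead of the full training set.

```python
import numpy as np

# Synthetic data for a linear model y = w^T x + b (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w, true_b = np.array([1.0, -2.0, 0.5]), 0.3
y = X @ true_w + true_b + 0.01 * rng.normal(size=100)

w, b = np.zeros(3), 0.0
lr = 0.1  # the learning rate (lambda in the slides)

for step in range(200):
    y_hat = X @ w + b
    err = y_hat - y
    # Gradients of the mean squared error w.r.t. w and b.
    grad_w = X.T @ err / len(y)
    grad_b = err.mean()
    # Gradient descent update: theta_{t+1} = theta_t - lr * gradient.
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # close to true_w, true_b
```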
Figure: Binary Linear Classification
Logistic Regression

How to learn the parameter w: Perceptron, Logistic Regression, etc.

The posterior probability of y = 1 is

P(y = 1|x) = σ(w^T x),

where σ(·) is the logistic function. The posterior probability of y = 0 is P(y = 0|x) = 1 − P(y = 1|x).
Trang 23) log 1 − σ(w T x (i )
)
(16) The gradient of J (w) is
∂J ( w)
∂w =
N X
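As an illustration of this loss and gradient, here is a small NumPy sketch (my own, assuming labels y ∈ {0, 1}, synthetic data, and no bias term) that trains a logistic regression classifier with gradient descent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
# Two Gaussian clouds as a toy binary classification problem.
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

w = np.zeros(2)
lr = 0.5
for _ in range(500):
    p = sigmoid(X @ w)          # P(y = 1 | x) = sigma(w^T x)
    grad = -(X.T @ (y - p))     # dJ/dw = -sum_i x^(i) (y^(i) - sigma(w^T x^(i)))
    w -= lr * grad / len(y)     # divide by N only to keep the step size stable

accuracy = ((sigmoid(X @ w) > 0.5) == y).mean()
print(w, accuracy)
```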
Softmax Regression

P(y = c|x) = exp(w_c^T x) / Σ_{i=1}^{C} exp(w_i^T x)

To represent class c by a one-hot vector:

y = [ I(1 = c), I(2 = c), · · · , I(C = c) ]^T,

where I(·) is the indicator function.

z = W^T x is the input of the softmax function, and the output is ŷ = softmax(z).
Given training set (x^(i), y^(i)), 1 ≤ i ≤ N, the cross-entropy loss is

J(W) = − Σ_{i=1}^{N} (y^(i))^T log ŷ^(i).

The gradient of J(W) is

∂J(W)/∂W = − Σ_{i=1}^{N} x^(i) (y^(i) − ŷ^(i))^T.
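A minimal NumPy sketch of softmax regression with this cross-entropy loss and gradient (my own illustration; the data are synthetic, the labels are encoded as one-hot vectors, and the bias term is omitted).

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # subtract the max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
N, d, C = 150, 4, 3
X = rng.normal(size=(N, d))
W_true = rng.normal(size=(d, C))
labels = (X @ W_true).argmax(axis=1)       # a learnable toy labeling
Y = np.eye(C)[labels]                      # one-hot targets y^(i)

W = np.zeros((d, C))                       # z = W^T x, computed row-wise as X @ W
lr = 0.5
for _ in range(300):
    Y_hat = softmax(X @ W)                         # y_hat^(i) = softmax(z^(i))
    loss = -np.sum(Y * np.log(Y_hat + 1e-12))      # J(W) = -sum_i (y^(i))^T log y_hat^(i)
    grad = -X.T @ (Y - Y_hat)                      # dJ/dW = -sum_i x^(i) (y^(i) - y_hat^(i))^T
    W -= lr * grad / N                             # divide by N only for step-size stability

print(loss, (softmax(X @ W).argmax(axis=1) == labels).mean())
```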
The ideal pipeline of NLP
But in practice: End-to-End
Model
I like this movie.
I dislike this movie.
Feature Extraction

Bag-of-Words
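A bag-of-words feature vector simply counts how often each vocabulary word occurs in a text, discarding word order. A minimal sketch (the corpus and vocabulary are made up for illustration):

```python
import numpy as np

# Toy corpus; in practice the vocabulary is built from the training data.
docs = ["I like this movie .", "I dislike this movie ."]
vocab = sorted({w for d in docs for w in d.lower().split()})
index = {w: i for i, w in enumerate(vocab)}

def bag_of_words(text):
    # One count per vocabulary entry; word order is discarded.
    v = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in index:
            v[index[w]] += 1
    return v

for d in docs:
    print(d, "->", bag_of_words(d))
```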
Text Classification
Artificial Neural Network

Artificial neural networks¹ (ANNs) are a family of models inspired by biological neural networks (the central nervous systems of animals, in particular the brain).

Artificial neural networks are generally presented as systems of interconnected neurons which exchange messages between each other.

1 https://en.wikipedia.org/wiki/Artificial_neural_network
Figure: activation functions — tanh, rectifier, softplus

2 X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In: International Conference on Artificial Intelligence and Statistics. 2011, pp. 315–323.
3 V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010, pp. 807–814.
4 C. Dugas et al. Incorporating second-order functional knowledge for better option pricing. In: Advances in Neural Information Processing Systems (2001).
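For reference, the activation functions plotted in the figure can be written in a few lines of NumPy (a standalone sketch, not from the slides):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))   # sigma(x); saturates at 0 and 1

def tanh(x):
    return np.tanh(x)                 # saturates at -1 and 1

def rectifier(x):
    return np.maximum(0.0, x)         # ReLU: max(0, x)

def softplus(x):
    return np.log1p(np.exp(x))        # smooth approximation of the rectifier

x = np.linspace(-5, 5, 5)
print(logistic(x), tanh(x), rectifier(x), softplus(x), sep="\n")
```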
Types of Artificial Neural Network⁵

Feedforward neural network, also called Multilayer Perceptron (MLP)
Recurrent neural network

5 https://en.wikipedia.org/wiki/Types_of_artificial_neural_networks
Basic Concepts of Deep Learning

Model: Artificial neural networks that consist of multiple hidden non-linear layers
Function: Non-linear function y = σ(Σ_i w_i x_i + b)
Feedforward Neural Network

In a feedforward neural network, the information moves in only one direction, forward: from the input nodes, data goes through the hidden nodes (if any) and to the output nodes. There are no cycles or loops in the network.
Feedforward Computing

Definitions:
n_l: number of neurons in the l-th layer;
f_l(·): activation function of the l-th layer;
W^(l) ∈ R^{n_l × n_{l−1}}: weight matrix between the (l−1)-th layer and the l-th layer;
b^(l) ∈ R^{n_l}: bias vector between the (l−1)-th layer and the l-th layer.
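Using these definitions, a forward pass can be sketched as below. This assumes the standard recursion a^(l) = f_l(W^(l) a^(l−1) + b^(l)) with a^(0) = x, which matches the shapes given above (my own minimal illustration with random weights).

```python
import numpy as np

def forward(x, weights, biases, activations):
    """Compute the network output for input x.

    weights[l]  : W^(l+1), shape (n_{l+1}, n_l)
    biases[l]   : b^(l+1), shape (n_{l+1},)
    activations : list of elementwise activation functions f_l
    """
    a = x                      # a^(0) = input x
    for W, b, f in zip(weights, biases, activations):
        z = W @ a + b          # z^(l) = W^(l) a^(l-1) + b^(l)
        a = f(z)               # a^(l) = f_l(z^(l))
    return a

rng = np.random.default_rng(3)
relu = lambda z: np.maximum(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# A 4-3-1 network: one hidden ReLU layer, one sigmoid output unit.
Ws = [rng.normal(size=(3, 4)), rng.normal(size=(1, 3))]
bs = [np.zeros(3), np.zeros(1)]
print(forward(rng.normal(size=4), Ws, bs, [relu, sigmoid]))
```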
Combining feedforward network and Machine Learning

Given training samples (x^(i), y^(i)), 1 ≤ i ≤ N, and a feedforward network f(x|W, b), the objective function is

J(W, b) = Σ_{i=1}^{N} L(y^(i), f(x^(i)|W, b)),

where L(·, ·) is a loss function (e.g., the cross-entropy loss).
Figure: (e) logistic; (f) tanh
Tricks and Skills⁶ ⁷

Use ReLU non-linearities
Use cross-entropy loss for classification
SGD + mini-batch
Shuffle the training samples (←− very important)
Early stopping
Normalize the input variables (zero mean, unit variance)
Schedule to decrease the learning rate
Use a bit of L1 or L2 regularization on the weights (or a combination)
Use dropout for regularization
Data augmentation

6 G. B. Orr and K.-R. Müller. Neural Networks: Tricks of the Trade. Springer, 2003.
7 Geoff Hinton, Yoshua Bengio & Yann LeCun. Deep Learning. NIPS 2015 Tutorial.
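Several of these tricks (normalizing the inputs, shuffling, mini-batch SGD, a decaying learning rate, and a bit of L2 regularization) fit into a short training loop. The sketch below is illustrative only, using a toy logistic-regression model on synthetic data; none of the constants come from the slides.

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1, 1, (100, 5)), rng.normal(1, 1, (100, 5))])
y = np.concatenate([np.zeros(100), np.ones(100)])

# Normalize the input variables: zero mean, unit variance.
X = (X - X.mean(axis=0)) / X.std(axis=0)

w = np.zeros(5)
base_lr, batch_size = 0.5, 16
for epoch in range(20):
    lr = base_lr / (1.0 + 0.1 * epoch)        # schedule: decrease the learning rate
    perm = rng.permutation(len(y))            # shuffle the training samples each epoch
    for start in range(0, len(y), batch_size):
        idx = perm[start:start + batch_size]  # one mini-batch
        p = 1.0 / (1.0 + np.exp(-X[idx] @ w))
        grad = -(X[idx].T @ (y[idx] - p)) / len(idx)
        w -= lr * (grad + 1e-3 * w)           # gradient step with a bit of L2 regularization

print(((1.0 / (1.0 + np.exp(-X @ w)) > 0.5) == y).mean())
```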
Neural Models for Representation Learning: General Architecture

General Neural Architectures for NLP

How to use neural networks for NLP tasks?

Distributed Representation
General Neural Architectures for NLP⁸

1 represent the words/features with dense vectors (embeddings) by a lookup table;
2 concatenate the vectors;
3 multi-layer neural networks for classification / matching / ranking.

8 R. Collobert et al. Natural language processing (almost) from scratch. In: The Journal of Machine Learning Research (2011).
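A minimal sketch of this three-step architecture (embedding lookup, concatenation, MLP on top) in NumPy. The vocabulary, window size, and dimensions are made up for illustration, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(5)
vocab = {"I": 0, "like": 1, "this": 2, "movie": 3, ".": 4}
emb_dim, hidden, n_classes = 8, 16, 2

E = rng.normal(size=(len(vocab), emb_dim))    # lookup table of word embeddings
W1 = rng.normal(size=(hidden, 3 * emb_dim))   # a window of 3 words feeds the hidden layer
b1 = np.zeros(hidden)
W2 = rng.normal(size=(n_classes, hidden))
b2 = np.zeros(n_classes)

def score(window):
    # 1. represent the words with dense vectors via the lookup table
    vecs = [E[vocab[w]] for w in window]
    # 2. concatenate the vectors
    x = np.concatenate(vecs)
    # 3. multi-layer neural network on top (here: one hidden tanh layer + softmax)
    h = np.tanh(W1 @ x + b1)
    z = W2 @ h + b2
    return np.exp(z) / np.exp(z).sum()

print(score(["I", "like", "this"]))
```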
Difference from the traditional methods

Features: One-hot Representation vs. Distributed Representation
The key point is:

how to encode a word, phrase, sentence, paragraph, or even document into a distributed representation?

Representation Learning
Representation Learning for NLP

Hierarchical Models: two-level CNN
Sequence Models: LSTM, Paragraph Vector
Let's start with the Language Model

A statistical language model is a probability distribution over sequences of words:

P(w_1, w_2, · · · , w_T) = Π_{i=1}^{T} P(w_i | w_1, · · · , w_{i−1})
Neural Probabilistic Language Model⁹

turn unsupervised learning into supervised learning;
avoid the data sparsity of the n-gram model;
project each word into a low-dimensional space.

9 Y. Bengio, R. Ducharme, and P. Vincent. A neural probabilistic language model. In: Journal of Machine Learning Research (2003).
Problem of Very Large Vocabulary

Softmax output:

P_θ^h(w) = exp(s_θ(w, h)) / Σ_{w′} exp(s_θ(w′, h))

Unfortunately, both evaluating P_θ^h(w) and computing the likelihood gradient require normalizing over the entire vocabulary.

Hierarchical Softmax: a tree-structured vocabulary¹⁰
Negative Sampling¹¹, noise-contrastive estimation (NCE)¹²

10 A. Mnih and G. Hinton. A scalable hierarchical distributed language model. In: Advances in Neural Information Processing Systems (2009); F. Morin and Y. Bengio. Hierarchical probabilistic neural network language model. In: AISTATS. Vol. 5. 2005, pp. 246–252.
11 T. Mikolov et al. Efficient estimation of word representations in vector space. In: arXiv preprint arXiv:1301.3781 (2013).
12 A. Mnih and K. Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In: Advances in Neural Information Processing Systems. 2013, pp. 2265–2273.
Linguistic Regularities of Word Embeddings¹³

13 T. Mikolov, W.-t. Yih, and G. Zweig. Linguistic regularities in continuous space word representations. In: HLT-NAACL. 2013, pp. 746–751.
Skip-Gram Model¹⁴

14 Mikolov et al. Efficient estimation of word representations in vector space.
Skip-Gram Model

Given a pair of words (w, c), the probability that the word c is observed in the context of the target word w is given by

Pr(D = 1|w, c) = 1 / (1 + exp(−w^T c)),

where w and c are the embedding vectors of w and c respectively. The probability of not observing word c in the context of w is given by

Pr(D = 0|w, c) = 1 − Pr(D = 1|w, c) = 1 / (1 + exp(w^T c)).
Skip-Gram Model with Negative Sampling

Given a training set D, the word embeddings are learned by maximizing the following objective function:

J(θ) = Σ_{(w,c)∈D} log Pr(D = 1|w, c) + Σ_{(w,c)∈D′} log Pr(D = 0|w, c),

where D′ is a set of randomly sampled negative (word, context) pairs.
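As an illustration of this objective, the sketch below evaluates it for a toy set of observed pairs D and sampled negative pairs D′, and takes one gradient step on the embeddings (my own minimal example; a real implementation would sample negatives from a unigram distribution and iterate over a large corpus).

```python
import numpy as np

rng = np.random.default_rng(6)
V, dim = 10, 4                       # vocabulary size and embedding dimension
W = 0.1 * rng.normal(size=(V, dim))  # target-word embeddings w
C = 0.1 * rng.normal(size=(V, dim))  # context-word embeddings c

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

pos = [(0, 1), (0, 2), (3, 4)]                      # observed (w, c) pairs: D
neg = [(w, int(rng.integers(V))) for w, _ in pos]   # randomly sampled pairs: D'

# Objective: sum_{(w,c) in D} log Pr(D=1|w,c) + sum_{(w,c) in D'} log Pr(D=0|w,c)
obj = sum(np.log(sigmoid(W[w] @ C[c])) for w, c in pos) \
    + sum(np.log(sigmoid(-W[w] @ C[c])) for w, c in neg)
print("objective:", obj)

# One gradient-ascent step on the positive pairs (gradient of log sigma(w^T c)).
lr = 0.1
for w, c in pos:
    g = 1.0 - sigmoid(W[w] @ C[c])
    dw, dc = g * C[c], g * W[w]
    W[w] += lr * dw
    C[c] += lr * dc
```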
Convolutional Neural Network

A convolutional neural network (CNN, or ConvNet) is a type of feed-forward artificial neural network.
One-dimensional convolution

Figure from: http://cs231n.github.io/convolutional-networks/
Convolutional Layer
⊗ is the convolution operation
w^(l) is shared by all the neurons of the l-th layer
Only m + 1 parameters are needed, and n^(l+1) = n^(l) − m + 1
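A narrow one-dimensional convolution with a shared filter of size m can be sketched as follows (illustrative NumPy code; note the output length n − m + 1 and the m + 1 shared parameters, matching the counts above).

```python
import numpy as np

def conv1d(x, w, b):
    """Narrow 1-D convolution: the same filter w (and bias b) is applied
    at every position, so all outputs share only m + 1 parameters."""
    n, m = len(x), len(w)
    return np.array([w @ x[t:t + m] + b for t in range(n - m + 1)])

x = np.arange(8, dtype=float)    # n = 8 inputs
w = np.array([1.0, 0.0, -1.0])   # m = 3 filter weights
b = 0.5
y = conv1d(x, w, b)
print(y, y.shape)                # output length n - m + 1 = 6
```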
Fully Connected Layer vs. Convolutional Layer

(a) Fully Connected Layer
(b) Convolutional Layer
Pooling Layer

It is common to periodically insert a pooling layer in between successive convolutional layers:
progressively reduce the spatial size of the representation;
reduce the amount of parameters and computation in the network;
avoid overfitting.

For a feature map X^(l), we divide it into several (non-)overlapping regions R_k, k = 1, · · · , K. A pooling function down(·) is applied to each region.
X^(l+1) = f(w^(l+1) · down(X^(l)) + b^(l+1))

Two choices of down(·): Maximum Pooling and Average Pooling

pool_max(R_k) = max_{i∈R_k} a_i
pool_avg(R_k) = (1/|R_k|) Σ_{i∈R_k} a_i
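The two pooling functions can be sketched directly (a small NumPy illustration; here the feature map is one-dimensional and split into non-overlapping regions of equal size).

```python
import numpy as np

def pool(x, region, mode="max"):
    """Down-sample x by taking the max or the average over non-overlapping regions."""
    regions = x.reshape(-1, region)    # the regions R_k (assumes len(x) % region == 0)
    if mode == "max":
        return regions.max(axis=1)     # pool_max(R_k)
    return regions.mean(axis=1)        # pool_avg(R_k)

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.0, 7.0, 6.0])
print(pool(x, 2, "max"))   # [3. 5. 4. 7.]
print(pool(x, 2, "avg"))   # [2.  3.5 2.  6.5]
```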
Figure from: http://cs231n.github.io/convolutional-networks/
Large Scale Visual Recognition Challenge, 2010–2015
DeepMind's AlphaGo

http://cs231n.stanford.edu/
CNN for Sentence Modeling

Input: a sentence of length n. After the lookup layer, X = [x_1, x_2, · · · , x_n] ∈ R^{d×n}.
CNN for Sentence Modeling¹⁶

Key steps:
Convolution: z_{t:t+m−1} = x_t ⊕ x_{t+1} ⊕ · · · ⊕ x_{t+m−1} ∈ R^{dm}
Matrix-vector operation: x^l_t = f(W^l z_{t:t+m−1} + b^l)
Pooling (max over time): x^l_i = max_t x^{l−1}_{i,t}

16 Collobert et al. Natural language processing (almost) from scratch.
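Putting these key steps together, a minimal sketch of one convolutional layer with max-over-time pooling over a sentence (random embeddings and weights, made-up dimensions; purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
d, n, m, n_filters = 5, 9, 3, 4    # embedding dim, sentence length, window size, filters
X = rng.normal(size=(d, n))        # X = [x_1, ..., x_n] after the lookup layer

W = rng.normal(size=(n_filters, d * m))
b = np.zeros(n_filters)

# Convolution: z_{t:t+m-1} = x_t (+) ... (+) x_{t+m-1}, then x^l_t = f(W z + b).
feature_map = np.stack([
    np.tanh(W @ X[:, t:t + m].reshape(-1, order="F") + b)   # column-major flatten = concatenation
    for t in range(n - m + 1)
], axis=1)                          # shape: (n_filters, n - m + 1)

# Max-over-time pooling: one value per filter, independent of the sentence length.
sentence_vec = feature_map.max(axis=1)
print(sentence_vec.shape)           # (4,)
```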
CNN for Sentence Modeling¹⁷

multiple filters / multiple channels
pooling (max over time)
Dynamic CNN for Sentence Modeling¹⁸

Key steps:
One-dimensional convolution: x^l_{i,t} = f(w^l_i x_{i,t:t+m−1} + b^l_i)
k-max pooling (max over time): k_l = max(k_top, ⌈(L−l)/L · n⌉)
(optional) Folding: sums every two rows in a feature map component-wise
Multiple filters / multiple channels

18 N. Kalchbrenner, E. Grefenstette, and P. Blunsom. A convolutional neural network for modelling sentences. In: Proceedings of ACL. 2014.
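k-max pooling keeps the k largest values of each row of a feature map while preserving their original order. A small sketch (my own, with a made-up feature map):

```python
import numpy as np

def k_max_pooling(feature_map, k):
    """Keep, for each row, the k largest values in their original left-to-right order."""
    idx = np.argsort(feature_map, axis=1)[:, -k:]   # indices of the k largest per row
    idx.sort(axis=1)                                # restore the original order
    return np.take_along_axis(feature_map, idx, axis=1)

F = np.array([[0.1, 0.9, 0.3, 0.7, 0.2],
              [0.5, 0.4, 0.8, 0.1, 0.6]])
print(k_max_pooling(F, 3))
# [[0.9 0.3 0.7]
#  [0.5 0.8 0.6]]
```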
CNN for Sentence Modeling¹⁹
Recurrent Neural Network (RNN)

h_t = 0, if t = 0;
h_t = f(h_{t−1}, x_t), otherwise.    (67)
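A minimal sketch of the recurrence h_t = f(h_{t−1}, x_t), with f taken to be the common form tanh(W h_{t−1} + U x_t + b) (an assumption for illustration; the slides leave f abstract at this point) and h_0 = 0.

```python
import numpy as np

rng = np.random.default_rng(8)
input_dim, hidden_dim, T = 3, 4, 5

U = rng.normal(size=(hidden_dim, input_dim))    # input-to-hidden weights
W = rng.normal(size=(hidden_dim, hidden_dim))   # hidden-to-hidden weights
b = np.zeros(hidden_dim)

xs = rng.normal(size=(T, input_dim))            # an input sequence x_1 ... x_T
h = np.zeros(hidden_dim)                        # h_0 = 0

for t in range(T):
    # h_t = f(h_{t-1}, x_t); here f is tanh(W h_{t-1} + U x_t + b).
    h = np.tanh(W @ h + U @ xs[t] + b)

print(h)                                        # final hidden state h_T
```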