Iterative Scaling and Coordinate Descent Methods for Maximum Entropy
Fang-Lan Huang, Cho-Jui Hsieh, Kai-Wei Chang, and Chih-Jen Lin
Department of Computer Science, National Taiwan University, Taipei 106, Taiwan
{d93011,b92085,b92084,cjlin}@csie.ntu.edu.tw
Abstract
Maximum entropy (Maxent) is useful in many areas, and iterative scaling (IS) methods are one of the most popular approaches to train Maxent models. With so many variants of IS methods, it is difficult to understand them and see the differences. In this paper, we create a general and unified framework for IS methods and connect them to coordinate descent (CD) methods. Based on this framework, we propose a fast CD method for Maxent. Results show that it is faster than existing IS methods.
1 Introduction
Maximum entropy (Maxent) is widely used in many areas such as natural language processing (NLP) and document classification. Maxent models the conditional probability as
$$P_w(y|x) \equiv S_w(x,y)/T_w(x), \qquad (1)$$
where
$$S_w(x,y) \equiv e^{\sum_t w_t f_t(x,y)}, \qquad T_w(x) \equiv \sum_y S_w(x,y),$$
$f_t(x,y)$ is the $t$-th feature of the context-label pair $(x,y)$, and $w$ is the weight vector.
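As a concrete illustration of (1), here is a minimal NumPy sketch that evaluates $S_w(x,y)$, $T_w(x)$, and $P_w(y|x)$ on toy dense arrays; the shapes, random data, and variable names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Hypothetical toy setting: 3 contexts x, 2 labels y, 4 features t.
# f[x, y, t] stores the feature value f_t(x, y).
rng = np.random.default_rng(0)
f = rng.integers(0, 2, size=(3, 2, 4)).astype(float)
w = rng.normal(size=4)

S = np.exp(f @ w)        # S_w(x, y) = exp(sum_t w_t f_t(x, y)); shape (3, 2)
T = S.sum(axis=1)        # T_w(x) = sum_y S_w(x, y); shape (3,)
P = S / T[:, None]       # P_w(y | x) = S_w(x, y) / T_w(x)

assert np.allclose(P.sum(axis=1), 1.0)  # each row is a distribution over y
```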
Given an empirical probability distribution $\tilde{P}(x,y)$ obtained from training samples, Maxent minimizes the following negative log-likelihood:
$$\min_w \; -\sum_{x,y}\tilde{P}(x,y)\log P_w(y|x) = \sum_x \tilde{P}(x)\log T_w(x) - \sum_t w_t\tilde{P}(f_t), \qquad (2)$$
where $\tilde{P}(x)$ is the marginal empirical distribution and $\tilde{P}(f_t) \equiv \sum_{x,y}\tilde{P}(x,y)f_t(x,y)$.
To avoid overfitting the training samples, some add a regularization term and solve:
$$\min_w \; L(w) \equiv \sum_x \tilde{P}(x)\log T_w(x) - \sum_t w_t\tilde{P}(f_t) + \sum_t \frac{w_t^2}{2\sigma^2}, \qquad (3)$$
where $\sigma$ is a regularization parameter.
1 A complete version of this work is at http://www.csie.ntu.edu.tw/~cjlin/papers/maxent_journal.pdf
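To connect (2) and (3) with code, the sketch below evaluates the regularized objective $L(w)$ for the same dense toy layout; the function name, argument layout, and the commented call are hypothetical.

```python
import numpy as np

def maxent_objective(w, f, P_tilde_xy, sigma2):
    """Regularized negative log-likelihood L(w) of Eq. (3).

    w          : array (T,)        weight vector
    f          : array (X, Y, T)   f[x, y, t] = f_t(x, y) (toy dense layout)
    P_tilde_xy : array (X, Y)      empirical distribution P~(x, y)
    sigma2     : float             regularization parameter sigma^2
    """
    S = np.exp(f @ w)                                   # S_w(x, y)
    T = S.sum(axis=1)                                   # T_w(x)
    P_tilde_x = P_tilde_xy.sum(axis=1)                  # P~(x)
    P_tilde_f = np.einsum('xy,xyt->t', P_tilde_xy, f)   # P~(f_t)
    return P_tilde_x @ np.log(T) - w @ P_tilde_f + (w @ w) / (2.0 * sigma2)

# Example call (hypothetical uniform empirical distribution over the 3x2 toy pairs):
# L = maxent_objective(w, f, np.full((3, 2), 1.0 / 6.0), sigma2=10.0)
```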
We focus on (3) instead of (2) because (3) is strictly convex. Iterative scaling (IS) methods are popular techniques for training Maxent models. They all share the same property of solving a one-variable sub-problem at a time. Existing IS methods include generalized iterative scaling (GIS) by Darroch and Ratcliff (1972), improved iterative scaling (IIS) by Della Pietra et al. (1997), and sequential conditional generalized iterative scaling (SCGIS) by Goodman (2002). In optimization, coordinate descent (CD) is a popular technique that also solves a one-variable sub-problem at a time. With these many variants, it is difficult to see their differences. In Section 2, we propose a unified framework to describe IS and CD methods from an optimization viewpoint. Using this framework, we design a fast CD method for Maxent in Section 3. In Section 4, we compare the proposed method with existing IS methods and LBFGS.
Notation: $n$ is the number of features. The total number of nonzeros in the samples and the average number of nonzeros per feature are, respectively,
$$\#\mathrm{nz} \equiv \sum_{x,y}\sum_{t:\,f_t(x,y)\neq 0} 1 \qquad \text{and} \qquad \bar{l} \equiv \#\mathrm{nz}/n.$$
2 A Framework for IS Methods

2.1 The Framework
IS methods differ in how they approximate the function reduction. They can also be categorized according to whether the components of $w$ are sequentially or parallelly updated. In this section, we create a framework in Figure 1 for these methods.
Sequential update: For a sequential-update algorithm, once a one-variable sub-problem is (approximately) solved, the corresponding component of $w$ is updated, and the new $w$ is used to construct the next sub-problem. The procedure is sketched in Algorithm 1.
[Figure 1: An illustration of various iterative scaling methods. Sequential update: SCGIS finds $A_t(z)$ to approximate $L(w+ze_t)-L(w)$, while CD lets $A_t(z)=L(w+ze_t)-L(w)$. Parallel update: GIS, IIS.]
Algorithm 1 A sequential-update IS method
  While $w$ is not optimal
    For $t = 1, \ldots, n$
      1. Find an approximate function $A_t(z)$ satisfying (4).
      2. Approximately solve $\min_z A_t(z)$ to get $\bar{z}_t$.
      3. $w_t \leftarrow w_t + \bar{z}_t$.
To update the $t$-th component, a sequential IS method solves the following one-variable sub-problem:
$$\min_z \; A_t(z),$$
where $A_t(z)$ bounds the function difference:
$$A_t(z) \ge L(w+ze_t) - L(w) = \sum_x \tilde{P}(x)\log\frac{T_{w+ze_t}(x)}{T_w(x)} + Q_t(z) \;\; \forall z, \quad \text{and} \quad A_t(0)=0, \qquad (4)$$
where
$$Q_t(z) \equiv \frac{2w_tz+z^2}{2\sigma^2} - z\tilde{P}(f_t) \qquad (5)$$
and $e_t$ is the indicator vector of the $t$-th component.
These conditions ensure that the function value is non-increasing, but they do not ensure that it is strictly decreasing; $L(w+\bar{z}_te_t)$ may be only the same as $L(w)$. Therefore, we can impose an additional condition
$$A_t'(0) \neq 0 \qquad (6)$$
whenever $\nabla_t L(w) \neq 0$. If $A_t'(0) = \nabla_t L(w) = 0$, the convexity of $L(w)$ implies that we cannot decrease the function value by modifying $w_t$, so we move on to the next sub-problem.
CD is a sequential-update method. It solves the following sub-problem:
$$\min_z \; A_t^{\mathrm{CD}}(z) = L(w+ze_t) - L(w),$$
so the inequality in (4) holds as an equality. This is the tightest approximation among functions satisfying (4), but also the most expensive one for minimization.
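A minimal sketch of the sequential-update template of Algorithm 1, assuming a user-supplied solve_subproblem(w, t) that approximately minimizes some $A_t(z)$ satisfying (4); the fixed outer-iteration count stands in for a real stopping test.

```python
def sequential_is(w, solve_subproblem, n_features, outer_iters=50):
    """Algorithm 1 template: update one component of w at a time."""
    for _ in range(outer_iters):              # placeholder for "while w is not optimal"
        for t in range(n_features):
            z_bar = solve_subproblem(w, t)    # approximately min_z A_t(z), A_t built from the current w
            w[t] += z_bar                     # the new w is used immediately for the next sub-problem
    return w
```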
Parallel update: A parallel IS method simultaneously constructs $n$ one-variable sub-problems. After (approximately) solving all of them, the whole vector $w$ is updated. Algorithm 2 gives the procedure. The differentiable function $A(z)$ approximates $L(w+z)-L(w)$ and satisfies
$$A(z) \ge L(w+z)-L(w), \quad A(0)=0, \quad \text{and} \quad A(z)=\sum_t A_t(z_t). \qquad (7)$$

Algorithm 2 A parallel-update IS method
  While $w$ is not optimal
    1. Find an approximate function $A(z)$ satisfying (7).
    2. Approximately solve $\min_z A(z)$ to get $\bar{z}$.
    3. For $t = 1, \ldots, n$: $w_t \leftarrow w_t + \bar{z}_t$.

Similar to (4) and (6), the first two conditions ensure that the function value is strictly decreasing. The third condition indicates that $A(z)$ is separable, so
$$\min_z A(z) = \sum_t \min_{z_t} A_t(z_t).$$
That is, the $n$ sub-problems can be solved independently, and a parallel-update method therefore possesses nice implementation properties. However, a sequential-update method can use any $A_t(z_t)$ satisfying (4), while a parallel method must also satisfy the separability in (7). A parallel method could thus be transformed to a sequential method using the same approximate function, but not vice versa.
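For contrast, a sketch of the parallel-update template of Algorithm 2: every sub-problem is built from the same $w$ (the separability in (7)) before any component is updated. The solver and stopping rule are again placeholders.

```python
def parallel_is(w, solve_subproblem, n_features, outer_iters=50):
    """Algorithm 2 template: build all sub-problems from one w, then update jointly."""
    for _ in range(outer_iters):              # placeholder for "while w is not optimal"
        # All A_t(z_t) use the same P_w(y|x), so the n sub-problems are independent.
        z_bar = [solve_subproblem(w, t) for t in range(n_features)]
        for t in range(n_features):
            w[t] += z_bar[t]
    return w
```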
2.2 Existing Iterative Scaling Methods
The approximate functions of GIS, IIS, and SCGIS aim to bound the function reduction
$$L(w+z)-L(w) = \sum_x \tilde{P}(x)\log\frac{T_{w+z}(x)}{T_w(x)} + \sum_t Q_t(z_t), \qquad (8)$$
where $Q_t(z_t)$ is defined in (5). They then use similar inequalities to get approximate functions. Applying $\log\alpha \le \alpha-1$, $\forall\alpha>0$, (8) is bounded by
$$\sum_{x,y} \tilde{P}(x)P_w(y|x)\Big(e^{\sum_t z_tf_t(x,y)}-1\Big) + \sum_t Q_t(z_t). \qquad (9)$$
GIS defines
$$f^\# \equiv \max_{x,y} f^\#(x,y), \quad f^\#(x,y) \equiv \sum_t f_t(x,y), \quad f_{n+1}(x,y) \equiv f^\# - f^\#(x,y), \quad \text{and} \quad z_{n+1}=0.$$
Assuming $f_t(x,y)\ge 0$, $\forall t,x,y$, and using Jensen's inequality,
$$e^{\sum_t z_tf_t(x,y)} = e^{\sum_{t=1}^{n+1}\frac{f_t(x,y)}{f^\#}(z_tf^\#)} \le \sum_{t=1}^{n+1}\frac{f_t(x,y)}{f^\#}e^{z_tf^\#} = \sum_t \frac{f_t(x,y)}{f^\#}e^{z_tf^\#} + \frac{f_{n+1}(x,y)}{f^\#}. \qquad (10)$$
Substituting (10) into (9), GIS's approximate function is
$$A_t^{\mathrm{GIS}}(z_t) = \frac{e^{z_tf^\#}-1}{f^\#}\sum_{x,y}\tilde{P}(x)P_w(y|x)f_t(x,y) + Q_t(z_t).$$
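The sketch below simply transcribes $A_t^{\mathrm{GIS}}(z_t)$ for the toy dense layout used earlier; it is an illustration of the formula, not the paper's code.

```python
import numpy as np

def A_GIS(z_t, t, w, f, P_tilde_xy, sigma2):
    """GIS approximate function A_t^GIS(z_t) for feature t (dense toy layout)."""
    S = np.exp(f @ w)
    P_w = S / S.sum(axis=1, keepdims=True)                      # P_w(y | x)
    P_tilde_x = P_tilde_xy.sum(axis=1)                          # P~(x)
    f_sharp = f.sum(axis=2).max()                               # f# = max_{x,y} sum_t f_t(x, y)
    E_ft = np.einsum('x,xy,xy->', P_tilde_x, P_w, f[:, :, t])   # sum_{x,y} P~(x) P_w(y|x) f_t(x, y)
    P_tilde_ft = np.einsum('xy,xy->', P_tilde_xy, f[:, :, t])   # P~(f_t)
    Q_t = (2 * w[t] * z_t + z_t**2) / (2 * sigma2) - z_t * P_tilde_ft   # Q_t(z_t) from (5)
    return (np.exp(z_t * f_sharp) - 1) / f_sharp * E_ft + Q_t
```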
IIS applies Jensen's inequality
$$e^{\sum_t z_tf_t(x,y)} \le \sum_t \frac{f_t(x,y)}{f^\#(x,y)}e^{z_tf^\#(x,y)}$$
on (9) to get the approximate function
$$A_t^{\mathrm{IIS}}(z_t) = \sum_{x,y}\tilde{P}(x)P_w(y|x)f_t(x,y)\frac{e^{z_tf^\#(x,y)}-1}{f^\#(x,y)} + Q_t(z_t).$$
SCGIS is a sequential-update method. It replaces $f^\#$ in GIS with $f_t^\# \equiv \max_{x,y}f_t(x,y)$. Using $z_te_t$ as $z$ in (8), a derivation similar to (10) gives
$$e^{z_tf_t(x,y)} \le \frac{f_t(x,y)}{f_t^\#}e^{z_tf_t^\#} + \frac{f_t^\#-f_t(x,y)}{f_t^\#}.$$
The approximate function of SCGIS is then
$$A_t^{\mathrm{SCGIS}}(z_t) = \frac{e^{z_tf_t^\#}-1}{f_t^\#}\sum_{x,y}\tilde{P}(x)P_w(y|x)f_t(x,y) + Q_t(z_t).$$
We can prove the following convergence result for these methods (proof omitted):

Theorem 1. Assume each sub-problem $A_t^s(z_t)$ is exactly minimized, where $s$ is IIS, GIS, SCGIS, or CD. The sequence $\{w^k\}$ generated by any of these four methods linearly converges. That is, there is a constant $\mu \in (0,1)$ such that
$$L(w^{k+1}) - L(w^*) \le (1-\mu)\big(L(w^k) - L(w^*)\big), \;\forall k,$$
where $w^*$ is the global optimum of (3).
2.3 Solving one-variable sub-problems

Without the regularization term, by setting ${A_t^s}'(z_t)=0$, GIS and SCGIS both have a simple closed-form solution of the sub-problem. With the regularization term, the sub-problems no longer have a closed-form solution. We discuss the cost of solving sub-problems by the Newton method, which iteratively updates $z_t$ by
$$z_t \leftarrow z_t - {A_t^s}'(z_t)/{A_t^s}''(z_t). \qquad (11)$$
Here $s$ indicates an IS or a CD method. We show ${A_t^s}'(z_t)$, as ${A_t^s}''(z_t)$ is similar. We have
$${A_t^s}'(z_t) = Q_t'(z_t) + \sum_{x,y}\tilde{P}(x)P_w(y|x)f_t(x,y)e^{z_tf^s(x,y)}, \qquad (12)$$
where
$$f^s(x,y) \equiv \begin{cases} f^\# & \text{if } s \text{ is GIS},\\ f_t^\# & \text{if } s \text{ is SCGIS},\\ f^\#(x,y) & \text{if } s \text{ is IIS}.\end{cases}$$
For CD,
$${A_t^{\mathrm{CD}}}'(z_t) = Q_t'(z_t) + \sum_{x,y}\tilde{P}(x)P_{w+z_te_t}(y|x)f_t(x,y). \qquad (13)$$
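Here is a sketch of one Newton iteration (11) on the GIS sub-problem, using the first derivative (12) and the corresponding second derivative; replacing $f^\#$ by $f_t^\#$ gives the SCGIS variant. Names and the dense layout are illustrative assumptions.

```python
import numpy as np

def newton_step_gis(z_t, t, w, f, P_tilde_xy, sigma2):
    """One Newton iteration (11) on A_t^GIS, with derivatives from (12)."""
    S = np.exp(f @ w)
    P_w = S / S.sum(axis=1, keepdims=True)
    P_tilde_x = P_tilde_xy.sum(axis=1)
    f_sharp = f.sum(axis=2).max()                               # f^s(x, y) = f# for GIS
    E_ft = np.einsum('x,xy,xy->', P_tilde_x, P_w, f[:, :, t])   # sum_{x,y} P~(x) P_w(y|x) f_t(x, y)
    P_tilde_ft = np.einsum('xy,xy->', P_tilde_xy, f[:, :, t])   # P~(f_t)

    Q_prime = (w[t] + z_t) / sigma2 - P_tilde_ft                # Q_t'(z_t)
    Q_double = 1.0 / sigma2                                     # Q_t''(z_t)
    A_prime = Q_prime + E_ft * np.exp(z_t * f_sharp)            # (12)
    A_double = Q_double + E_ft * f_sharp * np.exp(z_t * f_sharp)
    return z_t - A_prime / A_double                             # (11)
```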
A parallel-update method evaluates $P_w(y|x)$ once to construct all $n$ sub-problems, but sequential-update methods evaluate $P_w(y|x)$ after every sub-problem. Consider the cost of the Newton method (11). For CD, because $P_{w+z_te_t}(y|x)$ in (13) depends on $z_t$, a direct implementation needs $O(\#\mathrm{nz})$ operations per Newton iteration to evaluate $S_{w+z_te_t}(x,y)$ and $T_{w+z_te_t}(x)$ $\forall x,y$. A trick to trade memory for time is to store $S_w(x,y)$ and $T_w(x)$ and use
$$S_{w+z_te_t}(x,y) = S_w(x,y)e^{z_tf_t(x,y)}, \qquad T_{w+z_te_t}(x) = T_w(x) + \sum_y S_w(x,y)\big(e^{z_tf_t(x,y)}-1\big).$$
Since $e^{z_tf_t(x,y)}-1 = 0$ whenever $f_t(x,y)=0$, this procedure reduces the $O(\#\mathrm{nz})$ operations to about $O(\bar{l})$ per Newton iteration. This trick has been used in SCGIS (Goodman, 2002). Thus, the first Newton iteration of all methods discussed here costs $O(\bar{l})$. For GIS and SCGIS, if $\sum_{x,y}\tilde{P}(x)P_w(y|x)f_t(x,y)$ is stored at the first Newton iteration, then (12) can be evaluated in $O(1)$ at each subsequent iteration. For IIS, because $f^\#(x,y)$ of (12) depends on $x$ and $y$, we cannot store $\sum_{x,y}\tilde{P}(x)P_w(y|x)f_t(x,y)$ as in GIS and SCGIS. Hence each Newton direction needs $O(\bar{l})$. We summarize the cost for solving sub-problems in Table 1.

Table 1: Cost of finding a Newton direction for each method's sub-problem.

                            CD            GIS           SCGIS         IIS
First Newton direction      $O(\bar{l})$  $O(\bar{l})$  $O(\bar{l})$  $O(\bar{l})$
Each subsequent direction   $O(\bar{l})$  $O(1)$        $O(1)$        $O(\bar{l})$
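A sketch of the memory-for-time trick: keep $S_w(x,y)$ and $T_w(x)$ in memory and, when moving from $w$ to $w+z_te_t$, touch only the pairs with $f_t(x,y)\neq 0$. The precomputed per-feature index lists are an assumed data layout.

```python
import numpy as np

def apply_step(z_t, t, S, T, f, feat_rows, feat_cols):
    """In-place update of S_w and T_w to S_{w + z_t e_t} and T_{w + z_t e_t}.

    feat_rows[t], feat_cols[t]: indices (x, y) with f_t(x, y) != 0,
    i.e. about l-bar pairs on average, so the cost is O(l-bar), not O(#nz).
    """
    xs, ys = feat_rows[t], feat_cols[t]
    factor = np.exp(z_t * f[xs, ys, t])            # e^{z_t f_t(x, y)} on the nonzeros only
    np.add.at(T, xs, S[xs, ys] * (factor - 1.0))   # T_w(x) += sum_y S_w(x, y)(e^{z_t f_t(x, y)} - 1)
    S[xs, ys] *= factor                            # S_w(x, y) <- S_w(x, y) e^{z_t f_t(x, y)}
```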
3 Comparison and a New CD Method

3.1 Comparison of IS/CD methods
An IS or CD method falls into a place between two extreme designs: a loose bound that is easy to minimize versus a tight bound that is hard to minimize. There is a tradeoff between the tightness to bound the function difference and the hardness to solve the sub-problem. To see how IS and CD methods fit into this explanation, we obtain the following relationships among their approximate functions:
$$A_t^{\mathrm{CD}}(z_t) \le A_t^{\mathrm{SCGIS}}(z_t) \le A_t^{\mathrm{GIS}}(z_t), \qquad A_t^{\mathrm{CD}}(z_t) \le A_t^{\mathrm{IIS}}(z_t) \le A_t^{\mathrm{GIS}}(z_t) \quad \forall z_t. \qquad (14)$$
Trang 4give faster convergence by handling fewer
sub-problems, the total time may not be less due to
the higher cost of each sub-problem
3.2 A Fast CD Method

We propose a CD method which cheaply solves each sub-problem but still enjoys fast final convergence. This method is modified from the CD approach for linear SVM by Chang et al. (2008): we approximately minimize $A_t^{\mathrm{CD}}(z)$ by applying only one Newton iteration. The Newton direction at $z=0$ is
$$d = -{A_t^{\mathrm{CD}}}'(0)/{A_t^{\mathrm{CD}}}''(0). \qquad (15)$$
As taking the full Newton direction may not decrease the function value, we need a line search to find $\lambda$ such that $z=\lambda d$ satisfies the following sufficient decrease condition:
$$A_t^{\mathrm{CD}}(z) - A_t^{\mathrm{CD}}(0) = A_t^{\mathrm{CD}}(z) \le \gamma z\,{A_t^{\mathrm{CD}}}'(0), \qquad (16)$$
where $\gamma \in (0,1)$. We sequentially check $\lambda = 1, \beta, \beta^2, \ldots$, where $\beta \in (0,1)$. The line search procedure is guaranteed to stop (proof omitted). We can further prove that near the optimum two results hold. First, the Newton direction (15) satisfies the sufficient decrease condition (16) with $\lambda=1$. Then the cost for each sub-problem is $O(\bar{l})$, similar to that for exactly solving sub-problems of GIS or SCGIS. This result is important, as otherwise each line search step expensively evaluates $A_t^{\mathrm{CD}}(z)$. Second, taking one Newton direction of $A_t^{\mathrm{CD}}(z_t)$ reduces the objective sufficiently to retain the fast convergence of CD. The proposed method thus saves the effort of exactly solving sub-problems, while still maintains fast convergence.
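A minimal sketch of the proposed CD step: one Newton direction (15) followed by a backtracking line search for the sufficient decrease condition (16). The callable A_cd and the precomputed derivatives g, h are assumptions about how the surrounding code is organized.

```python
def fast_cd_step(A_cd, g, h, gamma=1e-3, beta=0.5, max_backtracks=30):
    """One-Newton-iteration CD step with backtracking line search.

    A_cd : callable, A_cd(z) = A_t^CD(z) = L(w + z e_t) - L(w)
    g, h : A_t^CD'(0) and A_t^CD''(0)
    """
    d = -g / h                           # Newton direction (15) at z = 0
    lam = 1.0
    for _ in range(max_backtracks):      # try lambda = 1, beta, beta^2, ...
        z = lam * d
        if A_cd(z) <= gamma * z * g:     # sufficient decrease condition (16)
            return z
        lam *= beta
    return 0.0                           # no acceptable step found (not expected near the optimum)
```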
4 Experiments

We consider two NLP data sets: part-of-speech tagging on the BROWN corpus (http://www.nltk.org) and chunking on the CoNLL2000 data (cnts.ua.ac.be/conll2000/chunking). We extend the Maxent implementation at http://maxent.sourceforge.net to include CD, GIS, SCGIS, and LBFGS for comparisons.
[Figure 2: First row: training time versus the relative function value difference (17); second row: training time versus testing accuracy (BROWN) and F1 (CoNLL2000). Time is in seconds. Panels: (a) BROWN, (b) CoNLL2000, (c) BROWN, (d) CoNLL2000. Curves: CD, SCGIS, GIS, LBFGS.]
We set $\gamma = 0.001$ in (16).
We begin by checking training time versus the relative difference of the function value to the optimum:
$$\big(L(w)-L(w^*)\big)/L(w^*). \qquad (17)$$
Results are in the first row of Figure 2. The second row of Figure 2 shows testing accuracy/F1 versus training time. Among the three IS/CD methods, the relative speed is consistent with the tightness of their approximate functions; see (14). LBFGS does not perform well in the beginning.
5 Conclusions

In summary, we create a general and unified framework for iterative scaling methods and connect them to coordinate descent. Based on this framework, we propose a fast CD method for Maxent, which is faster than existing IS methods in our experiments.
References

K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. 2008. Coordinate descent method for large-scale L2-loss linear SVM. JMLR, 9:1369-1398.

John N. Darroch and Douglas Ratcliff. 1972. Generalized iterative scaling for log-linear models. Ann. Math. Statist., 43(5):1470-1480.

Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. 1997. Inducing features of random fields. IEEE PAMI, 19(4):380-393.

Joshua Goodman. 2002. Sequential conditional generalized iterative scaling. In ACL, pages 9-16.

Robert Malouf. 2002. A comparison of algorithms for maximum entropy parameter estimation. In CONLL.