Iterative Scaling and Coordinate Descent Methods for Maximum Entropy
Fang-Lan Huang, Cho-Jui Hsieh, Kai-Wei Chang, and Chih-Jen Lin
Department of Computer Science, National Taiwan University, Taipei 106, Taiwan
{d93011,b92085,b92084,cjlin}@csie.ntu.edu.tw
Abstract
Maximum entropy (Maxent) is useful in many areas, and iterative scaling (IS) methods are one of the most popular approaches to train Maxent models. With so many variants of IS methods, it is difficult to understand them and see the differences. In this paper, we create a general and unified framework for IS methods and connect them to coordinate descent (CD) methods. Based on this framework, we propose a fast CD method for Maxent. Results show that it is faster than existing IS methods.
1 Introduction
Maximum entropy (Maxent) is widely used in many areas such as natural language processing (NLP) and document classification. Maxent models the conditional probability as
$$P_w(y|x) \equiv S_w(x,y)/T_w(x), \qquad (1)$$
where
$$S_w(x,y) \equiv e^{\sum_t w_t f_t(x,y)}, \qquad T_w(x) \equiv \sum_y S_w(x,y),$$
$f_t(x,y)$ is the $t$-th feature of the context-label pair $(x,y)$, and $w$ is the weight vector.
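As a concrete illustration of (1), here is a minimal NumPy sketch that evaluates $S_w(x,y)$, $T_w(x)$, and $P_w(y|x)$ on toy dense arrays; the shapes, random data, and variable names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Hypothetical toy setting: 3 contexts x, 2 labels y, 4 features t.
# f[x, y, t] stores the feature value f_t(x, y).
rng = np.random.default_rng(0)
f = rng.integers(0, 2, size=(3, 2, 4)).astype(float)
w = rng.normal(size=4)

S = np.exp(f @ w)        # S_w(x, y) = exp(sum_t w_t f_t(x, y)); shape (3, 2)
T = S.sum(axis=1)        # T_w(x) = sum_y S_w(x, y); shape (3,)
P = S / T[:, None]       # P_w(y | x) = S_w(x, y) / T_w(x)

assert np.allclose(P.sum(axis=1), 1.0)  # each row is a distribution over y
```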
Given an empirical probability distribution $\tilde{P}(x,y)$ obtained from training samples, Maxent minimizes the following negative log-likelihood:
$$\min_w \; -\sum_{x,y}\tilde{P}(x,y)\log P_w(y|x) = \sum_x \tilde{P}(x)\log T_w(x) - \sum_t w_t\tilde{P}(f_t), \qquad (2)$$
where $\tilde{P}(x)$ is the marginal empirical distribution and $\tilde{P}(f_t) \equiv \sum_{x,y}\tilde{P}(x,y)f_t(x,y)$.
To avoid overfitting the training samples, some add a regularization term and solve:
$$\min_w \; L(w) \equiv \sum_x \tilde{P}(x)\log T_w(x) - \sum_t w_t\tilde{P}(f_t) + \sum_t \frac{w_t^2}{2\sigma^2}, \qquad (3)$$
where $\sigma$ is a regularization parameter.
1 A complete version of this work is at http://www.csie.ntu.edu.tw/~cjlin/papers/maxent_journal.pdf
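To connect (2) and (3) with code, the sketch below evaluates the regularized objective $L(w)$ for the same dense toy layout; the function name, argument layout, and the commented call are hypothetical.

```python
import numpy as np

def maxent_objective(w, f, P_tilde_xy, sigma2):
    """Regularized negative log-likelihood L(w) of Eq. (3).

    w          : array (T,)        weight vector
    f          : array (X, Y, T)   f[x, y, t] = f_t(x, y) (toy dense layout)
    P_tilde_xy : array (X, Y)      empirical distribution P~(x, y)
    sigma2     : float             regularization parameter sigma^2
    """
    S = np.exp(f @ w)                                   # S_w(x, y)
    T = S.sum(axis=1)                                   # T_w(x)
    P_tilde_x = P_tilde_xy.sum(axis=1)                  # P~(x)
    P_tilde_f = np.einsum('xy,xyt->t', P_tilde_xy, f)   # P~(f_t)
    return P_tilde_x @ np.log(T) - w @ P_tilde_f + (w @ w) / (2.0 * sigma2)

# Example call (hypothetical uniform empirical distribution over the 3x2 toy pairs):
# L = maxent_objective(w, f, np.full((3, 2), 1.0 / 6.0), sigma2=10.0)
```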
We focus on (3) instead of (2) because (3) is strictly convex. Iterative scaling (IS) methods are popular techniques for training Maxent models. They all share the same property of solving a one-variable sub-problem at a time. Existing IS methods include generalized iterative scaling (GIS) by Darroch and Ratcliff (1972), improved iterative scaling (IIS) by Della Pietra et al. (1997), and sequential conditional generalized iterative scaling (SCGIS) by Goodman (2002). In optimization, coordinate descent (CD) is a popular technique that also solves a one-variable sub-problem at a time. With these many variants, it is difficult to see their differences. In Section 2, we propose a unified framework to describe IS and CD methods from an optimization viewpoint. Using this framework, we design a fast CD method for Maxent in Section 3. In Section 4, we compare the proposed method with existing IS methods and LBFGS.
Notation: $n$ is the number of features. The total number of nonzeros in the samples and the average number of nonzeros per feature are, respectively,
$$\#\mathrm{nz} \equiv \sum_{x,y}\sum_{t:\,f_t(x,y)\neq 0} 1 \qquad \text{and} \qquad \bar{l} \equiv \#\mathrm{nz}/n.$$
2 A Framework for IS Methods

2.1 The Framework
IS methods differ in how they approximate the function reduction. They can also be categorized according to whether the components of $w$ are sequentially or parallelly updated. In this section, we create a framework in Figure 1 for these methods.
Sequential update: For a sequential-update algorithm, once a one-variable sub-problem is (approximately) solved, the corresponding component of $w$ is updated, and the new $w$ is used to construct the next sub-problem. The procedure is sketched in Algorithm 1.
[Figure 1: An illustration of various iterative scaling methods. Sequential update: SCGIS finds $A_t(z)$ to approximate $L(w+ze_t)-L(w)$, while CD lets $A_t(z)=L(w+ze_t)-L(w)$. Parallel update: GIS, IIS.]
Algorithm 1 A sequential-update IS method
  While $w$ is not optimal
    For $t = 1, \ldots, n$
      1. Find an approximate function $A_t(z)$ satisfying (4).
      2. Approximately solve $\min_z A_t(z)$ to get $\bar{z}_t$.
      3. $w_t \leftarrow w_t + \bar{z}_t$.
To update the $t$-th component, a sequential IS method solves the following one-variable sub-problem:
$$\min_z \; A_t(z),$$
where $A_t(z)$ bounds the function difference:
$$A_t(z) \ge L(w+ze_t) - L(w) = \sum_x \tilde{P}(x)\log\frac{T_{w+ze_t}(x)}{T_w(x)} + Q_t(z) \;\; \forall z, \quad \text{and} \quad A_t(0)=0, \qquad (4)$$
where
$$Q_t(z) \equiv \frac{2w_tz+z^2}{2\sigma^2} - z\tilde{P}(f_t) \qquad (5)$$
and $e_t$ is the indicator vector of the $t$-th component.
These conditions ensure that the function value is non-increasing, but they do not ensure that it is strictly decreasing; $L(w+\bar{z}_te_t)$ may be only the same as $L(w)$. Therefore, we can impose an additional condition
$$A_t'(0) \neq 0 \qquad (6)$$
whenever $\nabla_t L(w) \neq 0$. If $A_t'(0) = \nabla_t L(w) = 0$, the convexity of $L(w)$ implies that we cannot decrease the function value by modifying $w_t$, so we move on to the next sub-problem.
CD is a sequential-update method. It solves the following sub-problem:
$$\min_z \; A_t^{\mathrm{CD}}(z) = L(w+ze_t) - L(w),$$
so the inequality in (4) holds as an equality. This is the tightest approximation among functions satisfying (4), but also the most expensive one for minimization.
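A minimal sketch of the sequential-update template of Algorithm 1, assuming a user-supplied solve_subproblem(w, t) that approximately minimizes some $A_t(z)$ satisfying (4); the fixed outer-iteration count stands in for a real stopping test.

```python
def sequential_is(w, solve_subproblem, n_features, outer_iters=50):
    """Algorithm 1 template: update one component of w at a time."""
    for _ in range(outer_iters):              # placeholder for "while w is not optimal"
        for t in range(n_features):
            z_bar = solve_subproblem(w, t)    # approximately min_z A_t(z), A_t built from the current w
            w[t] += z_bar                     # the new w is used immediately for the next sub-problem
    return w
```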
Parallel update: A parallel IS method simultaneously constructs $n$ one-variable sub-problems. After (approximately) solving all of them, the whole vector $w$ is updated. Algorithm 2 gives the procedure. The differentiable function $A(z)$ approximates $L(w+z)-L(w)$ and satisfies
$$A(z) \ge L(w+z)-L(w), \quad A(0)=0, \quad \text{and} \quad A(z)=\sum_t A_t(z_t). \qquad (7)$$

Algorithm 2 A parallel-update IS method
  While $w$ is not optimal
    1. Find an approximate function $A(z)$ satisfying (7).
    2. Approximately solve $\min_z A(z)$ to get $\bar{z}$.
    3. For $t = 1, \ldots, n$: $w_t \leftarrow w_t + \bar{z}_t$.

Similar to (4) and (6), the first two conditions ensure that the function value is strictly decreasing. The third condition indicates that $A(z)$ is separable, so
$$\min_z A(z) = \sum_t \min_{z_t} A_t(z_t).$$
That is, the $n$ sub-problems can be solved independently, and a parallel-update method therefore possesses nice implementation properties. However, a sequential-update method can use any $A_t(z_t)$ satisfying (4), while a parallel method must also satisfy the separability in (7). A parallel method could thus be transformed to a sequential method using the same approximate function, but not vice versa.
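For contrast, a sketch of the parallel-update template of Algorithm 2: every sub-problem is built from the same $w$ (the separability in (7)) before any component is updated. The solver and stopping rule are again placeholders.

```python
def parallel_is(w, solve_subproblem, n_features, outer_iters=50):
    """Algorithm 2 template: build all sub-problems from one w, then update jointly."""
    for _ in range(outer_iters):              # placeholder for "while w is not optimal"
        # All A_t(z_t) use the same P_w(y|x), so the n sub-problems are independent.
        z_bar = [solve_subproblem(w, t) for t in range(n_features)]
        for t in range(n_features):
            w[t] += z_bar[t]
    return w
```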
2.2 Existing Iterative Scaling Methods
The approximate functions of GIS, IIS, and SCGIS aim to bound the function reduction
$$L(w+z)-L(w) = \sum_x \tilde{P}(x)\log\frac{T_{w+z}(x)}{T_w(x)} + \sum_t Q_t(z_t), \qquad (8)$$
where $Q_t(z_t)$ is defined in (5). They then use similar inequalities to get approximate functions. Applying $\log\alpha \le \alpha-1$, $\forall\alpha>0$, (8) is bounded by
$$\sum_{x,y} \tilde{P}(x)P_w(y|x)\Big(e^{\sum_t z_tf_t(x,y)}-1\Big) + \sum_t Q_t(z_t). \qquad (9)$$
GIS defines
$$f^\# \equiv \max_{x,y} f^\#(x,y), \quad f^\#(x,y) \equiv \sum_t f_t(x,y), \quad f_{n+1}(x,y) \equiv f^\# - f^\#(x,y), \quad \text{and} \quad z_{n+1}=0.$$
Assuming $f_t(x,y)\ge 0$, $\forall t,x,y$, and using Jensen's inequality,
$$e^{\sum_t z_tf_t(x,y)} = e^{\sum_{t=1}^{n+1}\frac{f_t(x,y)}{f^\#}(z_tf^\#)} \le \sum_{t=1}^{n+1}\frac{f_t(x,y)}{f^\#}e^{z_tf^\#} = \sum_t \frac{f_t(x,y)}{f^\#}e^{z_tf^\#} + \frac{f_{n+1}(x,y)}{f^\#}. \qquad (10)$$
Substituting (10) into (9), GIS's approximate function is
$$A_t^{\mathrm{GIS}}(z_t) = \frac{e^{z_tf^\#}-1}{f^\#}\sum_{x,y}\tilde{P}(x)P_w(y|x)f_t(x,y) + Q_t(z_t).$$
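The sketch below simply transcribes $A_t^{\mathrm{GIS}}(z_t)$ for the toy dense layout used earlier; it is an illustration of the formula, not the paper's code.

```python
import numpy as np

def A_GIS(z_t, t, w, f, P_tilde_xy, sigma2):
    """GIS approximate function A_t^GIS(z_t) for feature t (dense toy layout)."""
    S = np.exp(f @ w)
    P_w = S / S.sum(axis=1, keepdims=True)                      # P_w(y | x)
    P_tilde_x = P_tilde_xy.sum(axis=1)                          # P~(x)
    f_sharp = f.sum(axis=2).max()                               # f# = max_{x,y} sum_t f_t(x, y)
    E_ft = np.einsum('x,xy,xy->', P_tilde_x, P_w, f[:, :, t])   # sum_{x,y} P~(x) P_w(y|x) f_t(x, y)
    P_tilde_ft = np.einsum('xy,xy->', P_tilde_xy, f[:, :, t])   # P~(f_t)
    Q_t = (2 * w[t] * z_t + z_t**2) / (2 * sigma2) - z_t * P_tilde_ft   # Q_t(z_t) from (5)
    return (np.exp(z_t * f_sharp) - 1) / f_sharp * E_ft + Q_t
```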
IIS applies Jensen's inequality
$$e^{\sum_t z_tf_t(x,y)} \le \sum_t \frac{f_t(x,y)}{f^\#(x,y)}e^{z_tf^\#(x,y)}$$
on (9) to get the approximate function
$$A_t^{\mathrm{IIS}}(z_t) = \sum_{x,y}\tilde{P}(x)P_w(y|x)f_t(x,y)\frac{e^{z_tf^\#(x,y)}-1}{f^\#(x,y)} + Q_t(z_t).$$
SCGIS is a sequential-update method. It replaces $f^\#$ in GIS with $f_t^\# \equiv \max_{x,y}f_t(x,y)$. Using $z_te_t$ as $z$ in (8), a derivation similar to (10) gives
$$e^{z_tf_t(x,y)} \le \frac{f_t(x,y)}{f_t^\#}e^{z_tf_t^\#} + \frac{f_t^\#-f_t(x,y)}{f_t^\#}.$$
The approximate function of SCGIS is then
$$A_t^{\mathrm{SCGIS}}(z_t) = \frac{e^{z_tf_t^\#}-1}{f_t^\#}\sum_{x,y}\tilde{P}(x)P_w(y|x)f_t(x,y) + Q_t(z_t).$$
We can prove the following convergence result for these methods (proof omitted):

Theorem 1. Assume each sub-problem $A_t^s(z_t)$ is exactly minimized, where $s$ is IIS, GIS, SCGIS, or CD. The sequence $\{w^k\}$ generated by any of these four methods linearly converges. That is, there is a constant $\mu \in (0,1)$ such that
$$L(w^{k+1}) - L(w^*) \le (1-\mu)\big(L(w^k) - L(w^*)\big), \;\forall k,$$
where $w^*$ is the global optimum of (3).
2.3 Solving one-variable sub-problems

Without the regularization term, by setting ${A_t^s}'(z_t)=0$, GIS and SCGIS both have a simple closed-form solution of the sub-problem. With the regularization term, the sub-problems no longer have a closed-form solution. We discuss the cost of solving sub-problems by the Newton method, which iteratively updates $z_t$ by
$$z_t \leftarrow z_t - {A_t^s}'(z_t)/{A_t^s}''(z_t). \qquad (11)$$
Here $s$ indicates an IS or a CD method. We show ${A_t^s}'(z_t)$, as ${A_t^s}''(z_t)$ is similar. We have
$${A_t^s}'(z_t) = Q_t'(z_t) + \sum_{x,y}\tilde{P}(x)P_w(y|x)f_t(x,y)e^{z_tf^s(x,y)}, \qquad (12)$$
where
$$f^s(x,y) \equiv \begin{cases} f^\# & \text{if } s \text{ is GIS},\\ f_t^\# & \text{if } s \text{ is SCGIS},\\ f^\#(x,y) & \text{if } s \text{ is IIS}.\end{cases}$$
For CD,
$${A_t^{\mathrm{CD}}}'(z_t) = Q_t'(z_t) + \sum_{x,y}\tilde{P}(x)P_{w+z_te_t}(y|x)f_t(x,y). \qquad (13)$$
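Here is a sketch of one Newton iteration (11) on the GIS sub-problem, using the first derivative (12) and the corresponding second derivative; replacing $f^\#$ by $f_t^\#$ gives the SCGIS variant. Names and the dense layout are illustrative assumptions.

```python
import numpy as np

def newton_step_gis(z_t, t, w, f, P_tilde_xy, sigma2):
    """One Newton iteration (11) on A_t^GIS, with derivatives from (12)."""
    S = np.exp(f @ w)
    P_w = S / S.sum(axis=1, keepdims=True)
    P_tilde_x = P_tilde_xy.sum(axis=1)
    f_sharp = f.sum(axis=2).max()                               # f^s(x, y) = f# for GIS
    E_ft = np.einsum('x,xy,xy->', P_tilde_x, P_w, f[:, :, t])   # sum_{x,y} P~(x) P_w(y|x) f_t(x, y)
    P_tilde_ft = np.einsum('xy,xy->', P_tilde_xy, f[:, :, t])   # P~(f_t)

    Q_prime = (w[t] + z_t) / sigma2 - P_tilde_ft                # Q_t'(z_t)
    Q_double = 1.0 / sigma2                                     # Q_t''(z_t)
    A_prime = Q_prime + E_ft * np.exp(z_t * f_sharp)            # (12)
    A_double = Q_double + E_ft * f_sharp * np.exp(z_t * f_sharp)
    return z_t - A_prime / A_double                             # (11)
```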
A parallel-update method evaluates $P_w(y|x)$ once to construct all $n$ sub-problems, but sequential-update methods evaluate $P_w(y|x)$ after every sub-problem. Consider the cost of the Newton method (11). For CD, because $P_{w+z_te_t}(y|x)$ in (13) depends on $z_t$, a direct implementation needs $O(\#\mathrm{nz})$ operations per Newton iteration to evaluate $S_{w+z_te_t}(x,y)$ and $T_{w+z_te_t}(x)$ $\forall x,y$. A trick to trade memory for time is to store $S_w(x,y)$ and $T_w(x)$ and use
$$S_{w+z_te_t}(x,y) = S_w(x,y)e^{z_tf_t(x,y)}, \qquad T_{w+z_te_t}(x) = T_w(x) + \sum_y S_w(x,y)\big(e^{z_tf_t(x,y)}-1\big).$$
Since $e^{z_tf_t(x,y)}-1 = 0$ whenever $f_t(x,y)=0$, this procedure reduces the $O(\#\mathrm{nz})$ operations to about $O(\bar{l})$ per Newton iteration. This trick has been used in SCGIS (Goodman, 2002). Thus, the first Newton iteration of all methods discussed here costs $O(\bar{l})$. For GIS and SCGIS, if $\sum_{x,y}\tilde{P}(x)P_w(y|x)f_t(x,y)$ is stored at the first Newton iteration, then (12) can be evaluated in $O(1)$ at each subsequent iteration. For IIS, because $f^\#(x,y)$ of (12) depends on $x$ and $y$, we cannot store $\sum_{x,y}\tilde{P}(x)P_w(y|x)f_t(x,y)$ as in GIS and SCGIS. Hence each Newton direction needs $O(\bar{l})$. We summarize the cost for solving sub-problems in Table 1.

Table 1: Cost of finding a Newton direction for each method's sub-problem.

                            CD            GIS           SCGIS         IIS
First Newton direction      $O(\bar{l})$  $O(\bar{l})$  $O(\bar{l})$  $O(\bar{l})$
Each subsequent direction   $O(\bar{l})$  $O(1)$        $O(1)$        $O(\bar{l})$
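A sketch of the memory-for-time trick: keep $S_w(x,y)$ and $T_w(x)$ in memory and, when moving from $w$ to $w+z_te_t$, touch only the pairs with $f_t(x,y)\neq 0$. The precomputed per-feature index lists are an assumed data layout.

```python
import numpy as np

def apply_step(z_t, t, S, T, f, feat_rows, feat_cols):
    """In-place update of S_w and T_w to S_{w + z_t e_t} and T_{w + z_t e_t}.

    feat_rows[t], feat_cols[t]: indices (x, y) with f_t(x, y) != 0,
    i.e. about l-bar pairs on average, so the cost is O(l-bar), not O(#nz).
    """
    xs, ys = feat_rows[t], feat_cols[t]
    factor = np.exp(z_t * f[xs, ys, t])            # e^{z_t f_t(x, y)} on the nonzeros only
    np.add.at(T, xs, S[xs, ys] * (factor - 1.0))   # T_w(x) += sum_y S_w(x, y)(e^{z_t f_t(x, y)} - 1)
    S[xs, ys] *= factor                            # S_w(x, y) <- S_w(x, y) e^{z_t f_t(x, y)}
```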
3 Comparison and a New CD Method

3.1 Comparison of IS/CD methods
An IS or CD method falls into a place between two extreme designs: a loose bound that is easy to minimize versus a tight bound that is hard to minimize. There is a tradeoff between the tightness to bound the function difference and the hardness to solve the sub-problem. To see how IS and CD methods fit into this explanation, we obtain the following relationships among their approximate functions:
$$A_t^{\mathrm{CD}}(z_t) \le A_t^{\mathrm{SCGIS}}(z_t) \le A_t^{\mathrm{GIS}}(z_t), \qquad A_t^{\mathrm{CD}}(z_t) \le A_t^{\mathrm{IIS}}(z_t) \le A_t^{\mathrm{GIS}}(z_t) \quad \forall z_t. \qquad (14)$$
Trang 4give faster convergence by handling fewer
sub-problems, the total time may not be less due to
the higher cost of each sub-problem
3.2 A Fast CD Method

We propose a CD method which cheaply solves each sub-problem but still enjoys fast final convergence. This method is modified from the CD approach for linear SVM by Chang et al. (2008): we approximately minimize $A_t^{\mathrm{CD}}(z)$ by applying only one Newton iteration. The Newton direction at $z=0$ is
$$d = -{A_t^{\mathrm{CD}}}'(0)/{A_t^{\mathrm{CD}}}''(0). \qquad (15)$$
As taking the full Newton direction may not decrease the function value, we need a line search to find $\lambda$ such that $z=\lambda d$ satisfies the following sufficient decrease condition:
$$A_t^{\mathrm{CD}}(z) - A_t^{\mathrm{CD}}(0) = A_t^{\mathrm{CD}}(z) \le \gamma z\,{A_t^{\mathrm{CD}}}'(0), \qquad (16)$$
where $\gamma \in (0,1)$. We sequentially check $\lambda = 1, \beta, \beta^2, \ldots$, where $\beta \in (0,1)$. The line search procedure is guaranteed to stop (proof omitted). We can further prove that near the optimum two results hold. First, the Newton direction (15) satisfies the sufficient decrease condition (16) with $\lambda=1$. Then the cost for each sub-problem is $O(\bar{l})$, similar to that for exactly solving sub-problems of GIS or SCGIS. This result is important, as otherwise each line search step expensively evaluates $A_t^{\mathrm{CD}}(z)$. Second, taking one Newton direction of $A_t^{\mathrm{CD}}(z_t)$ reduces the objective sufficiently to retain the fast convergence of CD. The proposed method thus saves the effort of exactly solving sub-problems, while still maintains fast convergence.
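A minimal sketch of the proposed CD step: one Newton direction (15) followed by a backtracking line search for the sufficient decrease condition (16). The callable A_cd and the precomputed derivatives g, h are assumptions about how the surrounding code is organized.

```python
def fast_cd_step(A_cd, g, h, gamma=1e-3, beta=0.5, max_backtracks=30):
    """One-Newton-iteration CD step with backtracking line search.

    A_cd : callable, A_cd(z) = A_t^CD(z) = L(w + z e_t) - L(w)
    g, h : A_t^CD'(0) and A_t^CD''(0)
    """
    d = -g / h                           # Newton direction (15) at z = 0
    lam = 1.0
    for _ in range(max_backtracks):      # try lambda = 1, beta, beta^2, ...
        z = lam * d
        if A_cd(z) <= gamma * z * g:     # sufficient decrease condition (16)
            return z
        lam *= beta
    return 0.0                           # no acceptable step found (not expected near the optimum)
```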
4 Experiments

We consider two NLP data sets: part-of-speech tagging on the BROWN corpus (http://www.nltk.org) and chunking on the CoNLL2000 data (cnts.ua.ac.be/conll2000/chunking). We extend the Maxent implementation at http://maxent.sourceforge.net to include CD, GIS, SCGIS, and LBFGS for comparisons.
[Figure 2: First row: training time versus the relative function value difference (17); second row: training time versus testing accuracy (BROWN) and F1 (CoNLL2000). Time is in seconds. Panels: (a) BROWN, (b) CoNLL2000, (c) BROWN, (d) CoNLL2000. Curves: CD, SCGIS, GIS, LBFGS.]
We set $\gamma = 0.001$ in (16).
We begin by checking training time versus the relative difference of the function value to the optimum:
$$\big(L(w)-L(w^*)\big)/L(w^*). \qquad (17)$$
Results are in the first row of Figure 2. The second row of Figure 2 shows testing accuracy/F1 versus training time. Among the three IS/CD methods, the relative speed is consistent with the tightness of their approximate functions; see (14). LBFGS does not perform well in the beginning.
5 Conclusions

In summary, we create a general and unified framework for iterative scaling methods and connect them to coordinate descent. Based on this framework, we propose a fast CD method for Maxent, which is faster than existing IS methods in our experiments.
References

K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. 2008. Coordinate descent method for large-scale L2-loss linear SVM. JMLR, 9:1369-1398.

John N. Darroch and Douglas Ratcliff. 1972. Generalized iterative scaling for log-linear models. Ann. Math. Statist., 43(5):1470-1480.

Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. 1997. Inducing features of random fields. IEEE PAMI, 19(4):380-393.

Joshua Goodman. 2002. Sequential conditional generalized iterative scaling. In ACL, pages 9-16.

Robert Malouf. 2002. A comparison of algorithms for maximum entropy parameter estimation. In CONLL.