\begin{align*}
f(l) &= \sum_{i=1}^N a_{i,i} l_i^2 + \sum_{i=1}^N \sum_{j\neq i} a_{i,j} l_i l_j
      = \sum_{i=1}^N a_{i,i} l_i^2 + \sum_{i=1}^2 l_i \sum_{j\neq i} a_{i,j} l_j + \sum_{i=3}^N l_i \sum_{j\neq i} a_{i,j} l_j \\
&\geq \sum_{i=1}^N a_{i,i} l_i^2 + \sum_{j\neq 1} a_{1,j} l_j + \sum_{i=3}^N l_i\,(-a_{i,i} l_i)
      = \sum_{i=1}^N a_{i,i} l_i^2 + \sum_{j\neq 1} a_{1,j} l_j - \sum_{i=3}^N a_{i,i} l_i^2 \\
&= \sum_{i=1}^2 a_{i,i} l_i^2 + \sum_{j\neq 1} a_{1,j} l_j
 = a_{1,1} + \sum_{j\neq 1} a_{1,j} l_j
 = -\sum_{j\neq 1} a_{1,j} + \sum_{j\neq 1} a_{1,j} l_j \\
&= \sum_{j\neq 1} a_{1,j}\,(l_j - 1) \;\geq\; a_{1,2}\,(l_2 - 1) = -a_{1,2}\,.
\end{align*}
Since our choice of $l_1 = 1$ and $l_2 = 0$ was arbitrary, we have that $f(l) \geq \min_{i\neq j}\{-a_{i,j}\} > 0$. Thus, $-l^\top \nabla^2\Phi(L)\, l > 0$, completing the proof.
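As a numerical illustration of this argument (not part of the formal proof), the following Python sketch generates a random symmetric matrix with positive off-diagonal entries and zero row sums, standing in for $\nabla^2\Phi(L)$, and checks the resulting bound over all vectors $l \in \{0,1\}^N$ that are neither all zeros nor all ones; the matrix size and entry range are arbitrary choices.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N = 5

# Random symmetric matrix with positive off-diagonal entries and zero row sums,
# mimicking the assumed structure of the Hessian of Phi.
H = rng.uniform(0.1, 1.0, size=(N, N))
H = (H + H.T) / 2
np.fill_diagonal(H, 0.0)
np.fill_diagonal(H, -H.sum(axis=1))  # each row (and column) now sums to zero

min_offdiag = min(H[i, j] for i in range(N) for j in range(N) if i != j)

# f(l) = -l^T H l should be at least the smallest off-diagonal entry of H
# for every 0/1 vector l with at least one 1 and at least one 0.
for bits in itertools.product([0, 1], repeat=N):
    l = np.array(bits, dtype=float)
    if 0 < l.sum() < N:
        assert -l @ H @ l >= min_offdiag - 1e-12
```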
We can now apply Theorem 3.8 to the best expert setting.

Theorem 3.13. If $\frac{\partial^2\Phi}{\partial L_i \partial L_j} > 0$ for every $i \neq j$ and every $L$ s.t. $\delta(L) \leq a$, then

(i) For every $\eta > 0$, it holds that $\max_t\{R_{A,t}\} \geq \min\left\{\frac{\rho_1(a)}{\eta}, \frac{\eta}{2}\rho_2(a)\,q\right\}$, and for $\eta = \sqrt{\frac{2\rho_1(a)}{\rho_2(a)\,q}}$, $\max_t\{R_{A,t}\} \geq \sqrt{\rho_1(a)\rho_2(a)/2}\cdot\sqrt{q}$.

(ii) If for any sequence with relative quadratic variation $q_T \leq q'$ we have $R_{A,T} \leq c\sqrt{q'}$, then for any such sequence, $\max_t\{R_{A,t}\} \geq \frac{\rho_1(a)\rho_2(a)}{2c}\cdot\frac{q_T}{\sqrt{q'}}$. In particular, if $q_T = \Theta(q')$, then $\max_t\{R_{A,t}\} = \Omega(\sqrt{q_T})$.
3.4 Application to Specific Regret Minimization Algorithms
3.4.1 Online Gradient Descent with Linear Costs
In this subsection, we deal with the Lazy Projection variant of the OGD algorithm ([92]) with a fixed learning rate $\eta$ and linear costs. In this setting, for each $t$, OGD selects a weight vector according to the rule $x_{t+1} = \arg\min_{x\in K}\{\|x + \eta L_t\|_2\}$, where $K \subset \mathbb{R}^N$ is compact and convex. As observed in [49] and [47], this algorithm is equivalent to $RFTL(\eta, \mathcal{R})$, where $\mathcal{R}(x) = \frac{1}{2}\|x\|_2^2$, namely, setting $x_{t+1} = \arg\min_{x\in K}\{x\cdot L_t + (1/2\eta)\|x\|_2^2\}$. In what follows we will make the assumption that $K \supseteq B(0, a)$, where $B(0, a)$ is the closed ball with radius $a$ centered at $0$, for some $a > 0$.
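For concreteness, a minimal sketch of this update for the special case $K = B(0, a)$ follows; the function name, the choice of $K$, and the toy loss sequence are illustrative assumptions rather than part of the text.

```python
import numpy as np

def lazy_ogd_ball(loss_vectors, eta, a):
    """Lazy Projection OGD on K = B(0, a): x_{t+1} = argmin_{x in K} ||x + eta * L_t||_2,
    i.e. the Euclidean projection of -eta * L_t onto the ball of radius a."""
    L = np.zeros(len(loss_vectors[0]))   # cumulative loss vector L_t
    x = np.zeros_like(L)                 # x_1 = 0
    points = [x.copy()]
    for cost in loss_vectors:
        L += cost
        y = -eta * L                     # unconstrained minimizer
        norm = np.linalg.norm(y)
        x = y if norm <= a else (a / norm) * y   # project onto B(0, a)
        points.append(x.copy())
    return points

# Illustrative run with two-dimensional linear costs.
print(lazy_ogd_ball([np.array([1.0, 0.0]), np.array([0.0, 1.0])], eta=0.5, a=1.0))
```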
Note that solving the above minimization problem without the restriction $x \in K$ yields $x'_{t+1} = -\eta L_t$. However, if $\|L_t\|_2 \leq a/\eta$, then $x'_{t+1} \in K$, and then, in fact, $x_{t+1} = -\eta L_t$. By Theorem 3.4,
\[
\Phi_\eta(L_t) = x_{t+1}\cdot L_t + (1/\eta)\mathcal{R}(x_{t+1}) = -\eta L_t\cdot L_t + (1/2\eta)\|{-\eta L_t}\|_2^2 = -(\eta/2)\|L_t\|_2^2\,.
\]
Thus, if $\|L\|_2 \leq a$, then $\Phi(L) = -(1/2)\|L\|_2^2$ and also $\nabla^2\Phi(L) = -I$, where $I$ is the identity matrix. By Lemma 3.9,
\[
\rho_1(a) = \min_{\|L\|_2 = a}\Big\{-\tfrac{1}{2}\|L\|_2^2 - \min_{u\in K}\{u\cdot L\}\Big\} \geq \min_{\|L\|_2 = a}\Big\{-\tfrac{1}{2}\|L\|_2^2 - (-L)\cdot L\Big\} = \tfrac{1}{2}a^2\,,
\]
where we used the fact that $-L \in K$ if $\|L\|_2 = a$. In addition, by Lemma 3.9,
\[
\rho_2(a) = \min_{\|L\|_2 \leq a,\, \|l\|_2 = 1}\{-l^\top(-I)\,l\} = 1\,.
\]
By Theorem 3.10, we have that $\max_t\{R_{A,t}\} \geq \min\{a^2/(2\eta), (\eta/2)Q\}$, and for $\eta = \frac{a}{\sqrt{Q}}$, $\max_t\{R_{A,t}\} \geq \frac{a}{2}\sqrt{Q}$.
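As a quick arithmetic check (with illustrative values of $a$ and $Q$), the choice $\eta = a/\sqrt{Q}$ indeed balances the two terms of the bound at $(a/2)\sqrt{Q}$:

```python
import numpy as np

a, Q = 1.5, 40.0                     # illustrative values
eta = a / np.sqrt(Q)
term1, term2 = a**2 / (2 * eta), (eta / 2) * Q
assert np.isclose(term1, term2) and np.isclose(term1, (a / 2) * np.sqrt(Q))
```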
3.4.2 The Hedge Algorithm
Hedge is the most notable regret minimization algorithm for the expert setting. We have $K = \Delta_N$, and the algorithm gives a weight $p_{i,t+1} = \frac{p_{i,0}\,e^{-\eta L_{i,t}}}{\sum_{j=1}^N p_{j,0}\,e^{-\eta L_{j,t}}}$ to expert $i$ at time $t+1$, where the initial weights $p_{i,0}$ and the learning rate $\eta$ are parameters, and $\sum_{i=1}^N p_{i,0} = 1$.
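A minimal sketch of this update rule is given below; the loss sequence and parameter values are illustrative.

```python
import numpy as np

def hedge_weights(L_t, eta, p0):
    """Hedge weights p_{i,t+1} proportional to p_{i,0} * exp(-eta * L_{i,t})."""
    w = p0 * np.exp(-eta * L_t)
    return w / w.sum()

# Illustrative run with two experts and a uniform prior.
eta, p0, L = 0.5, np.array([0.5, 0.5]), np.zeros(2)
for cost in [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 0.0])]:
    p = hedge_weights(L, eta, p0)    # weights used in the current round
    L += cost                        # accumulate losses
print(hedge_weights(L, eta, p0))
```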
It is easy to see that for the potential $\Phi_\eta(L) = -(1/\eta)\ln\big(\sum_{i=1}^N p_{i,0}\,e^{-\eta L_i}\big)$, we have that $p = (p_1, \ldots, p_N) = \nabla\Phi_\eta(L)$. The Hessian $\nabla^2\Phi_\eta$ has the following simple form:
Lemma 3.14. Let $L \in \mathbb{R}^N$ and denote $p = \nabla\Phi_\eta(L)$. Then $\nabla^2\Phi_\eta(L) = \eta\cdot(pp^\top - \mathrm{diag}(p)) \preceq 0$, where $\mathrm{diag}(p)$ is the diagonal matrix with $p$ as its diagonal.
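The gradient identity and the Hessian formula of Lemma 3.14 can be spot-checked numerically by finite differences; the sketch below does so for arbitrary illustrative values of $N$, $\eta$, and $L$ (a sanity check only, not a substitute for the proof).

```python
import numpy as np

def hedge_potential(L, eta, p0):
    return -np.log(np.dot(p0, np.exp(-eta * L))) / eta

rng = np.random.default_rng(1)
N, eta = 4, 0.7
p0 = np.full(N, 1.0 / N)
L = rng.uniform(0.0, 2.0, size=N)

w = p0 * np.exp(-eta * L)
p = w / w.sum()                      # Hedge weights

h = 1e-4
# Central-difference gradient of Phi_eta should equal p.
grad = np.array([
    (hedge_potential(L + h * np.eye(N)[i], eta, p0)
     - hedge_potential(L - h * np.eye(N)[i], eta, p0)) / (2 * h)
    for i in range(N)
])
assert np.allclose(grad, p, atol=1e-7)

# Central-difference Hessian of Phi_eta should equal eta * (p p^T - diag(p)).
hess = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        def shifted(si, sj):
            Ls = L.copy()
            Ls[i] += si * h
            Ls[j] += sj * h
            return hedge_potential(Ls, eta, p0)
        hess[i, j] = (shifted(1, 1) - shifted(1, -1)
                      - shifted(-1, 1) + shifted(-1, -1)) / (4 * h * h)
assert np.allclose(hess, eta * (np.outer(p, p) - np.diag(p)), atol=1e-5)
```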
The proof is given in Section 3.5. We will assume $p_{1,0} = \ldots = p_{N,0} = 1/N$, and write $Hed(\eta)$ for Hedge with parameters $\eta$ and the uniform distribution. Thus, $\Phi(L) = -\ln\big((1/N)\sum_{i=1}^N e^{-L_i}\big)$, and we have by Lemma 3.14 that $\frac{\partial^2\Phi}{\partial L_i\partial L_j} > 0$ for every $i \neq j$. Therefore, by Lemma 3.12, $\rho_1, \rho_2 > 0$. We now need to calculate $\rho_1$ and $\rho_2$. This is straightforward in the case of $\rho_1$, but for $\rho_2$ we give the value only for $N = 2$.
Lemma 3.15. For any $N \geq 2$, $\rho_1(a) = \ln\frac{N}{N-1+e^{-a}}$. For $N = 2$, it holds that $\rho_2(a) = \big(e^{a/2} + e^{-a/2}\big)^{-2}$.

Proof. By Lemma 3.12,
\begin{align*}
\rho_1(a) &= \min_{L\in N(a)}\{\Phi(L)\} - \Phi(0) = \min_{L\in N(a)}\{\Phi(L)\} = -\ln\max_{L\in N(a)}\left\{\frac{1}{N}\sum_{i=1}^N e^{-L_i}\right\} \\
&= -\ln\left(\frac{1}{N}\big((N-1)e^{-0} + e^{-a}\big)\right) = \ln\frac{N}{N-1+e^{-a}}\,.
\end{align*}
In addition, $\rho_2(a) = \min_{L\in[0,a]^N,\, l\in N(1)}\{-l^\top\nabla^2\Phi(L)\,l\} > 0$. For $N = 2$, we have $N(1) = \{(1,0),(0,1)\}$. Therefore, $-l^\top\nabla^2\Phi(L)\,l$ is either $-\nabla^2\Phi(L)_{1,1}$ or $-\nabla^2\Phi(L)_{2,2}$. Denoting $p = p(L) = \nabla\Phi(L)$, we have that $-\nabla^2\Phi(L)_{1,1} = p_1 - p_1^2$ and $-\nabla^2\Phi(L)_{2,2} = p_2 - p_2^2$ by Lemma 3.14. Since $p_1 = 1 - p_2$, we have $p_1 - p_1^2 = p_2 - p_2^2$. Now,
\[
p_1(1-p_1) = \frac{e^{-L_1}}{e^{-L_1}+e^{-L_2}}\cdot\frac{e^{-L_2}}{e^{-L_1}+e^{-L_2}} = e^{-(L_1+L_2)}\big(e^{-L_1}+e^{-L_2}\big)^{-2} = \big(e^{(L_2-L_1)/2}+e^{(L_1-L_2)/2}\big)^{-2}\,.
\]
The function $(e^x + e^{-x})^{-2}$ is decreasing for $x \geq 0$, so the smallest value is attained for $|L_1 - L_2| = a$. Thus, for $N = 2$, $\rho_2(a) = (e^{a/2} + e^{-a/2})^{-2}$.
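A quick grid check (for the value $a = 1.2$ used next; the grid resolution is arbitrary) confirms that the minimum of $p_1(1-p_1)$ over $L \in [0,a]^2$ matches this closed form:

```python
import numpy as np

a = 1.2
grid = np.linspace(0.0, a, 241)
L1, L2 = np.meshgrid(grid, grid)
p1 = np.exp(-L1) / (np.exp(-L1) + np.exp(-L2))
closed_form = (np.exp(a / 2) + np.exp(-a / 2)) ** (-2)
assert np.isclose((p1 * (1 - p1)).min(), closed_form)
```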
Picking $a = 1.2$, we have by Theorem 3.13 that

Theorem 3.16. For $N = 2$, there exists $\eta$ s.t.
\[
\max_t\{R_{Hed(\eta),t}\} \geq \sqrt{\rho_1(a)\rho_2(a)\,q/2} \geq 0.195\sqrt{q}\,.
\]
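The constant can be reproduced directly from Lemma 3.15 (a numerical check, with $a = 1.2$ as above):

```python
import numpy as np

a = 1.2
rho1 = np.log(2.0 / (1.0 + np.exp(-a)))          # rho_1(a) for N = 2 (Lemma 3.15)
rho2 = (np.exp(a / 2) + np.exp(-a / 2)) ** (-2)  # rho_2(a) for N = 2 (Lemma 3.15)
print(np.sqrt(rho1 * rho2 / 2))                  # about 0.1956 >= 0.195
```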
For a general $N$ we may still lower bound $\max_t\{R_{Hed(\eta),t}\}$ by providing a lower bound for $\rho_2(a)$. We use the fact that the term $-l^\top\nabla^2\Phi(L)\,l$, which is minimized in the definition of $\rho_2(a)$, may be interpreted as the variance of a certain discrete and bounded random variable. Thus, the key element that will be used is a general lower bound on the variance of such variables. This bound, possibly of separate interest, is given in the following lemma, whose proof can be found in Section 3.5.
Lemma 3.17. Let $x_1 < \ldots < x_N \in \mathbb{R}$, $0 < p_1, \ldots, p_N < 1$, and $\sum_{k=1}^N p_k = 1$, and let $X$ be a random variable that obtains the value $x_k$ with probability $p_k$, for every $1 \leq k \leq N$. Then for every $1 \leq i < j \leq N$, $Var(X) \geq \frac{p_i p_j}{p_i + p_j}\cdot(x_i - x_j)^2$, and equality is attained iff $N = 2$, or $N = 3$ with $i = 1$, $j = 3$, and $x_2 = \frac{p_1 x_1 + p_3 x_3}{p_1 + p_3}$. In particular, $Var(X) \geq \frac{p_1 p_N}{p_1 + p_N}\cdot(x_N - x_1)^2 \geq \frac{1}{2}\min_k\{p_k\}(x_N - x_1)^2$.
We comment that for the special case $p_1 = \ldots = p_N = 1/N$, Lemma 3.17 yields the inequality $Var(X) \geq (x_N - x_1)^2/(2N)$, sometimes referred to as the Szőkefalvi Nagy inequality [88].
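The bound of Lemma 3.17 is easy to probe empirically; the following sketch samples random finite distributions and verifies both the pairwise bound and its weakened forms (a sanity check only).

```python
import numpy as np

rng = np.random.default_rng(2)
for _ in range(1000):
    N = int(rng.integers(2, 7))
    x = np.sort(rng.normal(size=N))          # distinct values x_1 < ... < x_N (a.s.)
    p = rng.dirichlet(np.ones(N))            # strictly positive probabilities
    var = np.dot(p, x**2) - np.dot(p, x)**2
    for i in range(N):
        for j in range(i + 1, N):
            assert var >= p[i] * p[j] / (p[i] + p[j]) * (x[i] - x[j])**2 - 1e-12
    assert var >= 0.5 * p.min() * (x[-1] - x[0])**2 - 1e-12
```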
We now proceed to lower bound $\rho_2(a)$ for any $N$.
Lemma 3.18. For any $N \geq 2$, it holds that $\rho_2(a) \geq \frac{1}{2(N-1)e^a + 2}$.
Proof. Recall that $\rho_2(a) = \min_{L\in[0,a]^N,\, l\in N(1)}\{-l^\top\nabla^2\Phi(L)\,l\}$. By Lemma 3.14, $\nabla^2\Phi(L) = pp^\top - \mathrm{diag}(p)$, where $p = p(L) = \nabla\Phi(L)$, and thus,
\[
-l^\top\nabla^2\Phi(L)\,l = l^\top\mathrm{diag}(p)\,l - l^\top pp^\top l = \sum_{i=1}^N p_i l_i^2 - \left(\sum_{i=1}^N p_i l_i\right)^2\,.
\]
Assume for now that $l_1, \ldots, l_N$ are distinct. Then $-l^\top\nabla^2\Phi(L)\,l = Var(X)$, where $X$ is a random variable obtaining the value $l_i$ with probability $p_i$, $i = 1, \ldots, N$. By Lemma 3.17, $Var(X) \geq \frac{1}{2}\min_k\{p_k\}(\max_i\{l_i\} - \min_i\{l_i\})^2$. For $l \in N(1)$ it holds that $\max_i\{l_i\} = 1$ and $\min_i\{l_i\} = 0$. In addition, for $L \in [0,a]^N$, the smallest value for $\min_k\{p_k\}$ is obtained when one entry of $L$ is $a$ and the rest are $0$, yielding that $\min_k\{p_k\} \geq \frac{e^{-a}}{N-1+e^{-a}}$. Together we have that $-l^\top\nabla^2\Phi(L)\,l \geq \frac{1}{2}\cdot\frac{e^{-a}}{N-1+e^{-a}} = \frac{1}{2(N-1)e^a + 2}$. Since the expression $-l^\top\nabla^2\Phi(L)\,l$ is continuous in $l$, this inequality holds even if the assumption that the entries of $l$ are distinct numbers is dropped. The result follows.
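The inequality of Lemma 3.18 can be probed numerically by sampling $L \in [0,a]^N$ and enumerating $l \in N(1)$; the values of $N$ and $a$ below are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
N, a = 4, 1.0
bound = 1.0 / (2 * (N - 1) * np.exp(a) + 2)

binary = [np.array(b, dtype=float) for b in itertools.product([0, 1], repeat=N)
          if 0 < sum(b) < N]                         # the set N(1)
for _ in range(2000):
    L = rng.uniform(0.0, a, size=N)
    p = np.exp(-L) / np.exp(-L).sum()                # grad Phi(L), uniform prior
    H = np.outer(p, p) - np.diag(p)                  # Hessian of Phi (Lemma 3.14)
    for l in binary:
        assert -l @ H @ l >= bound - 1e-12
```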
Theorem 3.19. For any $0 < \alpha < 1/N$ and $\eta > 0$, it holds that $\max_t\{R_{Hed(\eta),t}\} \geq \min\left\{\frac{1}{\eta}\ln\big(\frac{N(1-\alpha)}{N-1}\big), \frac{1}{4}\eta\alpha q\right\}$, and for $\alpha = 1/(2N)$ and $\eta = \sqrt{(8N/q)\ln\frac{2N-1}{2N-2}}$,
\[
\max_t\{R_{Hed(\eta),t}\} \geq \frac{\sqrt{q}}{4N}\cdot\sqrt{\frac{2N}{2N-1}} = \Omega(\sqrt{q}/N)\,.
\]
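Before turning to the proof, a small numerical check (with an illustrative value of $q$) confirms that, for $\alpha = 1/(2N)$ and the stated $\eta$, both terms of the general bound are at least the stated closed form:

```python
import numpy as np

q = 100.0                                            # illustrative value
for N in range(2, 51):
    alpha = 1.0 / (2 * N)
    eta = np.sqrt((8 * N / q) * np.log((2 * N - 1) / (2 * N - 2)))
    term1 = np.log(N * (1 - alpha) / (N - 1)) / eta
    term2 = 0.25 * eta * alpha * q
    stated = (np.sqrt(q) / (4 * N)) * np.sqrt(2 * N / (2 * N - 1))
    assert min(term1, term2) >= stated - 1e-12
```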
Proof. By Theorem 3.13, for every $\eta > 0$, it holds that $\max_t\{R_{Hed(\eta),t}\} \geq \min\{\frac{\rho_1(a)}{\eta}, \frac{\eta}{2}\rho_2(a)q\}$. We fix $a$, observing that any $a \in (0,\infty)$ may be used, and denote $\alpha = \frac{1}{(N-1)e^a + 1}$. Note that $\alpha$ may thus obtain any value in $(0, 1/N)$. By Lemma 3.18, we have that $\rho_2(a) \geq \alpha/2$. By Lemma 3.15, $\rho_1(a) = \ln\frac{N}{N-1+e^{-a}}$, and since
\[
\frac{N(1-\alpha)}{N-1} = \frac{N}{N-1}\cdot\frac{(N-1)e^a}{(N-1)e^a + 1} = \frac{N e^a}{(N-1)e^a + 1} = \frac{N}{N-1+e^{-a}}\,,
\]
we have that $\rho_1(a) = \ln\frac{N(1-\alpha)}{N-1}$. Thus, $\max_t\{R_{Hed(\eta),t}\} \geq \min\{\frac{1}{\eta}\ln\frac{N(1-\alpha)}{N-1}, \frac{1}{4}\eta\alpha q\}$. We may maximize the r.h.s. by picking $\eta = \sqrt{\frac{4}{\alpha q}\ln\frac{N(1-\alpha)}{N-1}}$, yielding $\max_t\{R_{Hed(\eta),t}\} \geq$