The Raymond and Beverly Sackler Faculty of Exact Sciences
The Blavatnik School of Computer Science

Machine Learning Algorithms with Applications in Finance

Thesis submitted for the degree of Doctor of Philosophy
Arbitrage-Free Pricing
The Arbitrage-Free Assumption
From arbitrage reasoning, the existence of such strategies ties the option price to the outcomes of trading plans. An option guarantees a future payoff at a given price. If a trading strategy is set up once, trades thereafter, and always ends up with more money than the option's payoff, then the option cannot cost more than the strategy's initial setup cost; this establishes an upper bound. Similarly, a strategy that always ends with less money than the option's payoff provides a lower bound on the option price. The option price must therefore lie between the bounds dictated by feasible strategies and the option's payoff.
The intuitive idea is that if one asset is consistently worth more than another asset in the future, it should be priced higher today as well. If that were not the case, traders could buy the cheaper asset now, sell the more expensive one, wait for the price order to reverse, and lock in a riskless profit. Such an arbitrage opportunity cannot persist, because traders rush to buy the asset that is currently cheaper and sell the one that is more expensive. This process drives prices toward parity and underpins the arbitrage-free assumption: the belief that obvious arbitrage opportunities do not exist in efficient markets.
Regret Minimization
Arbitrage-free pricing requires a trading strategy whose realized returns always exceed the payments owed by the option writer. A trivial example is to buy dollars immediately, but this approach performs poorly if the dollar depreciates against the euro. The writer would improve her position if she could know the exchange rate at expiration: if the dollar is expected to appreciate, she should hold dollars; if the euro is expected to appreciate, she should hold euros. Lacking perfect foresight, she would seek a strategy that avoids large underperformance relative to the better of those two choices. In other words, she can look for strategies that minimize regret, which can be developed within the framework of online learning theory.
Online Learning
Specific Settings of Online Learning
Properties of an online learning game depend on the precise details of its decision sets, the chosen loss function, the exact nature of information revealed to the learner in each round, and any additional restrictions. Below are the key settings and modes that influence how the game operates.
The best expert setting is perhaps the most widely researched type of online game.
In this online learning framework, the adversary selects a bounded real-valued vector, and the learner chooses a probability distribution over the same components. The learner's loss is the dot product of these two vectors, tying the adversary's costs to the learner's randomized action. The adversary's vector encodes the cost of following the advice of each of several experts, while the learner's probability vector represents a random selection of a single expert. The term "experts" encompasses a broad set of decision rules: distinct heuristics, routes to work, or even choices about which advertisements to place on a website.
Online convex optimization is a strict generalization of the best expert setting. In this framework, the learner selects a point from a fixed convex and compact subset of Euclidean space, and the adversary then picks a convex real-valued function defined on that set. The learner's loss is the value of the adversary's function evaluated at the learner's chosen point.
Across these two settings, a key distinction is the feedback the learner receives after each round. In a full-information scenario, such as following the forecasts of weather experts, the learner has access to the losses of all experts. In contrast, when the task involves ad placement or choosing from a pool of slot machines, the learner observes only the loss of the single choice it made. This limited feedback mode is known as the multi-armed bandit, or simply the bandit, setting.
In online learning, crucial distinctions also arise from the sequence of choices made by the adversary. The amount of fluctuation in this sequence, which can be measured in various ways, affects the regret bounds the learner can achieve; a low level of variability helps the learner track the adversary's moves, making learning more efficient. Redundancy in the expert set can also facilitate the learner's performance: if there are only a small number of high-quality experts, or if many experts are near-duplicates of others, better regret bounds may be achieved. These aspects and other variants of the online learning theme are explored in depth throughout this work, particularly in Chapters 3 and 4.
Competitive Analysis and Pricing
Recall the company owner who pulled through a euro devaluation using options on the exchange rate. The tide has turned, and an unexpected appreciation in the euro-dollar rate means that he has a surplus in euros. He intends to change these euros for dollars and invest in expanding his business. The exchange rate fluctuates continually, and he would naturally seek to make the change at the highest possible rate. This is the problem of search for the maximum, or one-way trading.
Within the next month, the owner decides to implement a currency strategy. While he cannot predict how exchange rates are generated, he assumes the rate will stay within a reasonable band, neither rising above four times the current level nor falling below a quarter of it. The plan is to convert three quarters of the euros immediately, and to convert the remaining quarter only if the rate exceeds twice the current rate; if that never occurs, he will simply sell the remaining euros at the end of the month.
Assume the current rate is 1. If the rate never reaches 2, the worst case for the owner is a path on which the rate climbs arbitrarily close to 2 and then collapses to 1/4. On such a path, the owner could have earned (nearly) 2 with the benefit of hindsight, but ends up with (3/4) × 1 + (1/4) × (1/4) = 13/16, a ratio of 32/13. If the rate does exceed 2 at some point, the worst case is that it then climbs to 4, so the best rate in hindsight is 4.
Under the owner's strategy, the realized sum in that case is (3/4) × 1 + (1/4) × 2 = 5/4, producing a ratio of 16/5, which is higher than in the first case. If the owner instead converts all euros immediately and the exchange rate jumps to 4, the ratio between the offline optimum and the realized return is 4, exceeding 16/5, the worst-case ratio of the two-step strategy considered here. The owner's strategy therefore improves on simple immediate conversion in terms of the competitive ratio, which bounds the gap between the best possible hindsight outcome and the online outcome [85].
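To make the arithmetic easy to check, the following is a minimal sketch (in Python, not part of the original text) that recomputes the worst-case ratios of the owner's plan and of immediate conversion; the rates and conversion fractions are exactly those of the example above.

```python
# A quick check of the competitive ratios in the example above.
# Plan: convert 3/4 of the euros now (rate 1); convert the remaining 1/4
# once the rate exceeds 2, or at the end of the month otherwise.

# Case 1: the rate never reaches 2 -- it climbs toward 2, then collapses to 1/4.
hindsight_1 = 2.0                      # best rate (approached) in hindsight
online_1 = 0.75 * 1.0 + 0.25 * 0.25    # 3/4 converted at 1, 1/4 at the final rate 1/4
ratio_1 = hindsight_1 / online_1       # = 2 / (13/16) = 32/13

# Case 2: the rate reaches the trigger level 2 and later climbs to 4.
hindsight_2 = 4.0
online_2 = 0.75 * 1.0 + 0.25 * 2.0     # 3/4 converted at 1, 1/4 at (roughly) 2
ratio_2 = hindsight_2 / online_2       # = 4 / (5/4) = 16/5

# Converting everything immediately faces a worst-case ratio of 4.
immediate_ratio = 4.0

print(ratio_1, ratio_2, immediate_ratio)       # 2.4615..., 3.2, 4.0
assert max(ratio_1, ratio_2) < immediate_ratio
```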
Besides one-way trading, the company owner can buy insurance against a rise in the euro's value over the next month by using a European fixed-strike lookback call option. This security gives the holder the right to receive the difference between the highest asset price observed during the option's life and a pre-set strike price. If the maximum price surpasses the strike at any point from issuance to expiration, the option pays out the difference. The key distinction from a standard European call option is that the payoff depends on the maximum asset price achieved, not just the final price at expiration.
An option writer must protect herself against every contingency by calculating the cash reserve needed to cover her obligation, which in turn informs the price she charges for writing the option. By employing a one-way trading strategy, she can secure a minimum ratio between the exchange rate she ultimately receives and the highest rate she might owe to the option holder, effectively hedging the currency risk. Providing enough cash for the one-way trade then guarantees coverage under the stated assumptions.
Whether a payment occurs at all hinges on the strike threshold: a very high threshold (for example, one hundred times the current rate) makes any payment virtually impossible, while a zero threshold guarantees that a payment will occur. Chapter 7 shows that the option writer can guard against both rate movements and threshold crossings by combining one-way trading with a regret-minimization algorithm, yielding a practical framework for managing rate risk when writing such options.
An Overview of Related Literature
Derivative Pricing in the Finance Literature
Derivative pricing has a long-established literature, with the Black-Scholes and Merton work on pricing the European call option providing the core insight: in an arbitrage-free market where stock prices follow a geometric Brownian motion, the option's payoff can be replicated exactly by dynamically trading the underlying stock, which in turn determines the option's price. The Black-Scholes-Merton framework has since become the foundational tool for valuing a wide range of derivatives. Of particular relevance here are the fixed-strike lookback option and the result that call option prices determine the value of any stock derivative whose payoff depends only on the terminal stock price.
The assumptions of the Black-Scholes-Merton model have long clashed with empirical evidence. For example, actual price paths exhibit discrete jumps rather than the smooth, continuous dynamics the model presumes, prompting the development of jump-diffusion frameworks such as the model discussed in [69]. Further empirical findings show that asset prices do not follow a lognormal distribution, as implied by the geometric Brownian motion underpinning the model [67]. These inconsistencies highlight the need for alternative pricing approaches that better capture observed market behavior, including abrupt price changes and heavier tails.
The volatility smile describes the phenomenon where call options with different strike prices on the same asset imply distinct estimates of the volatility parameter of the Brownian motion model that underpins asset prices. This discrepancy has spurred extensive research into replacing Brownian motion with more flexible Lévy processes (see [29], and [79] for a coverage of Lévy processes and their uses in finance, and pricing in particular).

While the predominant approach to derivative pricing in the financial community remains that of stochastic modeling, there are some results on robust, model-independent pricing. We mention here the works by Hobson et al., who priced various derivatives in terms of the market prices of call options with various strikes (see [54] for a review of results). These works assume an arbitrage-free market, but otherwise make only minimal, non-stochastic assumptions. Given a derivative, they devise a strategy that involves trading in call options and always has a payoff superior to that of the derivative; thus, the cost of initiating the strategy is an upper bound on the derivative's price.
Fixed-strike lookback options with zero strike constitute a special case in which the strategy reduces to one-way trading in call options, and under the assumed model the resulting price bound is tight.
Regret Minimization
Regret minimization research is primarily a creation of the last two or three decades, but its roots can be traced to works of the 1950s that were motivated by problems in game theory.
Hannan [45] gave the first no-regret algorithm in the context of a repeated game, where a player wishes to approximate the utility of the best action with hindsight.
By adding a random perturbation to the sum of past utilities for each action and selecting the action with the highest perturbed utility, the strategy known as Follow the Perturbed Leader is formed. The per-round regret of this approach tends to zero as the number of game rounds increases, regardless of how the other players behave.
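As an illustration, the following minimal sketch implements one standard variant of Follow the Perturbed Leader in the loss (rather than utility) formulation, with an additive uniform perturbation; the perturbation scale and the random loss stream are illustrative assumptions, not details of Hannan's original construction.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 5, 1000
scale = 20.0                  # perturbation magnitude; illustrative, not tuned
cum_losses = np.zeros(N)
total_loss = 0.0

for t in range(T):
    # Follow the Perturbed Leader: perturb the cumulative losses and follow
    # the action that looks best (smallest perturbed loss) so far.
    perturbation = rng.uniform(0.0, scale, size=N)
    action = int(np.argmin(cum_losses + perturbation))
    losses = rng.uniform(0.0, 1.0, size=N)   # stand-in adversary for the demo
    total_loss += losses[action]
    cum_losses += losses

print("regret:", total_loss - cum_losses.min())
```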
Blackwell extended this theory to vector-valued payoffs, generalizing the two-player zero-sum repeated game to the case where the utility (or loss) matrix contains vector entries. In the one-dimensional case, von Neumann's minimax theorem guarantees each player a strategy whose average payoff approaches the set of values at least as good as the game value. Blackwell's approachability theorem characterizes the approachable target sets in higher dimensions: a convex, closed set is approachable precisely when every half-space containing it is approachable. The proof is constructive and yields an algorithm with a convergence rate that scales as the inverse square root of the number of rounds. This framework also gives rise to a regret-minimization algorithm for repeated games, consistent with the problems studied by Hannan.
The question whether Blackwell's results stand apart from later work on regret minimization or are subsumed by it was answered only recently, by work establishing an equivalence between the approachability theorem and no-regret learning for a subclass of online convex optimization problems. Specifically, any algorithm designed for a problem in Blackwell's setting can be efficiently converted into an algorithm for an online convex optimization problem with linear loss functions, and the reverse conversion is also possible.
The best experts setting. The most well-known algorithm for this setting is the Hedge or Randomized Weighted Majority algorithm, which was introduced by several authors [38, 65, 89]. This algorithm gives each expert a weight that decreases exponentially with its cumulative loss, and then normalizes the weights to obtain probability values. The rate of this exponential decrease may be controlled by scaling the cumulative loss with a numeric parameter, called the learning rate. This weighting scheme may be implemented by applying on each round a multiplicative update that decreases exponentially with the last single-period loss. This algorithm achieves a regret bound of O(√(T ln N)) for any loss sequence, where T is the horizon, or length of the game, N is the number of experts, and the learning rate is chosen as an appropriate function of both. The bound is optimal, since any online learner incurs an expected regret of the same order against a completely random stream of Bernoulli losses.
The result remains valid whether or not the horizon is known to the learner: a bound on the horizon can be guessed, and if that guess fails, the learner doubles the bound and restarts the algorithm, adjusting the learning rate in the process. This doubling trick, described in [24, 90], changes the regret bound only by a multiplicative constant.
An alternative approach for handling the case of an unknown horizon was given in
[9], where a time-dependent learning rate was used, yielding a better constant in the bound.
Although the classic bound is optimal, it ignores the actual structure of the loss sequences, prompting later work to derive tighter regret bounds under refined scenarios. In particular, prior research examined the case where the best expert accumulates only a small total loss, showing that by tuning the learning rate one can obtain an improved bound in which the horizon is effectively replaced by the cumulative loss of the best expert (up to an additive logarithmic term).
This result shares two features with the horizon-based bound. First, a bound on the cumulative loss of the best expert may be guessed using a doubling trick, rather than be known in advance. Second, a matching lower bound may be obtained using a random adversary (a trivial modification of the previous one).
Subsequent work replaced the dependence on the horizon with measures of loss variation. The first notable result in this direction is the Prod algorithm, a small but meaningful modification of Hedge that replaces Hedge's exponential multiplicative update with its first-order approximation, a linear function of the last period's loss. The resulting regret bound has the same form as before, but the horizon is replaced by a known bound on the maximal quadratic variation of the losses across experts, where the quadratic variation of an expert is defined as the sum of its squared single-period losses. By using more elaborate doubling tricks, this bound on the quadratic variation can be inferred rather than assumed, yielding a bound that involves the maximal quadratic variation of the best experts throughout the game; that bound, however, includes additional factors that grow logarithmically with the horizon.
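For concreteness, the sketch below contrasts the two updates; it is a schematic illustration rather than the exact algorithms of the cited papers.

```python
import numpy as np

def hedge_update(weights, losses, eta):
    # Hedge: multiply each weight by an exponential factor of the last loss.
    return weights * np.exp(-eta * losses)

def prod_update(weights, losses, eta):
    # Prod: use the first-order approximation 1 - eta * loss instead.
    return weights * (1.0 - eta * losses)

w = np.ones(4)
losses = np.array([0.2, 0.9, 0.5, 0.0])
for update in (hedge_update, prod_update):
    w_new = update(w, losses, eta=0.1)
    print(update.__name__, w_new / w_new.sum())   # normalized to probabilities
```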
The authors of [49] derive an improved regret bound that depends on an adversarial analogue of the variance of a random variable. As noted in [25], this dependence is more natural because, for random losses, regret scales with the square root of the variance. Accordingly, their notion of variation for each expert is defined as the sum of squared deviations from the expert's average loss, a quantity that is necessarily smaller than the quadratic variation used in [25]. Their algorithm, Variation MW, exploits this per-expert variation to achieve the tighter bound.
This Multiplicative Weights (MW) approach refines Hedge by exponentially down-weighting experts based on both their variation and their cumulative loss. As a result, highly variable or consistently underperforming experts contribute less to the aggregate forecast, while stable, strong performers gain influence. If a prior bound on the variation of the best experts across the game is available, the resulting regret bound becomes a function of that variation.
This work thus replaces the time horizon in the standard regret bound with a variation-based quantity within the multiplicative weights framework. The authors further show that a prior upper bound on the variation is unnecessary: by employing sophisticated doubling tricks, they derive a regret bound that is worse only by an additive term that scales logarithmically with the horizon.
More recent work introduces a variation-based notion suitable for settings where the single-period losses of all experts change only slightly from round to round. This variation is defined by squaring the maximal per-round change and summing over all rounds. With Hedge run at a learning rate tied to this variation, the regret bound replaces the horizon with the total variation, and a standard doubling trick ensures that no prior knowledge of the variation is required. The bound is shown to be optimal, though it appears incomparable with the bound of [49].
Online convex optimization, introduced by Zinkevich, studies sequential decision making in which a learner picks a point from a compact convex subset of Euclidean space and incurs a loss given by an adversarially chosen convex function, with regret measured against the best fixed decision. The framework includes the best-experts setting, where decisions are probability vectors and losses are linear, and extends to problems such as portfolio selection, where the decision is an allocation vector and the loss is logarithmic, and online path planning, where a commuter chooses road segments each day. Although online path planning can be framed as a best-experts problem with exponentially many experts, the online convex optimization formulation allows convex optimization tools to be applied in devising regret-minimization algorithms.
In this setting, Zinkevich introduced the Online Gradient Descent (OGD) algorithm and showed that it achieves vanishing per-round regret. OGD, together with Hedge, can be viewed as a special case of a broader meta-algorithm called Regularized Follow the Leader (RFTL), which refines the greedy Follow the Leader strategy by adding a strongly convex regularizer to the cumulative loss before minimizing.
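The following minimal sketch shows projected Online Gradient Descent with linear losses over the probability simplex; the projection routine and the O(1/√t) step size are standard choices assumed here for illustration, not details fixed by the references above.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

rng = np.random.default_rng(1)
N, T = 3, 500
x = np.ones(N) / N                             # start at the uniform distribution
cum_loss_vec = np.zeros(N)
alg_loss = 0.0
for t in range(1, T + 1):
    l = rng.uniform(0.0, 1.0, size=N)          # adversary's linear loss vector
    alg_loss += float(x @ l)
    cum_loss_vec += l
    eta = 1.0 / np.sqrt(t)                     # O(1/sqrt(t)) step size
    x = project_to_simplex(x - eta * l)        # gradient step, then projection

print("regret:", alg_loss - cum_loss_vec.min())
```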
Robust Trading and Pricing in the Learning Literature
Research at the intersection of machine learning and finance has emphasized adversarial scenarios over stochastic problems, with trading usually modeled in discrete time. The bulk of the work focuses on robust portfolio selection algorithms with performance guarantees, while other studies address derivative pricing in an arbitrage-free market while keeping modeling assumptions minimal. Importantly, the arbitrage-free assumption allows strategies with robust guarantees to yield bounds on derivative prices, so the line between trading and pricing becomes blurred.
Portfolio selection is a central problem in this literature: an algorithm must trade in several assets in an adversarial market with the goal of maximizing returns. The straightforward optimal strategy, investing everything on each round in the asset that will gain the most next period, depends on future information and cannot be competed with. The learning objective therefore shifts from maximizing raw returns to minimizing regret with respect to a set of benchmark strategies, ensuring the online algorithm achieves returns close to the best benchmark in hindsight. In this financial setting, regret is measured as the ratio of final wealths (a comparison of percentage returns) rather than their difference. The aim is to choose a benchmark set that is rich and powerful, so that regret-minimization guarantees translate into strong performance.
Cover introduced the first robust portfolio selection method, using the class of constantly rebalanced portfolios (CRPs) as the benchmark. A CRP maintains a fixed fraction of wealth in each asset by rebalancing every period, so the best CRP performs at least as well as the best single asset. Cover's universal portfolio algorithm distributes wealth uniformly across all possible CRPs at the start and then simply lets each hypothetical CRP run. He shows that the ratio between the final wealth of the best CRP and that of the universal portfolio is upper bounded by a polynomial in the number of rounds; in other words, the wealths of the two algorithms have the same asymptotic growth rate.
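The benchmark itself is easy to simulate. The sketch below computes the final wealth of a CRP for a given mix of two assets and scans a grid of mixes to approximate the best CRP in hindsight; the price relatives are made-up illustrative data, not taken from the cited works.

```python
import numpy as np

def crp_wealth(price_relatives, b):
    """Final wealth of a constantly rebalanced portfolio with fixed fractions b,
    starting from wealth 1: the product of the per-period growth factors b . x_t."""
    return float(np.prod(price_relatives @ b))

# Two assets, made-up per-period price relatives (new price / old price).
x = np.array([[1.10, 0.95],
              [0.90, 1.05],
              [1.20, 1.00],
              [0.85, 1.02]])

grid = np.linspace(0.0, 1.0, 101)
wealths = [crp_wealth(x, np.array([a, 1.0 - a])) for a in grid]
best = int(np.argmax(wealths))
print(f"best CRP puts {grid[best]:.2f} in asset 1, final wealth {wealths[best]:.4f}")
print(f"asset 1 only: {crp_wealth(x, np.array([1.0, 0.0])):.4f}")
print(f"asset 2 only: {crp_wealth(x, np.array([0.0, 1.0])):.4f}")
```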
The computational complexity of the algorithm, however, is exponential in the number of assets. Cover's result prompted subsequent work that incorporated side information and transaction costs, proved that the regret of the universal portfolio is optimal, improved computational efficiency, and considered short selling [14, 32, 58, 73, 91]. The work of [52] tackled the problem with a different algorithm based on a simple multiplicative update rule; its complexity is linear in the number of assets, but its regret bounds are worse than those of the universal portfolio.
Portfolio selection was cast within the online convex optimization framework in [3]. In this formulation, decisions are probability vectors that allocate wealth across assets, and the algorithm's single-period loss is the negative logarithm of the dot product between its decision vector and the vector of asset price ratios for that period. The authors show that the Online Newton Step algorithm, which can be implemented efficiently, achieves regret that grows logarithmically with the number of rounds. This dependence on the horizon matches that of Cover's algorithm once the standard additive regret is translated into multiplicative terms. Moreover, the decision space is exactly the set of constantly rebalanced portfolios.
The variation-based results developed in [48] and [28] apply to portfolio selection when it is formulated as online convex optimization. Consequently, the horizon in the logarithmic regret bound can be replaced by a variation measure. In both works, the variation is defined over the assets' single-period price ratios, equivalently, the assets' percentage returns.
The result in [28] is stronger in the sense that the variation of [28] can be bounded by the variation of [48], but not vice versa. These bounds substantially improve over horizon-dependent bounds under the realistic assumption that variability is much smaller than the number of trading periods. Moreover, experiments with the algorithm of [48] show that its regret remains essentially unchanged as the trading frequency increases.
Benchmarks beyond constantly rebalanced portfolios have also been considered. The algorithms described in [84] achieve bounded regret with respect to the best switching regime among several fixed investment strategies, with and without transaction costs. This benchmark was further examined in [62] for a two-asset portfolio consisting of stock and cash. Additional results discussed later in the context of derivative pricing can also be interpreted as referring to different benchmark sets.
There are other approaches that instead seek to directly exploit the underlying statistics of the market [17, 44], but without assuming a specific price model. The authors of [44] show that their methods achieve the optimal asymptotic growth rate almost surely, assuming the markets are stationary and ergodic. Neither of these works, however, provides robust adversarial guarantees.
Several of the cited works include experiments on actual price histories, and their results underscore gaps in the existing theory. Specifically, the robust algorithms described in [3, 52, 84] outperformed the universal portfolio on real market data, and in some cases even surpassed the best constantly rebalanced portfolio. Similarly, the methods of [17, 44], despite their weaker guarantees, delivered remarkably high yields.
Derivative pricing in the Black-Scholes-Merton framework relies on constructing a trading strategy that exactly replicates the derivative's payoff and then using the arbitrage-free condition to determine its price. In adversarial settings, exact replication is generally impossible, yet the same pricing principle still yields meaningful bounds. An upper price bound is obtained from a strategy whose payoff super-replicates the derivative, with the setup cost of that strategy serving as the bound; similarly, a lower bound follows from a sub-replicating strategy, whose payoff is always at most the derivative's payoff.
Recent work by [80] shows that the Black-Scholes-Merton framework can be extended to an adversarial setting provided a tradable variance derivative exists; this derivative pays periodic dividends equal to the squared relative change in the stock price, and the investor's strategy involves trading both the stock and the derivative. The analysis applies in both discrete and continuous time, but it imposes strong smoothness assumptions on the price processes of both the stock and the derivative. The European call option was priced in a very general adversarial setting in the work of [35], where the discrete-time model has two parameters: a bound on the sum of the squared single-period returns of the stock (its quadratic variation) and a bound on its absolute single-period returns. The quadratic variation serves as an adversarial counterpart of stochastic volatility. Apart from these two constraining parameters, the model is completely adversarial and allows for price jumps and dependence.
Two methods in [35] yield upper bounds on the call option price. One converts a regret-minimization algorithm for the best-expert setting into a super-replication strategy, bounding the initial cost by the regret; the other directly computes the minimal super-replication cost of the option payoff, obtaining an optimal arbitrage-free upper bound through a recursive minimax price and strategy that can be approximated efficiently with dynamic programming. The trading strategy that achieves the optimal bound may use borrowing and short selling, whereas the regret-based strategy requires neither. The authors also derive a lower bound on the price for a specific strike using a tailored sub-replication strategy. The optimal upper bound of [35] is only slightly worse than the Black-Scholes-Merton price and exhibits a volatility smile; in some settings the regret-based price depends on the square root of the quadratic variation (like the Black-Scholes-Merton price for small variation), while the lower bound depends linearly on the quadratic variation and is therefore suboptimal.
This thesis extends the regret-minimization approach of [35] and develops its mathematical details in the financial part. The central observation is that robust call option pricing can be cast as a robust portfolio selection problem with two assets, stock and cash, and a simple benchmark set: hold only cash or hold only the stock. Given a guaranteed lower bound on the ratio between an algorithm's final wealth and that of the best strategy in the benchmark set, the no-arbitrage condition yields an upper bound on the price of a derivative that pays the final wealth of the best benchmark strategy. Because that payoff differs from the payoff of a call option by only a fixed amount, the strike price, the same reasoning bounds the price of the call option as well.
The principle used in [35] for a very specific benchmark set can be generalized to apply to any benchmark set. Therefore, the results for the portfolio selection algorithms described in [3, 28, 31, 48] imply arbitrage-free upper bounds on the price of a (theoretical) derivative that pays the same as the best constantly rebalanced portfolio. For the simple benchmark set of holding cash or holding the stock, the authors of

5 The regret minimization-based strategy does require that the investor possess the strike price in cash before trade begins.
Competitive Analysis and One-Way Trading
Competitive analysis, introduced in the classic work [85], seeks robust performance guarantees for online algorithms relative to the best offline algorithm, formalized through the competitive ratio, the worst-case ratio between the offline optimum and the online algorithm's performance [16]. This framework differs slightly from regret minimization, which minimizes the worst-case difference between the performance of the best algorithm in a benchmark set and that of the online algorithm. Despite long-observed similarities, competitive analysis and online learning have largely evolved separately, with only a few works offering a unified treatment [13, 22]. The themes of this thesis intersect competitive analysis in the one-way trading problem [36], where one asset is converted into another under fluctuating exchange rates with the goal of maximizing returns. The offline optimum simply sells at the highest rate, while the online algorithm strives to minimize the competitive ratio; the problem is equivalent to searching for the maximum of a sequence, since a gradual sale can be viewed as choosing a single selling point at random [36]. The results in [36] give an algorithm with the optimal competitive ratio when upper and lower bounds on the exchange rate are known, and extend to cases where only the ratio of the bounds is known and the horizon may be unknown.
In one-way trading, the gap between competitive analysis and regret analysis essentially disappears, and the problem can be formulated as a two-asset portfolio selection task. The natural benchmark comprises all one-way trading strategies or, more simply, all strategies that sell the first asset in a single trade.
The work of [36] is connected to option pricing in [66], which studies the search problem of identifying the k highest or lowest values. The authors extend this framework by applying competitive analysis to price floating-strike lookback calls, a class of lookback options whose payoff equals the difference between the stock price at expiration and the minimum price reached over the option's lifetime.
Contributions in This Dissertation
Contributions to the Theory of Regret Minimization
We establish universal lower bounds on individual-sequence regret in online linear optimization, that is, online convex optimization with linear loss functions; these bounds hold for any loss sequence. The results apply to any algorithm whose weight vector at time t+1 is the gradient of a concave potential function of the cumulative losses up to time t, a class that includes all linear Regularized Follow the Leader regret minimizers. Consequently, major regret-minimization methods such as Hedge and Online Gradient Descent are covered by the lower bounds.
We begin by proving that the algorithms in this class are precisely those that guarantee non-negative regret for any loss sequence. This result is surprising, because the class includes algorithms explicitly designed to minimize regret uniformly across all sequences.
A sharper trade-off result is obtained for the anytime regret, namely, the maximal regret attained during the game. We present a lower bound on the anytime regret that depends on the quadratic variation Q_T of the loss sequence and on the learning rate. In particular, we show that any learning rate that guarantees a regret upper bound of O(√T) over all sequences must incur Ω(√(Q_T)) anytime regret on any sequence with quadratic variation Q_T.
We prove our results for potentials with negative definite Hessians, and for potentials in the best-expert setting that satisfy natural regularity conditions. In the best-expert setting, the results are expressed via a translation-invariant version of the quadratic variation. We then apply these lower bounds to Hedge and to the linear-cost version of Online Gradient Descent.
Algorithms for scenarios with limited expert set complexity. Assuming a fully malicious adversary may be overly pessimistic, and refined assumptions are often appropriate. In the best expert setting, an adversary may be constrained by dependencies or redundancy within the expert pool, which reduces the effective number of competing experts below the nominal count. We therefore derive regret bounds that depend on the realized complexity of the expert class, defined in hindsight from the actual losses, rather than solely on the nominal number of experts.
Here we study two natural complexity regimes. In the first, complexity is measured by the number of distinct leading experts, those that are best at some point in time, and we derive regret bounds that depend only on this count, independent of the total number of experts. In the second, the experts cluster into a small number of groups according to their realized cumulative losses, and the regret depends only on the number of clusters, determined in hindsight. We show these bounds are tight by proving matching lower bounds on the expected regret against carefully constructed stochastic adversaries. Furthermore, our bounds improve on those of general-purpose algorithms such as Hedge when the number of clusters or leaders grows logarithmically with the total number of experts, and an extra layer of regret minimization can choose between the specialized algorithms and Hedge, achieving the better of the two bounds, up to constant factors, regardless of the adversary's behavior.
Our results are obtained as special cases of a more general analysis for a novel setting of branching experts, where the set of experts may grow over time according to a tree-like structure determined by an adversary. This setting is of independent interest, since it applies to online scenarios where new heuristics (experts) that are variants of existing heuristics become available over time. For this branching experts setting, we give algorithms and analysis that cover both the full-information and the bandit scenarios.
Contributions to Derivative Pricing
Pricing a variety of options. We apply a unified regret-minimization framework, based on the method of [35], to the pricing of a variety of options. We give variation-based upper bounds on the prices of several known options, namely, the exchange option, the shout option, and several types of Asian options. We derive these bounds by considering a security whose payoff is the maximum of several derivatives; we price this security using regret bounds with respect to the underlying derivatives and then show how to express the above options in terms of it.
Pricing convex path-independent derivatives. We establish robust, variation-based upper bounds on the prices of a broad class of derivatives whose payoff at expiration is a convex function of the underlying asset price. These bounds are obtained by relating the price of such a derivative to a robust upper bound on European call option prices, and they hold under the same minimal market assumptions.
We present a new family of online trading algorithms that combine a regret-minimization component with a one-way trading component (which may only sell). These algorithms come with regret guarantees not only against the best asset at the end of the horizon, but also against the maximal value achieved during the entire trading period. Translating these guarantees into the financial setting yields variation-based upper bounds on the prices of fixed-strike lookback options. Unlike strictly one-way traders, the resulting schemes are two-way trading algorithms that may both buy and sell, offering a novel approach to the problem of searching for the maximum of an online sequence (the problem studied in [36]). Moreover, we show that our methods can attain better competitive ratios than the optimal one-way trading algorithm of [36].
We present a method for applying regret bounds directly to pricing, by deriving a new formula that translates regret guarantees into the financial setting, where regret is multiplicative rather than additive. As a result, existing best-expert algorithms need not be specially modified to operate in trading environments, and their established regret bounds can be reused in the pricing context instead of proving new ones for modified algorithms.
We apply this method to obtain new variation-based price upper bounds that are applicable in broader settings than that of [35].
For the first time, we provide a robust lower bound on the price of at-the-money call options using regret-minimization techniques; this lower bound has the same asymptotic behavior as the robust upper bound of [35]. An at-the-money call option is one whose strike equals the stock price at the time of issue. The bound is obtained by combining our results on the anytime regret of individual sequences with the above method for translating regret bounds for best-expert algorithms into the financial setting. Like the Black-Scholes-Merton price, this bound depends on the square root of the quadratic variation, even though the underlying assumptions are minimal and adversarial rather than stochastic.
Outline of This Thesis
This work is organized into two parts: Part I delivers theoretical contributions to the theory of regret minimization itself, and Part II examines the applications of regret minimization to pricing derivatives.
Part I is organized as follows. In Chapter 2 we give some required definitions and background. Chapters 3 and 4 then present two distinct works.
Chapter 3 develops variation-based regret lower bounds that hold for any loss sequence and apply to a broad class of learners in the online linear optimization setting. These results were published in [42].
Chapter 4 develops improved regret bounds for best-expert scenarios involving redundant sets of experts. These results are obtained in a more general framework in which new experts can branch off from existing ones. The chapter is based on the work published in [43].
Part II opens with Chapter 5, which provides the necessary background and definitions. In Chapter 6, we apply a specially adapted regret-minimization algorithm (based on [35]) to derive variation-based upper bounds on the prices of a broad class of derivatives. These results build on the work published in [41].
Chapter 7 treats lookback option pricing, deriving upper price bounds via algorithms that combine regret minimization with one-way trading. These results are based on [40].
Chapter 8 shows how bounds on the loss and regret of algorithms for the best-expert setting can be directly translated into price bounds for financial instruments. This approach is then used to establish a variation-based lower bound on the price of a particular call option, as well as to derive new upper bounds on option prices. The chapter draws in part on results from [42].
This part is devoted to the purely theoretical aspects of regret minimization. We now formally introduce the essential definitions, fundamental facts, and notation that will be used in the remainder of the discussion.
Regret Minimization Settings
The best expert setting is a simple, well-researched framework that exemplifies regret minimization in online learning In this canonical model there are N experts (or actions), and at each time step t = 1,…,T an online algorithm A (the learner) selects a distribution p_t over the N experts Simultaneously, an adversary chooses a loss vector l_t = (l_{1,t},…,l_{N,t}) in R^N, and the learner incurs loss l_{A,t} = p_t · l_t (the inner product of p_t and l_t) This interaction, between the learner’s probabilistic decision and the adversary’s losses, forms the core of regret analysis against the best fixed expert in hindsight.
We write L_{i,t} = Σ_{τ=1}^t l_{i,τ} for the cumulative loss of expert i at time t, L_t = (L_{1,t}, …, L_{N,t}), and L_{A,t} = Σ_{τ=1}^t l_{A,τ} for the cumulative loss of A at time t. The regret of A at time t is R_{A,t} = L_{A,t} − min_j {L_{j,t}}. The aim of a regret minimization algorithm is to achieve small regret regardless of the sequence of loss vectors chosen by the adversary. The anytime regret of A is the maximal regret over time, namely max_t {R_{A,t}}. We will sometimes use m(t) = argmin_i {L_{i,t}}, where we take the smallest index in case of a tie. We also denote L*_t = L_{m(t),t}, the cumulative loss of the best expert after t steps.
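For concreteness, the regret and anytime regret can be computed directly from a table of losses and the learner's distributions, as in the following minimal sketch (the loss matrix and the uniform learner are placeholders, not part of the original text).

```python
import numpy as np

def regrets(loss_matrix, decision_matrix):
    """loss_matrix[t, i] = l_{i,t}; decision_matrix[t, i] = p_{i,t}.
    Returns the sequence (R_{A,t})_t and the anytime regret max_t R_{A,t}."""
    alg_cum = np.cumsum(np.sum(loss_matrix * decision_matrix, axis=1))   # L_{A,t}
    best_cum = np.cumsum(loss_matrix, axis=0).min(axis=1)                # L*_t
    r = alg_cum - best_cum                                               # R_{A,t}
    return r, float(r.max())

rng = np.random.default_rng(2)
T, N = 100, 4
losses = rng.uniform(0.0, 1.0, size=(T, N))
uniform = np.full((T, N), 1.0 / N)        # placeholder learner: always uniform
r_t, anytime = regrets(losses, uniform)
print("final regret:", r_t[-1], "anytime regret:", anytime)
```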
It is customary to impose an explicit bound on the range of single-period losses, namely l_{i,t} ∈ [0,1] for every i and t; this restriction is indeed assumed in Chapter 4, while Chapter 3 does not impose any explicit restriction on the losses' range.
In this setting, the learner's strategy can be viewed as a probability distribution over the experts, effectively a random choice among them guided by past performance. This decision is made with full information about the losses incurred by every expert so far. The Hedge algorithm is the most notable method for this setting, and its details are described below [38, 65, 89].
Parameters: A learning rate η > 0 and initial weights w_{i,1} > 0, 1 ≤ i ≤ N.
Given a bound on the absolute value of the single-period losses l_{i,t}, Hedge may be shown to have bounded regret regardless of the loss sequences chosen by the adversary.
If the learning rate η is tuned solely as a function of time, we achieve the so-called zero-order regret bounds, which have the form O(√(T ln N)).
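Since the update rule itself is only summarized above, the following minimal sketch spells out the standard exponential-weights update of Hedge with a fixed learning rate; the loss sequence and the specific tuning of η are illustrative assumptions.

```python
import numpy as np

def hedge(loss_matrix, eta):
    """Hedge / Randomized Weighted Majority with a fixed learning rate eta.
    loss_matrix[t, i] holds l_{i,t} in [0, 1]. Returns the algorithm's total loss."""
    T, N = loss_matrix.shape
    w = np.ones(N)                           # initial weights w_{i,1}
    total = 0.0
    for t in range(T):
        p = w / w.sum()                      # p_t: normalized weights
        total += float(p @ loss_matrix[t])
        w *= np.exp(-eta * loss_matrix[t])   # exponential multiplicative update
    return total

rng = np.random.default_rng(3)
T, N = 1000, 10
losses = rng.uniform(0.0, 1.0, size=(T, N))
eta = np.sqrt(8.0 * np.log(N) / T)           # a standard tuning for this horizon
alg_loss = hedge(losses, eta)
print("regret:", alg_loss - losses.sum(axis=0).min())
```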
In contrast to the full-information setting, the adversarial multi-armed bandit problem restricts the learner's observation in every round to the loss of the chosen expert, making feedback incomplete and learning harder. The Exp3 algorithm, a bandit adaptation of Hedge, applies exponential weighting to importance-weighted loss estimates and achieves a zero-order regret bound of order √(TN ln N).
Here we present a version of Exp3 taken from [21, Chapter 3], which is slightly simpler than the original formulation. With an appropriate choice of the time-varying learning rate η_t, this variant achieves the same bound for the weaker notion of pseudo-regret.
Parameters: A non-increasing sequence of real numbers η_1, η_2, …
Let p_1 be the uniform distribution over {1, …, N}.
For each round t = 1, 2, …:
1. Draw an action I_t from the probability distribution p_t.
2. For each action i = 1, …, N, compute the estimated loss l̃_{i,t} = (l_{i,t}/p_{i,t}) I{I_t = i} and update the estimated cumulative loss L̃_{i,t} = L̃_{i,t−1} + l̃_{i,t}.
3. Compute the new distribution over actions p_{t+1} = (p_{1,t+1}, …, p_{N,t+1}), where p_{i,t+1} = exp(−η_t L̃_{i,t}) / Σ_{k=1}^N exp(−η_t L̃_{k,t}).
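The listing above translates into a few lines of code. The following is a minimal sketch of this Exp3 variant; the particular non-increasing learning rate and the random loss matrix are illustrative choices, not requirements of the algorithm.

```python
import numpy as np

def exp3(loss_matrix, rng):
    """Exp3 with importance-weighted loss estimates and exponential weights,
    following the listing above. loss_matrix[t, i] = l_{i,t} in [0, 1]."""
    T, N = loss_matrix.shape
    est_cum = np.zeros(N)                      # estimated cumulative losses
    total = 0.0
    for t in range(1, T + 1):
        eta = np.sqrt(np.log(N) / (t * N))     # one standard non-increasing rate
        w = np.exp(-eta * (est_cum - est_cum.min()))   # shifted for stability
        p = w / w.sum()
        i = rng.choice(N, p=p)                 # draw action I_t ~ p_t
        total += loss_matrix[t - 1, i]
        est_cum[i] += loss_matrix[t - 1, i] / p[i]     # importance-weighted estimate
    return total

rng = np.random.default_rng(4)
T, N = 2000, 5
losses = rng.uniform(0.0, 1.0, size=(T, N))
print("regret:", exp3(losses, rng) - losses.sum(axis=0).min())
```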
The Online Linear Optimization Setting
Online linear optimization generalizes the best expert setting. In this framework, the online learning algorithm, or linear forecaster, selects at time t a weight vector x_t ∈ K, where K ⊂ R^N is compact and convex, and incurs the loss ⟨x_t, l_t⟩, where l_t is the loss vector chosen by the adversary. The regret of a linear forecaster A over T rounds is R_{A,T} = L_{A,T} − min_{u ∈ K} ⟨u, L_T⟩, where L_{A,T} is the cumulative loss of A and L_T is the cumulative loss vector. The best expert setting corresponds to K = Δ^N, the probability simplex over N elements.
A no-regret result for the online linear optimization setting is achieved by the Regularized Follow The Leader (RFTL) algorithm, defined below.
Parameters: A learning rate η > 0 and a strongly convex regularizer function R.
Update the weight vector x_{t+1} = arg min_{x∈K} {x·L_t + (1/η)R(x)}.
For a continuously twice-differentiable regularizer R, RFTL guarantees, for a proper choice of the learning rate η, a regret bound of O(√(λDT)), where D = max_{u∈K} {R(u) − R(x_1)} and λ = max_{t, x∈K} l_t^⊤ [∇²R(x)]^{−1} l_t; the bound thus links the geometry of the decision set, the curvature of the regularizer, and the adversary's loss vectors. Online linear optimization is itself a special case of online convex optimization. In that broader framework, the algorithm incurs a loss f_t(x_t), where f_t is a convex function chosen by the adversary, and the regret is Σ_t f_t(x_t) − min_{u∈K} Σ_t f_t(u).
The RFTL algorithm may be shown to generalize Hedge, where the regularizer function is chosen to be the negative entropy function, namely, R(x) = Σ_{i=1}^N x_i log x_i.
It also generalizes the Lazy Projection variant of the Online Gradient Descent algorithm (OGD), defined by the rule x_{t+1} = arg min_{x∈K} {‖x + ηL_t‖_2}, by choosing R(x) = (1/2)‖x‖_2^2. For an in-depth discussion of the RFTL algorithm, see [47].
We define the quadratic variation of the loss sequence l1, , lT as Q_T = sum_{t=1}^T ||l_t||_2^2 In the best-expert setting, we use a slightly different notion, the relative quadratic variation q_T = sum_{t=1}^T δ(l_t)^2, where δ(v) = max_i v_i − min_i v_i for any v ∈ R^N Note that q_T ≤ 2 Q_T holds always.
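Both quantities are straightforward to compute from a loss sequence; the following minimal sketch does so and checks the inequality q_T ≤ 2Q_T on random placeholder data.

```python
import numpy as np

def quadratic_variation(loss_matrix):
    """Q_T = sum_t ||l_t||_2^2 for loss_matrix[t, i] = l_{i,t}."""
    return float((loss_matrix ** 2).sum())

def relative_quadratic_variation(loss_matrix):
    """q_T = sum_t (max_i l_{i,t} - min_i l_{i,t})^2."""
    delta = loss_matrix.max(axis=1) - loss_matrix.min(axis=1)
    return float((delta ** 2).sum())

rng = np.random.default_rng(5)
l = rng.normal(size=(50, 4))
Q_T, q_T = quadratic_variation(l), relative_quadratic_variation(l)
print(Q_T, q_T)
assert q_T <= 2.0 * Q_T + 1e-12   # q_T <= 2 Q_T always holds
```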
We denote Q for a known lower bound on Q_T and q for a known lower bound on q_T.
Convex Functions
We mention here some basic facts about convex and concave functions that we will require For more on convex analysis, see [77], [19], and [71], among others.
We discuss functions defined on R^N A function f: C → R is convex if C is a convex subset of R^N and for every x,y in C and every λ in [0,1], f(λx + (1−λ)y) ≤ λ f(x) + (1−λ) f(y) The function f is concave if −f is convex The function f is strictly convex if the inequality is strict for x ≠ y and λ in (0,1) The function f is strongly convex with parameter α > 0 if for every x,y in C and λ in [0,1], f(λx + (1−λ)y) ≤ λ f(x) + (1−λ) f(y) − (α/2) λ(1−λ) ||x − y||^2.
Let f be differentiable on a convex set C. Then f is convex if and only if for all x, y ∈ C, ⟨∇f(y), y−x⟩ ≥ f(y) − f(x) ≥ ⟨∇f(x), y−x⟩, and f is strictly convex if these inequalities are strict whenever x ≠ y. If f is twice differentiable, convexity is equivalent to the Hessian ∇²f(x) being positive semidefinite for every x ∈ C, and f is α-strongly convex if and only if every eigenvalue of ∇²f(x) is at least α for all x ∈ C. The convex conjugate of f is defined by f*(y) = sup_{x ∈ dom f} {⟨x, y⟩ − f(x)}; it is always convex, and its effective domain is dom f* = {y : f*(y) < ∞}.
Miscellaneous notation. For x, y ∈ R^N, we denote [x, y] for the line segment between x and y, namely, {ax + (1−a)y : 0 ≤ a ≤ 1}. We use the notation conv(A) for the convex hull of a set A ⊆ R^N, that is, conv(A) = {Σ_{i=1}^k λ_i x_i : x_i ∈ A, λ_i ≥ 0, i = 1, …, k, Σ_{i=1}^k λ_i = 1}.

Seminorms
A seminorm on R^N is a function ‖·‖ : R^N → R with the following properties:
• Positive homogeneity: for every a ∈ R and x ∈ R^N, ‖ax‖ = |a|‖x‖.
• Triangle inequality: for every x, x′ ∈ R^N, ‖x + x′‖ ≤ ‖x‖ + ‖x′‖.
Clearly, every norm is a seminorm, and these properties imply that ‖x‖ ≥ 0 for all x and that ‖0‖ = 0. However, a seminorm does not require that ‖x‖ = 0 imply x = 0, so nonzero vectors may have zero seminorm. We will exclude the trivial all-zero seminorm from consideration. Consequently, there exists some x with ‖x‖ > 0, and by homogeneity, for every a > 0 there is a vector v with ‖v‖ = a.
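For example, the function δ(v) = max_i v_i − min_i v_i, used earlier to define the relative quadratic variation, is a seminorm: it is positively homogeneous, since for a < 0 the maximum and minimum simply exchange roles, and it satisfies the triangle inequality, since max_i(u_i + v_i) − min_i(u_i + v_i) ≤ (max_i u_i − min_i u_i) + (max_i v_i − min_i v_i). It is not a norm, because δ(v) = 0 for every constant vector v, including nonzero ones.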
Introduction
Relation to Regularized Follow the Leader
The class of concave-potential algorithms includes the important family of linear-cost Regularized Follow the Leader (RFTL) methods. The linear forecaster RFTL(η, R) updates its weights via x_{t+1} = g(L_t) = arg min_{x ∈ K} {⟨x, L_t⟩ + R(x)/η}. Following [47], we assume R is strongly convex and twice continuously differentiable. The next theorem shows that linear RFTL is a concave-potential algorithm, with a potential function directly related to the convex conjugate of the regularizing function. These properties are well known (e.g., Lemma 15 in [81]), and a calculus-based proof is provided in Section 3.5 for completeness.
Theorem 3.4. If R : K → R is continuous and strongly convex and η > 0, then Φ(L) = (−1/η)R*(−ηL) is concave and continuously differentiable on R^N, and for every L ∈ R^N, it holds that ∇Φ(L) = arg min_{x∈K} {x·L + R(x)/η} and Φ(L) = min_{x∈K} {x·L + R(x)/η}.
It is now possible to lower bound the regret of RFTL(η, R) by applying the lower bounds of Corollary 3.2 and Theorem 3.3.
Theorem 3.5. The regret of RFTL(η, R) satisfies

R_{RFTL(η,R),T} ≥ (1/η)(R(x_{T+1}) − R(x_1)) + x_{T+1}·L_T − min_{u∈K} {u·L_T},

and if R* is continuously twice-differentiable on conv({−ηL_0, …, −ηL_T}), then
Proof. Let Φ be the potential function of RFTL(η, R) according to Theorem 3.4. We have that Φ(L_t) = x_{t+1}·L_t + R(x_{t+1})/η and x_{t+1} = ∇Φ(L_t) for every t. Therefore,

Φ(L_T) − Φ(L_0) = x_{T+1}·L_T + (1/η)R(x_{T+1}) − x_1·L_0 − (1/η)R(x_1) = (1/η)(R(x_{T+1}) − R(x_1)) + x_{T+1}·L_T,

where we used the fact that L_0 = 0. Therefore,
≥ 0, where the first inequality is by Corollary 3.2, and the second inequality is by the fact that x_1 = arg min_{u∈K} {R(u)} and x_{T+1} ∈ K. This concludes the first part of the proof.
By Theorem 3.4, Φ(L) = −(1/η)R*(−ηL). Thus, Φ is continuously twice-differentiable on conv({L_0, …, L_T}), and we have ∇²Φ(L) = −η∇²R*(−ηL). By the first part and by Theorem 3.3, for some z_t ∈ [L_{t−1}, L_t],
Note that the first-order regret term is split into two new non-negative terms, namely, (R(x_{T+1}) − R(x_1))/η and x_{T+1}·L_T − min_{u∈K} {u·L_T}.
The first-order regret lower bound of Theorem 3.5 may also be established directly by extending the Follow the Leader, Be the Leader (FTL-BTL) Lemma of [59] (see also [47]). The extended lemma and its proof are given in Subsection 3.5.1.