# Value-sensitive exploration in multi-armed bandits: Utility to SMS routing

## What the research is:

Many companies, including Facebook, send text messages to their users for phone number verification, two-factor authentication, and notifications. To deliver these SMS messages, companies generally use aggregators (e.g., Twilio) that have contracts with mobile operators around the world. These aggregators are responsible for delivering the messages and differ in their quality and cost attributes, where quality in this context is the probability that a message will be successfully delivered to a user. An important decision for the company is choosing the best aggregator through which to route these messages. A significant challenge, however, is the non-stationarity of aggregator quality, which changes significantly over time. This calls for a balanced exploration-exploitation approach, in which we continually learn which aggregator is best at any given point in time while routing as many messages as possible through it. Multi-armed bandits (MAB) are a natural framework in which to formulate this problem. However, the existing multi-armed bandit literature mainly focuses on optimizing a single objective function and does not readily generalize to a setting with multiple metrics, such as quality and cost.

Motivated by this problem, in this article we propose a novel variant of the MAB problem that takes into account the cost of playing an arm, and we introduce new metrics that capture the requirements of several real-world applications. We characterize the hardness of this problem by establishing fundamental limits on the performance of any online algorithm. We also show that a naive generalization of existing algorithms performs poorly from both a theoretical and an empirical point of view. Finally, we propose a simple algorithm that balances two asymmetric objectives and achieves near-optimal performance.

## How it works:

In the traditional (stochastic) multi-armed bandit problem, the learning agent has access to a set of K actions (arms) with unknown but fixed reward distributions and must repeatedly select an arm to maximize the cumulative reward. The challenge is to develop a policy that balances the tension between acquiring information about actions with few historical observations and exploiting the most rewarding arm based on existing information. The regret metric is typically used to measure the effectiveness of such an adaptive policy. In short, regret measures the cumulative difference between the expected reward of the best action, had the actual reward distributions been known, and the expected reward of the action chosen by the policy. The existing literature has examined this setting in depth and has produced simple yet extremely effective algorithms such as Upper Confidence Bound (UCB) and Thompson Sampling (TS), which have been further generalized and applied to a variety of application areas. The central mechanism in these algorithms is to incentivize sufficient exploration of promising actions. In particular, these approaches ensure that the best action always has a chance to recover from situations in which its expected reward is underestimated due to unfavorable randomness. However, there is also a fundamental limit on the performance of any online learning algorithm in general settings. When the number of decision epochs is small and/or there are many actions with reward distributions similar to that of the optimal action, it becomes difficult to learn the optimal action effectively. Mathematically speaking, it is known that in the traditional problem any online learning algorithm must incur a regret of Ω(√(KT)), where K is the number of arms and T is the number of decision epochs.
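To make the traditional setting concrete, here is a minimal sketch of the UCB algorithm described above for Bernoulli-reward arms. This is an illustrative implementation of standard UCB1, not code from the paper; the arm means and horizon are made-up example values.

```python
import math
import random

def ucb(arm_means, horizon, seed=0):
    """Minimal UCB1 sketch for Bernoulli arms (illustrative only).

    arm_means: true success probabilities, unknown to the learner.
    Returns the cumulative pseudo-regret against the best arm.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k        # times each arm was played
    totals = [0.0] * k      # sum of observed rewards per arm
    best = max(arm_means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1     # play each arm once to initialize estimates
        else:
            # choose the arm with the highest optimistic (mean + bonus) estimate
            arm = max(range(k),
                      key=lambda a: totals[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
        regret += best - arm_means[arm]
    return regret
```

The optimism bonus shrinks as an arm accumulates plays, so under-explored arms keep getting chances while the empirically best arm dominates in the long run, yielding regret that grows only logarithmically in T rather than linearly.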

In this paper we generalize the MAB framework to the multi-objective setting. Specifically, we consider a setting in which the learning agent must balance both the traditional trade-off between exploration and exploitation and a multi-objective trade-off between reward and cost. In the SMS application, to manage costs, the agent may be indifferent between actions whose expected reward (quality) is greater than a 1 − α fraction of the highest expected reward (quality). We call α the subsidy factor and assume it is a known parameter given by the problem domain. The agent's goal is to explore the various actions and play the cheapest arm among these high-quality arms as often as possible. To measure the performance of a policy, we define two notions of regret: quality regret and cost regret. Quality regret is the cumulative difference between the α-adjusted expected reward of the highest-quality action and the expected reward of the action chosen by the policy. Similarly, cost regret is the cumulative difference between the cost of the action chosen by our policy and the cost of the cheapest feasible arm, had the qualities and costs been known in advance.
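The two regret notions above can be sketched in a few lines. This is an illustrative rendering of the definitions (function and variable names are our own, and the paper's formal definitions may differ in minor details such as where the positive part is taken):

```python
def regrets(qualities, costs, plays, alpha):
    """Compute quality regret and cost regret for a sequence of plays.

    qualities: true expected reward (delivery rate) of each arm.
    costs:     known per-play cost of each arm.
    plays:     sequence of arm indices chosen by a policy.
    alpha:     subsidy factor; arms with quality >= (1 - alpha) * best
               quality are considered feasible.
    """
    q_best = max(qualities)
    feasible = [i for i, q in enumerate(qualities) if q >= (1 - alpha) * q_best]
    c_star = min(costs[i] for i in feasible)   # cost of cheapest feasible arm
    # quality regret: shortfall against the alpha-adjusted best quality
    quality_regret = sum(max(0.0, (1 - alpha) * q_best - qualities[a]) for a in plays)
    # cost regret: overpayment relative to the cheapest feasible arm
    cost_regret = sum(max(0.0, costs[a] - c_star) for a in plays)
    return quality_regret, cost_regret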

For this problem we show that a naive extension of existing algorithms such as TS incurs high cost regret. In particular, we consider the variant of TS in which we determine the set of feasible arms based on the corresponding quality estimates and select the cheapest feasible arm. For this variant we show that TS performs arbitrarily poorly (i.e., incurs linear cost regret). This is primarily because existing algorithms incentivize exploration of promising actions, which can result in high cost regret in settings where there are two actions with similar rewards but very different costs. We then establish a fundamental lower bound of Ω(K^(1/3) T^(2/3)) on the performance of any online learning algorithm for this problem, highlighting its hardness relative to the classic MAB problem. Building on this insight, we develop a simple algorithm based on the idea of explore-then-commit, which balances the tension between the two asymmetric objectives and achieves near-optimal performance up to logarithmic factors. We also demonstrate the superior performance of our algorithm through numerical simulations.
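The explore-then-commit idea can be sketched as follows. This is a simplified illustration of the general scheme, not the paper's exact algorithm (names, the choice of exploration budget, and the feasibility rule are our own simplifications): play every arm a fixed number of times, then commit to the cheapest arm whose estimated quality clears the subsidized threshold.

```python
import random

def explore_then_commit(true_q, costs, horizon, alpha, explore_rounds, seed=0):
    """Sketch of an explore-then-commit policy for the cost-subsidy setting.

    Phase 1: play each arm `explore_rounds` times to estimate its quality.
    Phase 2: commit to the cheapest empirically feasible arm for the rest
    of the horizon.  Returns the full play sequence and the committed arm.
    """
    rng = random.Random(seed)
    k = len(true_q)
    est = [0.0] * k
    # exploration phase: uniform plays of every arm
    for arm in range(k):
        wins = sum(rng.random() < true_q[arm] for _ in range(explore_rounds))
        est[arm] = wins / explore_rounds
    # commit phase: cheapest arm within (1 - alpha) of the best estimate
    best_est = max(est)
    feasible = [i for i in range(k) if est[i] >= (1 - alpha) * best_est]
    chosen = min(feasible, key=lambda i: costs[i])
    plays = [arm for arm in range(k) for _ in range(explore_rounds)]
    plays += [chosen] * (horizon - k * explore_rounds)
    return plays, chosen
```

Unlike UCB or TS, this policy stops exploring entirely after the first phase, so an expensive arm can never be revisited once a cheaper feasible arm is identified; setting the exploration budget to grow roughly like T^(2/3) matches the Ω(K^(1/3) T^(2/3)) lower bound up to logarithmic factors.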

## Why it matters:

Multi-armed bandits (MAB) are the most popular paradigm for balancing the trade-off between exploration and exploitation that is essential for online decision-making under uncertainty. They have been applied to a wide variety of settings, including drug trials and online experimentation, to ensure that the most promising candidate receives the maximum number of samples. Similar trade-offs arise in recommendation systems, where the available recommendation options and user preferences are constantly evolving. Facebook has also used the MAB framework to improve various products, including determining the ideal video bandwidth allocation for users and the best aggregator for sending authentication messages. While our problem could be modeled in the traditional MAB framework by treating the reward minus the cost as a modified objective, such a modification does not always make sense, especially in settings where the reward and the cost associated with an action represent different quantities (e.g., the quality and the cost of an aggregator). For such problems, it is natural for the learner to optimize both metrics, typically avoiding an exorbitant cost for a small increase in cumulative reward. To the best of our knowledge, this paper takes a first step toward generalizing the multi-armed bandit framework to problems with two metrics, presenting both fundamental theoretical performance limits and easy-to-implement algorithms that balance the multi-objective trade-offs.

Finally, we run extensive simulations to understand different regimes of the problem parameters and to compare different algorithms. In particular, we identify scenarios where naive generalizations of UCB and TS, commonly used in real-world implementations, perform well, as well as settings where they perform poorly, which is of interest to practitioners.

## Read the full paper:

Multi-armed bandits with cost subsidies
