Strategies to Identify the Best Dish: Error Bounds for Best Arm Identification in Multi-Armed Bandits

Billy Fang
Princeton University

Abstract

In the multi-armed bandit model, an agent is presented with a number of hidden distributions and in each turn chooses a distribution and observes a sample from it. This paper focuses on a particular setup called the best arm identification problem, in which the agent has a fixed number of turns to sample from the distributions in order to determine which distribution has the highest mean. We present and discuss upper bounds and lower bounds for the probability of error.

Introduction

Suppose you are planning to go to the same restaurant for lunch every day this month, and each day you order one of the dishes on the menu. The tastiness of each dish follows a hidden probability distribution, and each time you eat a dish, the tastiness you experience is drawn from that dish’s underlying distribution. Each day you are free to choose whichever dish you want, and can base your decision upon your past experience. At the end of the month, you are asked to use your experience to identify the dish with the highest expected tastiness.

This framework is a variant of the multi-armed bandit model. The name of the model is based on the phrase “one-armed bandit,” which is slang for a slot machine; one could reformulate the above example in terms of a set of slot machines (which we call arms) which return rewards according to hidden distributions.

The defining characteristic of the multi-armed bandit problem is the trade-off between “exploration” and “exploitation.” If the agent explores all the arms too much in order to gain knowledge about their underlying distributions, he risks incurring bad rewards. On the other hand, if he exploits what he believes to be the best arms in order to get high rewards, he risks missing a greater reward due to his limited knowledge about the other arms.
The problem of identifying the best arm after a fixed number of trials, described above, is known as the best arm identification problem. A different well-studied problem (which we will not consider) is that of maximizing the cumulative reward. The main difference between these two problems is that in the latter the performance metric (cumulative reward) is tied to the rewards incurred during the decision-making process, while in the former the performance metric (the final identification) is separated from the “testing phase.” For example, consider clinical trials, in which possibly negative effects on each patient must be minimized, compared to cosmetic trials, in which poor results during the isolated testing phase are of little consequence and only help with the final decision of which product to place on the market.

The multi-armed bandit model and the best arm identification problem can model a host of learning problems, such as selecting the best communication channel from a set of noisy channels, or a general reinforcement learning scenario where an agent can observe the random rewards of different actions in a trial period in order to decide which action to ultimately select. In this paper we will provide bounds for the probability of error in the best arm identification problem.

Principia: The Princeton Undergraduate Mathematics Journal

1 Problem statement and notation

Let $\nu_1, \dots, \nu_K$ be $K$ probability distributions with respective means $\mu_1, \dots, \mu_K$. Without loss of generality we will assume that the means are ordered and that there is a unique optimal mean. That is,
$$\mu_1 > \mu_2 \ge \mu_3 \ge \cdots \ge \mu_K,$$
where ties are broken arbitrarily. We will assume that for each $i \in \{1, \dots, K\}$, the distribution $\nu_i$ is Gaussian with variance 1 and mean $\mu_i$ in $[-1, 1]$. An agent, who has no knowledge of the $\mu_i$, is given a budget of $n$ rounds.
For each round $t \in \{1, \dots, n\}$, the agent pulls an arm $I_t \in \{1, \dots, K\}$, and observes a reward drawn from the distribution $\nu_{I_t}$, independent from past actions and observations. After the $n$ rounds, the agent returns an arm $J_n$ that he believes corresponds to the distribution with the highest mean.

We define the gap of a suboptimal arm $i \ne 1$ to be the difference between the optimal mean and its mean:
$$\Delta_i := \mu_1 - \mu_i.$$
For $i = 1$, we define $\Delta_1 := \Delta_2$. In particular, we then have
$$\Delta_1 = \Delta_2 \le \Delta_3 \le \cdots \le \Delta_K \le 2,$$
where the last inequality holds because the $\mu_i$ lie in $[-1, 1]$.

We let $T_i(t)$ denote the number of times arm $i$ was chosen in rounds 1 to $t$, and we let $X_{i,1}, X_{i,2}, \dots, X_{i,T_i(t)}$ be the corresponding sequence of rewards observed from choosing arm $i$. The empirical mean of arm $i$ after $s$ pulls is denoted by
$$\hat{\mu}_{i,s} := \frac{1}{s} \sum_{t=1}^{s} X_{i,t}.$$
To assess the success of the agent’s choice $J_n$, we define the probability of error by
$$e_n := P(J_n \ne 1).$$
We would like to find policies (i.e., strategies) for the agent that ensure a small probability of error. We will only consider deterministic policies, that is, under a given policy and given past observations, the agent’s choice of action is fixed.

The bounds that we find for the probability of error depend on the particular multi-armed bandit. Intuitively, if the suboptimal arms have means close to the best mean (i.e., the $\Delta_i$ are small), it is difficult to distinguish the best arm from the others; likewise, if the $\Delta_i$ are large, the task is easier. We define two measures of hardness that capture this notion:
$$H_1 := \sum_{i=1}^{K} \frac{1}{\Delta_i^2} \quad \text{and} \quad H_2 := \max_{i} \frac{i}{\Delta_i^2}.$$
These two quantities are equivalent up to a logarithmic factor (Audibert and Bubeck, 2010):
$$H_2 \le H_1 \le \overline{\log}(K) H_2 \le \log(2K) H_2,$$
where $\overline{\log}(K) := \frac{1}{2} + \sum_{i=2}^{K} \frac{1}{i}$. We will later see that these quantities are indeed appropriate measures of hardness.
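To make the two hardness measures concrete, the following Python sketch computes the gaps, $H_1$, $H_2$, and the quantity $\overline{\log}(K) = \frac{1}{2} + \sum_{i=2}^{K} \frac{1}{i}$ for a list of arm means. The function names and the example means are illustrative assumptions, not part of the original paper.

```python
import math

def gaps(means):
    """Gaps Delta_1 = Delta_2 <= ... <= Delta_K for a list of arm means."""
    mu = sorted(means, reverse=True)
    if len(mu) < 2 or mu[0] == mu[1]:
        raise ValueError("need at least two arms and a unique best arm")
    delta = [mu[0] - m for m in mu]
    delta[0] = delta[1]  # convention from the text: Delta_1 := Delta_2
    return delta

def hardness(means):
    """Return (H1, H2): sum of 1/Delta_i^2 and max over i of i/Delta_i^2."""
    delta = gaps(means)
    h1 = sum(1.0 / d ** 2 for d in delta)
    h2 = max(i / d ** 2 for i, d in enumerate(delta, start=1))
    return h1, h2

def log_bar(K):
    """The harmonic-type quantity 1/2 + sum_{i=2}^{K} 1/i."""
    return 0.5 + sum(1.0 / i for i in range(2, K + 1))

if __name__ == "__main__":
    means = [1.0, 0.9, 0.8]  # hypothetical arm means
    h1, h2 = hardness(means)
    print(h1, h2, log_bar(len(means)) * h2)
```

For these hypothetical means the gaps are $(0.1, 0.1, 0.2)$, so $H_1 = 100 + 100 + 25 = 225$ and $H_2 = \max(100, 200, 75) = 200$, consistent with $H_2 \le H_1 \le \overline{\log}(K) H_2$.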
2 Upper bounds for best arm identification

2.1 Uniform allocation

We begin by presenting a naïve policy known as uniform allocation, which simply pulls each arm $n/K$ times and returns the arm with the highest empirical mean.

Theorem 2.1. Under the uniform allocation policy, the probability of error in the best arm identification problem satisfies
$$e_n \le \sum_{i=2}^{K} \exp\left(-\frac{n \Delta_i^2}{2K}\right) \le K \exp\left(-\frac{n \Delta_2^2}{2K}\right).$$

Proof. If $J_n \ne 1$, then the empirical mean of arm $J_n$ was at least as high as that of arm 1. Therefore,
$$\{J_n \ne 1\} \subset \bigcup_{i=2}^{K} \{\hat{\mu}_{1,n/K} \le \hat{\mu}_{i,n/K}\}.$$
Applying a union bound and a Hoeffding-type bound (Theorem A.1) produces the stated inequality.
$$\begin{aligned}
e_n := P(J_n \ne 1) &\le \sum_{i=2}^{K} P(\hat{\mu}_{1,n/K} \le \hat{\mu}_{i,n/K}) && \text{(union bound)} \\
&= \sum_{i=2}^{K} P(\hat{\mu}_{i,n/K} - \hat{\mu}_{1,n/K} - (\mu_i - \mu_1) \ge \Delta_i) \\
&\le \sum_{i=2}^{K} \exp\left(-\frac{n \Delta_i^2}{2K}\right) && \text{(Theorem A.1)} \\
&\le K \exp\left(-\frac{n \Delta_2^2}{2K}\right).
\end{aligned}$$

We can rephrase the result of this theorem as follows. To ensure that $e_n < \delta$ for some $\delta > 0$, we require the budget $n$ to satisfy
$$n \ge \frac{2K}{\Delta_2^2} \log\left(\frac{K}{\delta}\right).$$
The fact that this bound only depends on $\Delta_2$ is unsatisfying, since it does not take into account how suboptimal the other arms are. For instance, this policy has the same type of performance on the two sets of arm means $\{1, 0.9, 0.8\}$ and $\{1, 0.9, -1\}$, because $\Delta_2 = 0.1$ in both cases, even though in the latter case it is much easier to distinguish the worst arm from the rest. This shortcoming is due to uniform allocation’s inability to adapt its strategy to the observed rewards.

2.2 Successive Rejects

An issue with the uniform allocation strategy is that it does not adapt its behavior upon observing the rewards and only checks them after exhausting the budget. If there is an arm that is extremely suboptimal, this strategy will waste turns pulling this arm even after observing that it returns very suboptimal rewards. The Successive Rejects algorithm addresses this issue.
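As a baseline for the comparison with Successive Rejects, the uniform allocation policy of Section 2.1 can be simulated in a few lines. The sketch below is illustrative only: it assumes unit-variance Gaussian arms as in the problem statement, rounds $n/K$ down to an integer, and uses hypothetical function names, a fixed seed, and an arbitrary number of Monte Carlo trials.

```python
import numpy as np

def uniform_allocation(means, n, rng):
    """Pull each arm floor(n / K) times; return the index of the highest empirical mean."""
    means = np.asarray(means, dtype=float)
    K = means.size
    pulls = n // K
    if pulls < 1:
        raise ValueError("budget n must be at least the number of arms K")
    # Each row holds one unit-variance Gaussian reward from every arm.
    rewards = rng.normal(loc=means, scale=1.0, size=(pulls, K))
    return int(np.argmax(rewards.mean(axis=0)))

def empirical_error(means, n, trials=2000, seed=0):
    """Monte Carlo estimate of e_n = P(J_n != best arm) under uniform allocation."""
    rng = np.random.default_rng(seed)
    best = int(np.argmax(means))
    mistakes = sum(uniform_allocation(means, n, rng) != best for _ in range(trials))
    return mistakes / trials

if __name__ == "__main__":
    # The two sets of arm means discussed in Section 2.1.
    for means in ([1.0, 0.9, 0.8], [1.0, 0.9, -1.0]):
        print(means, empirical_error(means, n=300))
```

Running the script prints an estimated error rate for each of the two example bandits, which gives a rough sense of how much of the budget uniform allocation spends on an arm that is clearly suboptimal.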
In the Successive Rejects (SR) algorithm (Audibert and Bubeck, 2010), the budget of $n$ rounds is divided into $K - 1$ phases. The agent keeps track of an “active set” of arms that initially contains all the arms. In each phase, he tries all the arms in the active set equally often, and eliminates the arm with the lowest empirical mean. By the end of the last phase, he will have eliminated all but one arm, which he will return as $J_n$. The procedure is intuitive, since the agent needs to spend more time trying the arms that are closer to the optimal arm in order to properly determine which one is the best. The lengths of the phases are chosen carefully to ensure a good bound.

In the notation below, $n_k$ denotes the number of times the $k$th eliminated arm is pulled. Note that $n_{K-1} = n_K$ (the two arms that stay in the active set through all the rounds are pulled equally often). Moreover, in the $k$th phase, each of the active arms is pulled $n_k - n_{k-1}$ times, so the length of the $k$th phase is $(n_k - n_{k-1})(K + 1 - k)$, where we have defined $n_0 = 0$. Below, $A_k$ denotes the active set during phase $k$.

Successive Rejects algorithm

Let $A_1 := \{1, \dots, K\}$, $\overline{\log}(K) := \frac{1}{2} + \sum_{i=2}^{K} \frac{1}{i}$, $n_0 := 0$. For $k \in \{1, \dots, K - 1\}$, let
$$n_k := \left\lceil \frac{1}{\overline{\log}(K)} \cdot \frac{n - K}{K + 1 - k} \right\rceil.$$
In each phase $k = 1, \dots, K - 1$,
– Pull each active arm $i \in A_k$ for $n_k - n_{k-1}$ rounds.
– Let $A_{k+1}$ be the result of removing $\arg\min_{i \in A_k} \hat{\mu}_{i,n_k}$ from $A_k$ (ties broken arbitrarily).
Let $J_n$ be the unique element of $A_K$, and return it.

Note that the number of rounds does not exceed the budget:
$$n_1 + n_2 + \cdots + n_{K-1} + n_{K-1} \le K + \frac{n - K}{\overline{\log}(K)} \left( \frac{1}{2} + \sum_{k=1}^{K-1} \frac{1}{K + 1 - k} \right) = n.$$

Theorem 2.2. Under the Successive Rejects algorithm, the probability of error in the best arm identification problem satisfies
$$e_n \le \frac{K(K - 1)}{2} \exp\left(-\frac{n - K}{2\,\overline{\log}(K) H_2}\right).$$

Proof. At the beginning of the $k$th phase, we will have already eliminated $k - 1$ arms, so at least one of the worst $k$ arms will still be in the active set.
Therefore, if the optimal arm is eliminated at the end of the $k$th phase, then we must have
$$\hat{\mu}_{1,n_k} = \min_{i \in A_k} \hat{\mu}_{i,n_k} \le \max_{i \in \{K, K-1, \dots, K+1-k\}} \hat{\mu}_{i,n_k}.$$
If we let $E_k$ denote the event that arm 1 was eliminated in phase $k$, then by what we have just shown,
$$E_k \subset \bigcup_{i=K+1-k}^{K} \{\hat{\mu}_{1,n_k} \le \hat{\mu}_{i,n_k}\}.$$
Therefore,
$$\begin{aligned}
e_n := P(A_K \ne \{1\}) &\le \sum_{k=1}^{K-1} P(E_k) && \text{(union bound)} \\
&\le \sum_{k=1}^{K-1} \sum_{i=K+1-k}^{K} P(\hat{\mu}_{1,n_k} \le \hat{\mu}_{i,n_k}) && \text{(union bound)} \\
&= \sum_{k=1}^{K-1} \sum_{i=K+1-k}^{K} P(\hat{\mu}_{i,n_k} - \hat{\mu}_{1,n_k} - (\mu_i - \mu_1) \ge \Delta_i) \\
&\le \sum_{k=1}^{K-1} \sum_{i=K+1-k}^{K} \exp(-n_k \Delta_i^2 / 2) && \text{(Theorem A.1)} \\
&\le \sum_{k=1}^{K-1} k \exp(-n_k \Delta_{K+1-k}^2 / 2).
\end{aligned}$$
To conclude the proof, note that by the definition of $n_k$ and $H_2$, we have
$$n_k \Delta_{K+1-k}^2 \ge \frac{n - K}{\overline{\log}(K)} \cdot \frac{1}{(K + 1 - k)\, \Delta_{K+1-k}^{-2}} \ge \frac{n - K}{\overline{\log}(K) H_2},$$
and that $\sum_{k=1}^{K-1} k = K(K - 1)/2$.

We can again rephrase the result of this theorem as follows. To ensure $e_n < \delta$ for some $\delta > 0$, we require the budget $n$ to satisfy
$$n \ge 2 H_1 \log\left(\frac{K(K - 1)}{2\delta}\right) + K,$$
where we have used $H_1 \le \overline{\log}(K) H_2$. If we compare this with the analogous result for uniform allocation, we see that $H_1$ has replaced $K / \Delta_2^2$. But since $\Delta_2 \le \Delta_i$ for all $i \in \{1, \dots, K\}$, we have $H_1 \le K / \Delta_2^2$. This essentially captures the reason why the SR algorithm performs better than uniform allocation. For all bandits that have the same gap $\Delta_2$, uniform allocation has the same performance, regardless of whether arms $i \ge 3$ are extremely suboptimal or extremely close to optimal; however, SR takes into account all gaps, and performs differently depending on the value of $H_1$. Note that we have equality $H_1 = K / \Delta_2^2$ when $\Delta_2 = \Delta_3 = \cdots = \Delta_K$, which is precisely when SR does not have the advantage of rejecting suboptimal arms.

3 Lower bounds for best arm identification

We would like to find a lower bound on the probability of error for a given bandit under all policies, or in other words, a “ceiling” on the performance of any policy on a given bandit.
To do this, we consider an “oracle” agent that is given extra information about the bandit, and therefore will perform better than a normal agent. Any lower bound on the probability of error for the oracle agent will therefore be a lower bound for that of a normal agent.

Our motivation for finding a lower bound is twofold. First, whereas constructing good algorithms/policies gives upper bounds for the probability of error, a lower bound provides a benchmark against which upper bounds can be compared; note that without a lower bound we currently have no context to evaluate the performance of the SR algorithm. Second, if we do find a lower bound that is close to the upper bound given by the SR algorithm, we will have shown that $H_1$ and $H_2$ are indeed “correct” measures of hardness.

Note that our definition of lower bound is currently meaningless, since it is possible that $e_n = 0$. For example, if an agent is presented with the arms (unordered), and he follows the policy that simply identifies the third arm as best regardless of the observations, then $e_n = 0$ for any bandit in which the third arm is indeed the best arm. However, this policy is clearly not “best,” since it has probability of error equal to 1 on any other bandit. We need to redefine the notion of lower bound in a way that is more meaningful.

A natural way to resolve this issue is to introduce an adversarial aspect. Given a bandit, we reveal all the arm distributions to the agent. Then, given the policy that the agent chooses, the adversary considers all bandits whose arms are a permutation of the original bandit’s arms, and chooses the permutation that maximizes the policy’s probability of error. We seek a lower bound for this maximum, over all policies.
This avoids the issue described above because if the policy were to always identify the third arm as best, the adversary would simply choose any permutation of the original bandit such that the third arm is not the best, producing a probability of error equal to 1. Moreover, it is reasonable to assume that a good policy performs similarly on any permutation of the arms (note that permutations do not change the hardness measures $H_1$ or $H_2$).

The resulting lower bound, due to Audibert and Bubeck (2010), is comparable to the upper bound in Theorem 2.2. Unfortunately, the proof of this bound is rather lengthy and involved, so we instead present an alternate approach (Bubeck, private communication). Instead of considering all permutations of the arms, the adversary considers certain “translations” of the arms, and returns the one that maximizes the given policy’s probability of error. The resulting lower bound is also comparable to the upper bound in Theorem 2.2, and the proof is much shorter than the previous one. What we possibly sacrifice is that the class of translations is “farther away” from the original bandit than the class of permutations of a bandit. We will elaborate on this below. For simplicity we will assume that the arm distributions are Gaussian.

3.1 Lower bound

We change our definitions of the gaps and hardness measures slightly: we redefine $\Delta_1 := 0$ and redefine $H_1 := \sum_{i=2}^{K} \Delta_i^{-2}$ (the redefined $H_1$ is equivalent to the original one by a factor of $(K - 1)/K$ in the worst case, so the change is small for large $K$). We still assume that there is a unique best arm, i.e., $\Delta_2 > 0$.

Let $\nu := \nu_1 \otimes \cdots \otimes \nu_K$ denote the bandit with arm distributions $\nu_1, \nu_2, \dots, \nu_K$. We define a translation operation $\tau_k$ on $\nu$, for each $k \in \{1, \dots, K\}$:
$$(\tau_k \nu)_i := \begin{cases} \nu_i + 2\Delta_i & \text{if } i = k, \\ \nu_i & \text{if } i \ne k, \end{cases}$$
where $\nu_k + 2\Delta_k$ denotes translating the support of the distribution by $+2\Delta_k$.
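The effect of the translation $\tau_k$ on the arm means can be illustrated with a short Python sketch. It represents a bandit only by its list of means (an assumption that is adequate for unit-variance Gaussian arms), uses 1-based arm indices as in the text, and computes the redefined $H_1$ before and after each translation. The function names and the example means are hypothetical.

```python
def translate(means, k):
    """Means of the translated bandit tau_k(nu): arm k is shifted up by twice its gap.

    `means` lists mu_1 > mu_2 >= ... >= mu_K; `k` is 1-based as in the text.
    """
    gap = means[0] - means[k - 1]
    shifted = list(means)
    shifted[k - 1] = means[k - 1] + 2.0 * gap
    return shifted

def h1_redefined(means):
    """Sum of inverse squared gaps over the suboptimal arms (the redefined H1)."""
    best = max(means)
    gaps = [best - m for m in means if m != best]
    if len(gaps) != len(means) - 1:
        raise ValueError("the best arm must be unique")
    return sum(1.0 / g ** 2 for g in gaps)

if __name__ == "__main__":
    mu = [0.5, 0.3, -0.2]  # hypothetical means, already sorted
    print("original", mu, h1_redefined(mu))
    for k in (1, 2, 3):
        translated = translate(mu, k)
        print("tau_%d" % k, translated, h1_redefined(translated))
```

In this example, translating arm 2 moves its mean from 0.3 to roughly 0.7, just above the old best mean of 0.5, while translating arm 1 leaves the bandit unchanged.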
We see that $\tau_k$ simply fixes everything except arm $k$, whose mean is translated above that of the old best arm, making it the new best arm. Moreover, $\tau_1$ is the identity.

Let $\tau_k \nu$ denote the bandit obtained by performing the translation $\tau_k$ on the arms of the original bandit, and let $e_n(\tau_k \nu)$ and $H_1(\tau_k \nu)$ be the probability of error and hardness on $\tau_k \nu$ respectively. Since each $\tau_k$ increases the size of the gaps, we have $H_1(\tau_k \nu) \le H_1(\nu)$. This is intuitive, because translating an arm far above all the others will make the new problem easier. We will describe the importance of this property after presenting the proof.

Theorem 3.1. Let $\nu$ be a bandit whose arm distributions are $\nu_i \sim \mathcal{N}(\mu_i, 1)$ with $\mu_i \in [-1, 1]$, for $i \in \{1, \dots, K\}$. Given any policy, we have
$$\max_{k} e_n(\tau_k \nu) \ge \frac{1}{4} \exp\left(-\frac{2n}{H_1(\nu)}\right).$$

The proof of this theorem uses results involving Kullback-Leibler divergence, which, loosely speaking, is a type of distance between two probability distributions. We defer these results to the appendix of Fang (2014).

Proof. Recall that $T_i(n)$ is the number of times arm $i$ was pulled. Choose $k \ne 1$ such that
$$E_\nu[T_k(n)] \le \frac{n}{\Delta_k^2 H_1(\nu)}.$$
Such a $k$ exists because otherwise, the expected number of pulls of each arm $i \ne 1$ exceeds $n / (\Delta_i^2 H_1(\nu))$, which implies
$$n = \sum_{i=1}^{K} T_i(n) \ge \sum_{i=2}^{K} E_\nu[T_i(n)] > \frac{n}{H_1(\nu)} \sum_{i=2}^{K} \frac{1}{\Delta_i^2} = n,$$
a contradiction. In plain words, we are choosing $k$ to be an arm (excluding arm 1) that is pulled rarely relative to its gap, which is where the algorithm is more likely to make a mistake.

We let $\mathcal{L}(Y_k)$ be the distribution of the $n$-dimensional vector $Y_k$ of observed rewards when the algorithm runs on the translated bandit $\tau_k \nu$. Using the chain rule for Kullback-Leibler divergence (see the appendix of Fang (2014)), we can calculate the divergence between the reward distribution of the original bandit (translation $\tau_1$) and that of this translated bandit (translation $\tau_k$ for $k \ne 1$).
$$\mathrm{KL}(\mathcal{L}(Y_1), \mathcal{L}(Y_k)) = 2\Delta_k^2 \, E_\nu[T_k(n)] \le \frac{2n}{H_1(\nu)},$$
where the last step is due to our choice of $k$.

We are now equipped to finish the proof. In a translated bandit $\tau_k \nu$ where $k \ne 1$, one way the agent could be incorrect is if he chooses $J_n = 1$, because arm 1 is no longer the best.
$$\begin{aligned}
2 \cdot \max_{j} e_n(\tau_j \nu) &\ge 2 \cdot \max\{e_n(\nu), e_n(\tau_k \nu)\} \\
&\ge e_n(\nu) + e_n(\tau_k \nu) \\
&\ge P_\nu(J_n \ne 1) + P_{\tau_k \nu}(J_n = 1) \\
&\ge \frac{1}{2} \exp(-\mathrm{KL}(\mathcal{L}(Y_1), \mathcal{L}(Y_k))) && \text{(Lemma A.2)} \\
&\ge \frac{1}{2} \exp\left(-\frac{2n}{H_1(\nu)}\right).
\end{aligned}$$

As mentioned earlier, we know $H_1(\tau_k \nu) \le H_1(\nu)$. Therefore, the theorem implies that there exists $k$ such that
$$e_n(\tau_k \nu) \ge \frac{1}{4} \exp\left(-\frac{2n}{H_1(\nu)}\right) \ge \frac{1}{4} \exp\left(-\frac{2n}{H_1(\tau_k \nu)}\right).$$
So, the theorem gives a lower bound for a translated bandit in terms of the hardness of that translated bandit. If the translation did not decrease $H_1$, we would not be able to arrive at such a conclusion.

As mentioned before, this lower bound is comparable to the upper bound given by the SR algorithm, which suggests that $H_1$ and $H_2$ are appropriate measures of hardness.

The proof of this bound is considerably shorter than that of the lower bound by Audibert and Bubeck (2010). Although both proofs involve a perturbation and the control of the same quantities (the number of pulls of an arm and the Kullback-Leibler divergence), the shorter proof only works with true Kullback-Leibler divergences and expectations of the $T_i(n)$, while the longer proof delves into realizations of random variables and empirical estimates of the Kullback-Leibler divergence. Moreover, it is much easier to control the two quantities when dealing with translations rather than permutations, since only one arm is affected.
What is possibly lost is that the class of translates is “farther away” from the original bandit than the class of permutations. Whereas it is easy to reason that a good policy should perform similarly on any permutation of a given bandit, it is in some sense harder to justify why a good policy should perform similarly on any translate of a given bandit.

Conclusions

In this paper, we introduced the best arm identification problem for multi-armed bandits. We described the Successive Rejects algorithm and gave an upper bound on its probability of error on the best arm identification problem. We also compared two approaches to finding lower bounds for the best arm identification problem. Moreover, we noted that the resulting lower bounds are close to the upper bound, suggesting that $H_1$ and $H_2$ are good measures of the complexity of the bandit.

A discussion of how to generalize these results to the more general $m$-best arms identification problem, where the goal is to identify the best $m$ arms rather than the best arm, can be found in Fang (2014). A further extension of $m$-best arms identification is combinatorial identification, where the returned set of arms must satisfy some constraint; for example, suppose the weights of edges of a connected graph follow hidden distributions, and we want to identify the spanning tree with the highest expected weight. This area of research is currently open.

References

Jean-Yves Audibert and Sébastien Bubeck. Best Arm Identification in Multi-Armed Bandits. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), 2010.

Sébastien Bubeck. Private communication.

Billy Fang. Error Bounds for Identification Problems in Multi-Armed Bandits. 2014.

A Lemmas

Proofs of these lemmas and other relevant background results can be found in the appendix of Fang (2014).

Theorem A.1 (Hoeffding-type bound).
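If $X_1, \dots, X_s$ are independent sub-Gaussian random variables with means $\mu_t$ and common parameter $\sigma^2$, then for any $\varepsilon > 0$,
$$P\left( \frac{1}{s} \sum_{t=1}^{s} (X_t - \mu_t) \ge \varepsilon \right) \le \exp\left(-\frac{s \varepsilon^2}{2\sigma^2}\right).$$

Lemma A.2. Let $\rho_0$ and $\rho_1$ be two probability distributions supported on some set $\mathcal{X}$. Then for any measurable function $\psi : \mathcal{X} \to \{0, 1\}$,
$$P_{X \sim \rho_0}(\psi(X) = 1) + P_{X \sim \rho_1}(\psi(X) = 0) \ge \frac{1}{2} \exp(-\mathrm{KL}(\rho_0, \rho_1)).$$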
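As a sanity check on Lemma A.2 in the Gaussian setting used throughout, the following Python sketch evaluates both sides of the inequality for two unit-variance Gaussians and a threshold test $\psi(x) = 1\{x > c\}$. It relies on the standard closed form $\mathrm{KL}(\mathcal{N}(\mu_0, 1), \mathcal{N}(\mu_1, 1)) = (\mu_0 - \mu_1)^2 / 2$; the function names, the choice of threshold tests, and the grid of values are illustrative assumptions rather than part of the original argument.

```python
import math

def normal_cdf(x):
    """Standard normal CDF, written with the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def kl_unit_gaussians(mu0, mu1):
    """KL divergence between N(mu0, 1) and N(mu1, 1)."""
    return 0.5 * (mu0 - mu1) ** 2

def threshold_test_error(mu0, mu1, c):
    """P_{rho0}(psi = 1) + P_{rho1}(psi = 0) for the test psi(x) = 1 if x > c else 0."""
    return (1.0 - normal_cdf(c - mu0)) + normal_cdf(c - mu1)

def lemma_a2_lower_bound(mu0, mu1):
    """Right-hand side of Lemma A.2 for two unit-variance Gaussians."""
    return 0.5 * math.exp(-kl_unit_gaussians(mu0, mu1))

if __name__ == "__main__":
    for d in (0.5, 1.0, 2.0, 4.0):
        c = d / 2.0  # midpoint threshold, the best threshold test here
        print(d, threshold_test_error(0.0, d, c), lemma_a2_lower_bound(0.0, d))
```

The same closed form explains the factor in the proof of Theorem 3.1: shifting a unit-variance Gaussian arm by $2\Delta_k$ costs $(2\Delta_k)^2 / 2 = 2\Delta_k^2$ in divergence per pull of that arm.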