Some Notes on Dueling Bandits


The dueling bandit problem naturally fits the description of a variety of recommendation systems that require "learning on the fly" yet have no explicit access to a "reward" model. Instead, the "human" feedback takes the form of "choices", "votes", or some other discrete signal. Hence, a learning protocol oftentimes proceeds over time steps \(t = 1, \ldots, T\):

  • The algorithm chooses a pair of arms \( a_i, a_j \) from \(K\) available ones;
  • The oracle/human feedback/nature reveals the winner: arm \( a_i \) with probability \( P(a_i \succ a_j)\), or arm \( a_j \) with probability \( P(a_j \succ a_i) = 1 - P(a_i \succ a_j) \).

So the feedback is either \(a_i\) or \( a_j \). The preferences form a matrix \(P \in \mathbb{R}^{K \times K}\) whose entries satisfy \( P_{i,j} + P_{j,i} = 1\), and this matrix defines the hidden information of the dueling bandit problem. The cumulative regret in the stochastic dueling bandit setting is: $$ \mathcal{R}_T = \sum_{t=1}^T \left( P( a^* \succ a^t_i ) - \tfrac{1}{2} \right) + \left( P( a^* \succ a^t_j ) - \tfrac{1}{2} \right), $$ where \(a^* \) is usually the Condorcet winner, i.e., \( P( a^* \succ a_j) > \frac{1}{2} \ \ \forall j \in [K], a_j \neq a^*\). The Condorcet winner is a pretty straightforward idea, but it might not exist in general. To see that, suppose there are three candidates, A, B, and C, and three voters with the following preferences:

  1. Voter 1: A > B > C
  2. Voter 2: B > C > A
  3. Voter 3: C > A > B

|       | Voter 1 | Voter 2 | Voter 3 |
|-------|---------|---------|---------|
| A > B | X       |         | X       |
| B > C | X       | X       |         |
| C > A |         | X       | X       |

In this example, let’s check the pairwise comparisons:

  1. A vs. B: B is preferred by Voter 2, but A is preferred by Voters 1 and 3.
  2. B vs. C: B is preferred by Voters 1 and 2, but C is preferred by Voter 3.
  3. C vs. A: C is preferred by Voters 2 and 3, but A is preferred by Voter 1.

Since no candidate consistently beats all others in pairwise comparisons, there is no Condorcet winner in this example. Transforming this example into the dueling bandits context, we can have, for instance, a cyclic relation \(A \succ B \succ C \succ A \) with the preference matrix below.

|   | A   | B   | C   |
|---|-----|-----|-----|
| A | 0.5 | 0.7 | 0.2 |
| B | 0.3 | 0.5 | 0.6 |
| C | 0.8 | 0.4 | 0.5 |
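
As a quick sanity check, here is a minimal Python sketch (the function name is mine) that encodes the cyclic preference matrix above and searches for a Condorcet winner by testing whether any row beats all other arms with probability above 0.5:

```python
import numpy as np

# Cyclic preference matrix from the A, B, C example above:
# entry P[i, j] = probability that arm i beats arm j.
P = np.array([
    [0.5, 0.7, 0.2],   # A beats B, loses to C
    [0.3, 0.5, 0.6],   # B beats C, loses to A
    [0.8, 0.4, 0.5],   # C beats A, loses to B
])

def condorcet_winner(P):
    """Return the index of the Condorcet winner, or None if it does not exist."""
    K = P.shape[0]
    for i in range(K):
        if all(P[i, j] > 0.5 for j in range(K) if j != i):
            return i
    return None

print(condorcet_winner(P))  # None: the cycle A > B > C > A has no Condorcet winner
```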

Apparently the existence of a Condorcet winner requires a row whose off-diagonal entries are all greater than \(0.5\). While in many cases such a winner/solution concept might not be suitable, let's first dive into the algorithmic design for finding it when it exists. There are basically two styles of algorithms, asymmetric or symmetric:

The asymmetric style conceptually separates the two choices into a reference arm and an exploration arm. The reference arm acts as a summary of historical pulls. Typical algorithms of this type include IF, BtM, SAVAGE, Doubler, RUCB, MergeRUCB, RCS, and DTS. The exploration strategy is designed to maximize the efficiency of identifying the best arm.

Interleaved Filter (IF) and Beat the Mean (BtM)

The very first two methods proposed were Interleaved Filter (IF) 1 and Beat the Mean (BtM) 2. Both assume there is a total ordering of the arms, i.e., we can relabel the arms as \(a_1, \ldots, a_K\) such that \( p_{i,j} > 0.5 \) for all \( i < j\). Under this assumption the Condorcet winner is \(a_1\).

The IF method is basically a type of "hill climbing" method that iteratively updates a reference arm \(\hat{a}\) by comparing it with other arms until it becomes the most plausible Condorcet winner. The term "interleaved" loosely originates from an online ranking evaluation technique (used at Netflix, for example) that blends the outputs of several rankers for the same group of users and then compares the share of viewing hours attributable to each ranker. IF picks a reference arm \(\hat{a}\) randomly and maintains a set of "Condorcet candidates" to compare against the reference arm, together with confidence intervals \( \hat{C}_t := [ \hat{P}_{\hat{a}, a} - c_t, \hat{P}_{\hat{a}, a} + c_t] \), where \(c_{t}=\sqrt{\log (1 / \delta) / t}\). It then iteratively compares the reference arm with all candidates, gradually filtering out arms that are dominated beyond the confidence interval, i.e., \( \hat{P}_{\hat{a}, a} - c_t > \frac{1}{2} \), and updating the reference arm whenever \(\hat{P}_{\hat{a}, a} + c_t < \frac{1}{2}\). The cornerstone idea is to prove that after a logarithmic number of comparisons, the winner of any pairwise match is identified correctly with probability \(1 - \delta \). To see this, let \(n\) be the number of comparisons and, for \(t \in \mathbb{N}_+\), let \( \mathcal{E}_t\) be the event that \( \hat{P}_t - c_t < \frac{1}{2} \), which is the condition for the match to continue after \(t\) comparisons; therefore $$ Pr( n \geq t) \leq Pr( \mathcal{E}_t ). $$

Once \(t = \left\lceil m \log \left(T K^{2}\right) / \Delta_{i, j}^{2}\right\rceil\) with \( m \geq 4\), we have \( c_t \leq \Delta_{i,j}/2\), so the true \( p_{i,j} \) falls outside the confidence interval \(\hat{C}_t\) on the event \(\mathcal{E}_t\), and (using \(\mathbb{E}[\hat{P}_t] = p_{i,j}\)) $$ \begin{align*} Pr(\mathcal{E}_t) & = Pr( \hat{P}_t - p_{i,j} \leq c_t - \Delta_{i,j} ) \\ & = Pr( \mathbb{E}[\hat{P}_t] - \hat{P}_t \geq \Delta_{i,j} - c_t) \\ & \leq Pr( |\mathbb{E}[\hat{P}_t] - \hat{P}_t| \geq \Delta_{i,j}/2 ) \\ &\leq 2 \exp \left(-t \Delta_{i, j}^{2} / 2\right) \\ &\leq 2 \exp \left(-m \log \left(T K^{2}\right)\right) \\ & =2 /\left(T K^{2}\right)^{m} . \end{align*} $$ Taking \( m = \max\{ 4, d \}\), we get \( Pr( n \geq t) \leq K^{-d }\).
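To make the mechanics concrete, here is a minimal sketch of an IF-style filtering loop, assuming a `duel(x, y)` oracle that returns 1 when arm `x` wins; the function and constant choices are illustrative, not the exact procedure from the paper:

```python
import math
import random

def interleaved_filter(K, duel, horizon):
    """Simplified IF-style loop: keep a reference arm, duel it against the
    remaining candidates, and use confidence intervals to prune or switch."""
    delta = 1.0 / (horizon * K * K)        # confidence level used in the IF analysis
    ref = random.randrange(K)              # random initial reference arm
    candidates = [a for a in range(K) if a != ref]
    wins = {a: 0 for a in candidates}      # wins of ref against each candidate
    plays = {a: 0 for a in candidates}
    while candidates:
        for a in list(candidates):
            wins[a] += duel(ref, a)        # duel(x, y) returns 1 if x beats y, else 0
            plays[a] += 1
            p_hat = wins[a] / plays[a]
            c = math.sqrt(math.log(1 / delta) / plays[a])
            if p_hat - c > 0.5:            # ref confidently beats a: prune a
                candidates.remove(a)
            elif p_hat + c < 0.5:          # a confidently beats ref: a becomes new ref
                candidates = [b for b in candidates if b != a]
                ref = a
                wins = {b: 0 for b in candidates}
                plays = {b: 0 for b in candidates}
                break
    return ref                             # the survivor is the estimated Condorcet winner
```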

BtM leverages the concept of the Borda score: $$ b(a_i) = \frac{1}{K}\sum_{j} p_{i, j} $$ and two facts:

  • the Condorcet winner cannot be a Borda loser, in the sense that its Borda score must be greater than \(0.5\);
  • the Condorcet winner stays if some other arms are “removed”.

Therefore, as long as we keep eliminating Borda losers, nothing will be left except the Condorcet winner.
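
A toy sketch of this elimination idea (not the exact BtM procedure, which eliminates only when confidence bounds permit) could look like the following, reusing a preference matrix `P` like the one above:

```python
import numpy as np

def borda_elimination(P, rng=np.random.default_rng(0), duels_per_round=2000):
    """Toy elimination loop: estimate each surviving arm's average win rate
    against the other survivors from sampled duels, then drop the worst arm.
    The real BtM algorithm eliminates only when confidence bounds allow it."""
    active = list(range(P.shape[0]))
    while len(active) > 1:
        wins = {a: 0 for a in active}
        counts = {a: 0 for a in active}
        for _ in range(duels_per_round):
            i, j = rng.choice(active, size=2, replace=False)
            winner = i if rng.random() < P[i, j] else j
            wins[winner] += 1
            counts[i] += 1
            counts[j] += 1
        scores = {a: wins[a] / max(counts[a], 1) for a in active}
        active.remove(min(active, key=lambda a: scores[a]))  # drop the Borda loser
    return active[0]
```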

The problem with IF is that its theoretical guarantee requires a Strong Stochastic Transitivity (SST) property: for any triple \( i < j < k\), \( \Delta_{i,k } \geq \max \{ \Delta_{i, j}, \Delta_{j,k} \}\). BtM only requires a relaxed version: there exists \(\gamma \geq 1\) such that for all pairs \( (j,k)\) with \( 1 < j < k\), we have \( \gamma \Delta_{1,k} \geq \max\{ \Delta_{1, j}, \Delta_{j,k} \}\). Here \(\gamma\) measures the hardness of the problem: the larger \(\gamma\) is, the smaller the relevant gaps can become, and the harder it is to identify the better arm while dueling.

The following regret bounds have been proven already, given \(\Delta_{\min}\) and \( \gamma\), $$ \begin{align*} \mathbb{E}[ \mathcal{R}_T^{IF} ] & \leq \mathcal{O}\left( \frac{K \log T}{\Delta_{\min} }\right), \\ \mathcal{R}_T^{BtM} & \leq \mathcal{O}\left( \frac{\gamma^7 K \log T}{\Delta_{\min }}\right) \quad \text{ with high probability}. \end{align*} $$

Sensitivity Analysis of VAriables for Generic Exploration (SAVAGE) and Doubler

SAVAGE works in a way similar to BtM: if we know there is a Condorcet winner, any arm that loses with high probability can be safely eliminated from further consideration. So we can compare arms in a round-robin (all-to-all) fashion and drop pairs of arms as long as it is "safe" to do so. Its regret bound is of order \( \mathcal{O}( K^2 \log T)\), which is not tight; however, it empirically outperforms IF and BtM by a wide margin when the number of arms is moderate.

Doubler converts dueling bandits into a conventional multi-armed bandit problem, under the assumption that the preferences are linear functions of underlying utilities associated with the arms; for example, \( \Delta_{A,B} = (\mu_A - \mu_B)/2\), where \(\mu_A\) is the mean utility of arm \(A\). Doubler proceeds in epochs of exponentially increasing size (the "doubling trick"). In each epoch, the left arm is sampled from a fixed distribution, while the right arm is chosen by an ordinary bandit algorithm minimizing regret against the left arm.
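
A rough sketch of the Doubler reduction, assuming a single-player bandit learner exposing `select()` and `update(arm, reward)` methods (the interface names are mine, not from the paper):

```python
import random

def doubler(K, duel, total_rounds, make_bandit):
    """Doubler-style reduction: epochs of doubling length; the left arm is drawn
    uniformly from the arms the bandit played in the previous epoch, while the
    right arm is chosen by an ordinary bandit algorithm."""
    bandit = make_bandit(K)
    left_pool = list(range(K))     # in epoch 1, the left arm is uniform over all arms
    t, epoch_len = 0, 1
    while t < total_rounds:
        new_pool = []
        for _ in range(epoch_len):
            if t >= total_rounds:
                break
            left = random.choice(left_pool)
            right = bandit.select()
            reward = duel(right, left)     # 1 if the bandit's arm wins the duel
            bandit.update(right, reward)
            new_pool.append(right)
            t += 1
        left_pool = new_pool or left_pool
        epoch_len *= 2                     # the "doubling trick"
    return bandit
```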

There is also a bunch of algorithms based on UCB (Upper Confidence Bound) variants. The fundamental principle of UCB-type algorithms is that one can always build a confidence interval for the statistic of interest, depending on how the sampling procedure unfolds. (People write tons of papers about different procedures that are essentially variants of the same thing, and then someone tries to unify the different frameworks. Always interesting to keep this thread going.)
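For instance, RUCB-style methods keep an optimistic estimate of every pairwise win probability; a sketch of that bookkeeping (with an illustrative exploration constant `alpha`) might be:

```python
import numpy as np

def pairwise_ucb(wins, alpha=0.51):
    """Upper confidence bounds on P[i beats j] from a matrix of win counts:
    wins[i, j] = number of duels between i and j that i has won."""
    n = wins + wins.T                                  # total duels between i and j
    t = max(n.sum() / 2, 1)                            # total number of duels so far
    with np.errstate(divide="ignore", invalid="ignore"):
        p_hat = np.where(n > 0, wins / n, 0.5)         # empirical win probabilities
        bonus = np.sqrt(alpha * np.log(t) / np.maximum(n, 1))
    u = np.minimum(p_hat + bonus, 1.0)                 # optimistic estimate of P
    np.fill_diagonal(u, 0.5)
    return u
```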

RMED and Sparring

The Relative Minimum Empirical Divergence (RMED) algorithm has been proven to have optimal asymptotic regret. First of all, the authors of 3 constructed a nuanced lower-bound argument (I browsed their paper, it is a little bit hard to comprehend…) to show that the lower bound for the dueling bandit problem is characterized by $$ \liminf_{T \rightarrow \infty} \frac{\mathbb{E}[R(T)]}{\log T} \geq \sum_{i \in[K] \backslash \{1\}} \min_{j \in \mathcal{O}_{i}} \frac{\Delta_{1, i}+\Delta_{1, j}}{2 d\left(\mu_{i, j}, 1 / 2\right)}, $$ where \( d(\mu_{i,j}, \frac{1}{2})\) is the KL divergence between Bernoulli distributions, which characterizes how "easy" it is to tell which one of \(i\) or \(j\) is better. The trick to match this lower bound is sort of like the plug-in principle: design an algorithm that directly uses the empirical divergence as a criterion to pick the Condorcet winner.
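Here \(d(p, q)\) denotes the KL divergence between Bernoulli distributions; a tiny helper makes the intuition concrete: the closer \(\mu_{i,j}\) is to \(1/2\), the smaller the divergence and the larger the lower bound.

```python
import math

def bernoulli_kl(p, q):
    """KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12                       # guard against log(0)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# The closer mu_ij is to 1/2, the smaller d(mu_ij, 1/2) is,
# and the more comparisons the lower bound says we need.
print(bernoulli_kl(0.7, 0.5), bernoulli_kl(0.51, 0.5))
```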

The Sparring series belongs to the symmetric algorithms; it is inspired by the observation that the dueling bandit problem is just an instance of a symmetric game. A symmetric game is one where the payoffs for playing a particular strategy depend only on the other strategies employed, not on who is playing them. That is, if any two players were to switch their strategies with one another, their payoffs would switch as well, leaving the overall outcome of the game unchanged. This is intuitively understandable because you can always view the dueling recommendations as actions played by two players, and the eventual goal is to come up with a "draw" that no one wants to deviate from.

Therefore, any no-regret dynamics would lead to empirical convergence to the equilibrium of such a game; oftentimes the regret is of the form \( \mathcal{O}(\sqrt{T}) \). I am going to write a post later to discuss adversarial bandit problems and my understanding of them. What is surprising is that empirical experiments show adversarial algorithms seem to achieve a logarithmic regret rate despite the proven \( \mathcal{O}(\sqrt{T}) \) bound.

In general, given an algorithm \( \mathcal{A} \) that solves the adversarial bandit problem, we can use it to solve the dueling bandit problem by placing a row player and a column player sparring with each other, each proposing its bet for the best action; a player gets reward \(1\) if its arm wins and \(0\) otherwise. After each comparison is carried out, each player uses \(\mathcal{A}\) to update its strategy. In this fashion, we are sort of obtaining a sampling estimate of the preference matrix, or at least extracting some information about it through sparring; it is just that this sampling procedure is symmetric.
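A minimal Sparring sketch, using EXP3 as the underlying adversarial bandit algorithm \(\mathcal{A}\) for both players (the class and parameter choices are illustrative):

```python
import math
import random

class EXP3:
    """Basic EXP3 learner used as the adversarial bandit algorithm A."""
    def __init__(self, K, gamma=0.1):
        self.K, self.gamma = K, gamma
        self.weights = [1.0] * K

    def select(self):
        total = sum(self.weights)
        self.probs = [(1 - self.gamma) * w / total + self.gamma / self.K
                      for w in self.weights]
        return random.choices(range(self.K), weights=self.probs)[0]

    def update(self, arm, reward):
        est = reward / self.probs[arm]                 # importance-weighted reward
        self.weights[arm] *= math.exp(self.gamma * est / self.K)
        norm = max(self.weights)                       # rescale to avoid overflow
        self.weights = [w / norm for w in self.weights]

def sparring(K, duel, T):
    """Row and column players each run EXP3; the duel outcome is their reward."""
    row, col = EXP3(K), EXP3(K)
    for _ in range(T):
        i, j = row.select(), col.select()
        row_wins = duel(i, j)                          # 1 if arm i beats arm j
        row.update(i, row_wins)
        col.update(j, 1 - row_wins)
    return row, col
```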

A few thoughts

Dueling bandits, as a variant of the multi-armed bandit problem, is relevant because preference feedback is nowadays much easier to obtain. It sort of gave rise to the concept of Reinforcement Learning from Human Feedback, and I believe ChatGPT has benefited significantly from that. I personally think the whole RLHF thing doesn't really need a theoretical foundation unless it is truly instructive. Entertainment-wise, I love the advancement of dueling bandits in that it gives a theoretical formalism for dealing with discrete types of data. We'll see where it leads us.


  1. Yue, Y., Broder, J., Kleinberg, R., & Joachims, T. (2012). The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5), 1538-1556.

  2. Yue, Y., & Joachims, T. (2011). Beat the mean bandit. In Proceedings of the 28th International Conference on Machine Learning (ICML-11) (pp. 241-248).

  3. Komiyama, J., Honda, J., Kashima, H., & Nakagawa, H. (2015). Regret lower bound and optimal algorithm in dueling bandit problem. In Conference on Learning Theory (pp. 1141-1154). PMLR.