duelpy.algorithms

Various algorithms to solve Preference-Based Multi-Armed Bandit Problems.

Package Contents

Classes

Algorithm

Parent class of all the implemented PB-MAB algorithms.

ApproximateProbability

Implementation of the Approximate probability algorithm.

BeatTheMeanBandit

Implements the Beat the Mean Bandit algorithm.

BeatTheMeanBanditPAC

The PAC variant of the Beat the Mean Bandit algorithm.

BinarySearchRanking

Implementation of the Binary Search Ranking algorithm.

BordaRanking

Implementation of the Borda ranking algorithm.

CopelandConfidenceBound

Implement the Copeland Confidence Bound (CCB) algorithm.

CwRmed

Implement the CW-RMED algorithm.

DoubleThompsonSampling

Implementation of the Double Thompson Sampling algorithm.

DoubleThompsonSamplingPlus

Implementation of the extended version of D-TS.

VerificationBasedCondorcet

Condorcet variant of the verification based algorithm.

InterleavedFiltering

Implements the Interleaved Filtering algorithm.

KLDivergenceBasedPAC

Implement the KL-divergence based PAC algorithm.

KnockoutTournament

Implementation of the knockout tournament algorithm.

MallowsMPI

Implementation of the Mallows Most Preferred Item algorithm.

MallowsMPR

Implementation of Mallows Most Probable Ranking Algorithm.

MergeRUCB

Implementation of the Merge Relative Upper Confidence Bound algorithm.

Multisort

Implements the Multisort algorithm.

OptMax

Implement the OptMax algorithm.

PlackettLuceAMPR

Implementation of the Plackett-Luce Approximate Most Probable Ranking algorithm, which computes a ranking over the arms.

PlackettLucePACItem

Implementation of the Plackett-Luce PAC-Item algorithm.

RelativeConfidenceSampling

Implementation of the Relative Confidence Sampling algorithm.

RelativeUCB

Implementation of the Relative Upper Confidence Bound algorithm.

Rmed1

Implementation of Relative Minimum Empirical Divergence 1 algorithm.

Rmed2

Implementation of Relative Minimum Empirical Divergence 2 algorithm.

Rmed2FH

Implementation of Relative Minimum Empirical Divergence 2 Fixed Horizon algorithm.

Savage

Determine the PAC-best arm with the SAVAGE algorithm.

ScalableCopelandBandits

An implementation of the Scalable Copeland Bandits (SCB) algorithm.

SequentialElimination

Implement the Sequential Elimination algorithm.

SingleEliminationTop1Select

The Top-1 Selection part of Single-Elimination Tournament.

SingleEliminationTopKSorting

Implements the top-k sorting algorithm in the Single-Elimination Tournament.

SuccessiveElimination

Implements the Successive Elimination with Comparison Sparsity algorithm.

WinnerStaysStrongRegret

Implements the strong regret version of the Winner Stays algorithm.

WinnerStaysWeakRegret

Implements the weak regret version of the Winner Stays algorithm.

class Algorithm(feedback_mechanism: duelpy.feedback.FeedbackMechanism, time_horizon: Optional[int])

Parent class of all the implemented PB-MAB algorithms.

Parameters
  • feedback_mechanism – An object that describes the environment.

  • time_horizon – The number of duels the algorithm should perform. If a time horizon is given, the algorithm should perform exactly that many duels. May be None, in which case the algorithm will execute until an algorithm-specific termination condition is reached.

wrapped_feedback

The feedback_mechanism parameter with an added decorator. This feedback mechanism will raise an exception if a time horizon is given and a duel would exceed it. The exception is caught in the run function.

time_horizon

The number of duels the algorithm should perform. If a time horizon is given, the algorithm should perform exactly that many duels. May be None, in which case the algorithm will execute until an algorithm-specific termination condition is reached.

abstract step(self) → None

Run one step of the algorithm.

This corresponds to a logical step of the algorithm and may perform multiple comparisons. What exactly a “logical step” is depends on the algorithm.

The feedback mechanism is wrapped with a BudgetedFeedbackMechanism; conducting duels within a step may therefore raise an exception if a time horizon is specified.

If you want to run an algorithm you should use the run function instead.

Raises

TimeBudgetExceededException – If the step has been terminated early due to the time horizon. The exception class is local to the object. Different instances raise different exceptions. The exception class for this algorithm can be accessed through the exception_class attribute of the wrapped feedback mechanism.

is_finished(self) → bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self) → None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.
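
The interaction of run, step and is_finished can be illustrated with a small self-contained sketch. It does not use the actual duelpy classes; the exception class shares its name with the one documented above, but everything here is illustrative only.

# Hypothetical sketch of the run/step/is_finished contract described above.
# It does not use the real duelpy classes; names are illustrative only.

class TimeBudgetExceededException(Exception):
    """Stand-in for the budget exception mentioned above."""


class SketchAlgorithm:
    def __init__(self, time_horizon=None):
        self.time_horizon = time_horizon
        self.duels_conducted = 0

    def step(self):
        # One "logical step"; it may conduct several duels and may raise
        # TimeBudgetExceededException if the time horizon would be exceeded.
        if self.time_horizon is not None and self.duels_conducted >= self.time_horizon:
            raise TimeBudgetExceededException()
        self.duels_conducted += 1

    def is_finished(self):
        # Either the time horizon is exhausted or an algorithm-specific
        # termination criterion is met (here: a fixed number of duels).
        if self.time_horizon is not None:
            return self.duels_conducted >= self.time_horizon
        return self.duels_conducted >= 10

    def run(self):
        # run() repeatedly calls step() and catches the budget exception,
        # mirroring the behavior described in the documentation above.
        while not self.is_finished():
            try:
                self.step()
            except TimeBudgetExceededException:
                break


SketchAlgorithm(time_horizon=5).run()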

class ApproximateProbability(feedback_mechanism: duelpy.feedback.FeedbackMechanism, order_arms: List[int], epsilon: float = 0.05, time_horizon: Optional[int] = None)

Bases: duelpy.algorithms.interfaces.PreferenceMatrixProducer

Implementation of the Approximate probability algorithm.

The goal is to approximate the pairwise preference matrix between all arms.

The algorithm assumes a total order over the existing arms and that strong stochastic transitivity and stochastic triangle inequality hold. Additionally, a \(\frac{\epsilon}{8}\)-approximate ranking over the arms has to be provided.

The bound on the expected regret is given as \(\mathcal{O}\left(\frac{N\min\left\{N,\frac{1}{\epsilon}\right\}}{\epsilon^2}\right)\), where \(N\) is the number of arms and \(\epsilon\) is the targeted estimation accuracy.

The approximate probability algorithm is based on Algorithm 5 in [FJO+18]. It’s an (\(\epsilon, \delta\))-PAC algorithm with \(\delta = \frac{1}{N^2}\) where \(N\) is the number of arms.

The algorithm takes an ordered set of arms and approximates all pairwise probabilities to an accuracy of \(\epsilon\). Note that in this implementation a ranking is defined as ordered from best to worst, whereas in [FJO+18], this is reversed. Probabilities are calculated starting with the worst arm against all others and then iterating down the ranking order. The result is guaranteed to be consistent with the ranking.
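
As a toy illustration of the estimation step (independent of the duelpy API, with an arbitrary sample size), a single pairwise win probability can be estimated by repeated simulated duels:

import numpy as np

# Toy sketch of estimating one pairwise win probability by repeated duels, as
# the algorithm above does for every pair of arms. The number of duels is
# illustrative only; the actual algorithm derives it from epsilon.

def estimate_pairwise_probability(true_probability, num_duels, random_state):
    wins = random_state.binomial(n=num_duels, p=true_probability)
    return wins / num_duels

random_state = np.random.RandomState(42)
print(estimate_pairwise_probability(0.9, num_duels=400, random_state=random_state))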

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • epsilon – The targeted estimation accuracy. Corresponds to \(\epsilon\) in [FJO+18]. Default value is 0.05, which has been used in the experiments in [FJO+18].

  • order_arms – An \(\frac{\epsilon}{8}\)-approximate ranking over the arms, ordered from best to worst.

wrapped_feedback

The feedback_mechanism parameter with an added decorator. This feedback mechanism will raise an exception if a time horizon is given and a duel would exceed it. The exception is caught in the run function.

tournament_arms

The arms that are still in the tournament.

estimate_pairwise_probability
epsilon
comparison_arm

The index of the arm currently being compared against the others.

order_arms

Examples

Define a preference-based multi-armed bandit problem through a preference matrix:

>>> from duelpy.feedback import MatrixFeedback
>>> preference_matrix = np.array([
...     [0.5, 0.9, 0.9],
...     [0.1, 0.5, 0.7],
...     [0.1, 0.3, 0.5]
... ])
>>> feedback_mechanism = MatrixFeedback(preference_matrix=preference_matrix, random_state=np.random.RandomState(100))
>>> test_object = ApproximateProbability(feedback_mechanism, epsilon=0.05, order_arms=[0, 1, 2])
>>> test_object.run()
>>> test_object.get_preference_matrix()
array([[0.5, 0.9, 0.9],
...    [0.1, 0.5, 0.7],
...    [0.1, 0.3, 0.5]])
estimate_probabilities_against_worst_arm(self) → None

Run one step of comparison.

The last-ranked arm is dueled repeatedly against each of the other arms, determining their preference probabilities.

estimate_pairwise_probabilities(self) → None

Run the second step of comparison.

It compares arm \(i\) and arm \(j\) multiple times and estimates the pairwise probability.

duel_repeatedly(self, arm_i: int, arm_j: int) → float

Determine the preferred arm by repeated comparison.

It counts the number of times arm \(i\) won against arm \(j\) and returns the estimated pairwise probability.

step(self) → None

Take multiple samples per step in the algorithm.

is_finished(self) → bool

Determine if the algorithm is finished.

The algorithm terminates once the comparison arm index exceeds the number of tournament arms.

get_preference_matrix(self) → Optional[duelpy.stats.preference_matrix.PreferenceMatrix]

Return the computed preference matrix if it is ready.

Returns

The estimated pairwise preference matrix or None if the result is not ready.

Return type

Optional[PreferenceMatrix]

run(self) → None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class BeatTheMeanBandit(feedback_mechanism: duelpy.feedback.FeedbackMechanism, time_horizon: int, random_state: numpy.random.RandomState = None, gamma: float = 1.0)

Bases: duelpy.algorithms.interfaces.CondorcetProducer, duelpy.algorithms.interfaces.PacAlgorithm

Implements the Beat the Mean Bandit algorithm.

The goal of this algorithm is to find the Condorcet winner.

It is assumed that a total order over the arms exists. Additionally, relaxed stochastic transitivity and the stochastic triangle inequality are assumed.

The online version (if a time horizon is supplied) has a high probability cumulative regret bound of \(\mathcal{O}\left(\frac{\gamma^7 N}{\epsilon_\ast} \log T\right)\). The \(\gamma\) is part of the \(\gamma\)-relaxed stochastic transitivity assumption. \(N\) is the number of arms. \(\epsilon_\ast\) is the winning probability of the best arm against the second best arm minus \(0.5\).

This is an explore-then-exploit algorithm. The Beat the Mean algorithm, as described in [YJ11], proceeds in a sequence of rounds and maintains a working set of active arms during each round. For each active arm, an empirical estimate is maintained of how likely it is to beat the mean bandit of the working set. In each iteration, an arm with the fewest recorded comparisons is selected for comparison. Exploration terminates when only one active arm remains or when the time horizon is reached; the algorithm then enters an exploit phase in which the best arm is chosen repeatedly until \(T\) total comparisons have been made. If a time horizon is given, this algorithm matches the online variant in [YJ11].
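
The following is a condensed, illustrative sketch of one exploration step as described above. It is not the duelpy implementation; the duel function and the loop are stand-ins that simulate comparisons from a known preference matrix.

import numpy as np

# Illustrative sketch of one Beat-the-Mean exploration step as described
# above. This is not the duelpy implementation; duel() is a stand-in that
# simulates a comparison from a known preference matrix.

preference_matrix = np.array([
    [0.5, 0.1, 0.1],
    [0.9, 0.5, 0.3],
    [0.9, 0.7, 0.5],
])
random_state = np.random.RandomState(42)

def duel(arm, opponent):
    # True if `arm` wins the simulated comparison.
    return random_state.random_sample() < preference_matrix[arm, opponent]

working_set = [0, 1, 2]
wins = {arm: 0 for arm in working_set}
comparisons = {arm: 0 for arm in working_set}

def explore_step():
    # Select an arm with the fewest recorded comparisons ...
    arm = min(working_set, key=lambda a: comparisons[a])
    # ... and duel it against a randomly chosen arm from the working set,
    # which approximates a comparison against the "mean bandit".
    opponent = random_state.choice([a for a in working_set if a != arm])
    wins[arm] += duel(arm, opponent)
    comparisons[arm] += 1

for _ in range(30):
    explore_step()

# Empirical estimate of how likely each arm is to beat the mean of the set.
print({arm: wins[arm] / comparisons[arm] for arm in working_set})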

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • time_horizon – The total number of rounds.

  • random_state – A numpy random state. Defaults to an unseeded state when not specified.

  • gamma – The \(\gamma\)-RST (corresponds to \(\gamma\) in [YJ11]) that the algorithm should assume for the given problem setting. The value must be greater than \(0\). A higher value of \(\gamma\) corresponds to a stronger assumption, i.e., \(\gamma = 1\) corresponds to SST. In theory it is not possible to assume more than a gamma of \(1\), but in practice you can still specify higher values. This will lead to tighter confidence intervals and possibly better results, but the theoretical guarantees do not hold in that case.

wrapped_feedback

The feedback_mechanism parameter with an added decorator. This feedback mechanism will raise an exception if a time horizon is given and a duel would exceed it. The exception is caught in the run function.

comparison_history

A ComparisonHistory object which stores the history of the comparisons between the arms.

random_state
time_horizon

Examples

Define a preference-based multi-armed bandit problem through a preference matrix:

>>> from duelpy.feedback import MatrixFeedback
>>> from duelpy.stats.metrics import AverageRegret
>>> from duelpy.stats.preference_matrix import PreferenceMatrix
>>> from duelpy.util.feedback_decorators import MetricKeepingFeedbackMechanism
>>> preference_matrix = PreferenceMatrix(np.array([
...     [0.5, 0.1, 0.1],
...     [0.9, 0.5, 0.3],
...     [0.9, 0.7, 0.5],
... ]))
>>> random_state = np.random.RandomState(43)
>>> feedback_mechanism = MetricKeepingFeedbackMechanism(
...     MatrixFeedback(preference_matrix, random_state=random_state),
...     metrics={"average_regret": AverageRegret(preference_matrix)}
... )
>>> time_horizon = 10000  # time horizon greater than or equal to number of arms.
>>> btm = BeatTheMeanBandit(feedback_mechanism=feedback_mechanism, time_horizon=time_horizon, random_state=random_state, gamma=1.0)
>>> btm.run()
>>> best_arm = btm.get_condorcet_winner()
>>> best_arm
2
>>> cumul_regret = np.sum(feedback_mechanism.results["average_regret"])
>>> np.round(cumul_regret, 2)
903.1
explore(self) → None

Run one step of exploration.

exploration_finished(self) → bool

Determine whether the exploration phase is finished.

Returns

Whether exploration is finished.

Return type

bool

remove_from_working_set(self, worst_arm: int) → None

Remove the worst arm from the working set and update the comparison statistics.

Parameters

worst_arm – The index of the empirically worst arm.

get_condorcet_winner(self) → int

Return the arm with highest empirical estimate.

Returns

The arm with the highest empirical estimate.

Return type

int

exploit(self) → None

Run one step of exploitation.

abstract step(self) → None

Run one step of the algorithm.

This corresponds to a logical step of the algorithm and may perform multiple comparisons. What exactly a “logical step” is depends on the algorithm.

The feedback mechanism is wrapped with a BudgetedFeedbackMechanism; conducting duels within a step may therefore raise an exception if a time horizon is specified.

If you want to run an algorithm you should use the run function instead.

Raises

TimeBudgetExceededException – If the step has been terminated early due to the time horizon. The exception class is local to the object. Different instances raise different exceptions. The exception class for this algorithm can be accessed through the exception_class attribute of the wrapped feedback mechanism.

is_finished(self) → bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self) → None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class BeatTheMeanBanditPAC(feedback_mechanism: duelpy.feedback.FeedbackMechanism, time_horizon: Optional[int] = None, random_state: numpy.random.RandomState = None, gamma: float = 1.0, epsilon: float = 0.01, failure_probability: float = 0.1)

Bases: duelpy.algorithms.beat_the_mean.BeatTheMeanBandit

The PAC variant of the Beat the Mean Bandit algorithm.

The goal of this algorithm is to find the Condorcet winner.

It is assumed that a total order over the arms exists. Additionally \(\gamma\)-RST and the STI are assumed.

In the PAC setting (no time horizon given), the sample complexity is bounded by \(\mathcal{O}\left(\frac{N \gamma^6}{\epsilon^2} \log\frac{Nc}{\delta}\right)\). The constant \(c\) is given as \(\left\lceil \frac{36}{\gamma^6\epsilon^2}\log \frac{N}{\delta}\right\rceil\).

The PAC variant of the Beat the Mean Bandit algorithm takes an ‘explore then exploit’ approach. In the PAC setting, the exploration termination conditions differ from those of the online setting. There are two termination cases for PAC exploration: the first is when the active set has been reduced to a single bandit; the second is when the number of comparisons recorded for each remaining bandit is at least opt_n (corresponds to \(N'\) in Section 3.1.2 in [BFHMP18]). The time horizon is not used in the Beat-the-Mean PAC setting (i.e., time_horizon is set to None); it is only used in the online setting.

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment. This parameter has been taken from the parent class.

  • random_state – A numpy random state. Defaults to an unseeded state when not specified.

  • gamma – The \(\gamma\)-RST (corresponds to \(\gamma\) in [YJ11]) that the algorithm should assume for the given problem setting. The value must be greater than \(0\). A higher value of \(\gamma\) corresponds to a stronger assumption, i.e., \(\gamma = 1\) corresponds to SST. In theory it is not possible to assume more than a gamma of \(1\), but in practice you can still specify higher values. This will lead to tighter confidence intervals and possibly better results, but the theoretical guarantees do not hold in that case. This parameter has been taken from the parent class.

  • epsilon\(\epsilon\) in (\(\epsilon, \delta\)) PAC algorithms, given by the user.

  • failure_probability – Allowed failure probability (corresponds to \(\delta\) in Algorithm 2 in [YJ11]), i.e. probability that the actual value lies outside of the computed confidence interval. Derived from the Hoeffding bound.

  • time_horizon – Determines how many duels are executed in the online setting.

opt_n

Corresponds to \(N'\) in section 3.1.2 in [BFHMP18].

comparison_history

A ComparisonHistory object which stores the history of the comparisons between the arms.

random_state

Examples

Define a preference-based multi-armed bandit problem through a preference matrix:

>>> from duelpy.feedback import MatrixFeedback
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1],
...     [0.9, 0.5, 0.3],
...     [0.9, 0.7, 0.5],
... ])
>>> random_state = np.random.RandomState(43)
>>> feedback_mechanism = MatrixFeedback(preference_matrix=preference_matrix, random_state=random_state)
>>> btm = BeatTheMeanBanditPAC(feedback_mechanism=feedback_mechanism, random_state=random_state, epsilon=0.001, gamma=0.3)
>>> btm.run()
>>> comparisons = btm.wrapped_feedback.duels_conducted
>>> best_arm = btm.get_condorcet_winner()
>>> best_arm, comparisons
(2, 158)
exploration_finished(self) → bool

Determine whether the exploration phase is finished.

Returns

Whether the algorithm is finished.

Return type

bool

explore(self) → None

Run one step of exploration.

remove_from_working_set(self, worst_arm: int) → None

Remove the worst arm from the working set and update the comparison statistics.

Parameters

worst_arm – The index of the empirically worst arm.

get_condorcet_winner(self) → int

Return the arm with highest empirical estimate.

Returns

The arm with the highest empirical estimate.

Return type

int

exploit(self) → None

Run one step of exploitation.

abstract step(self) → None

Run one step of the algorithm.

This corresponds to a logical step of the algorithm and may perform multiple comparisons. What exactly a “logical step” is depends on the algorithm.

The feedback mechanism is wrapped with a BudgetedFeedbackMechanism; conducting duels within a step may therefore raise an exception if a time horizon is specified.

If you want to run an algorithm you should use the run function instead.

Raises

TimeBudgetExceededException – If the step has been terminated early due to the time horizon. The exception class is local to the object. Different instances raise different exceptions. The exception class for this algorithm can be accessed through the exception_class attribute of the wrapped feedback mechanism.

is_finished(self) → bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self) → None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class BinarySearchRanking(feedback_mechanism: duelpy.feedback.FeedbackMechanism, time_horizon: Optional[int] = None, epsilon: float = 0.5, random_state: numpy.random.RandomState = None)

Bases: duelpy.algorithms.interfaces.CopelandRankingProducer, duelpy.algorithms.interfaces.PacAlgorithm

Implementation of the Binary Search Ranking algorithm.

This algorithm finds an \(\epsilon\)-Copeland ranking with probability \(1-\frac{1}{N}\), where \(N\) is the number of arms.

It is assumed that strong stochastic transitivity and stochastic triangle inequality hold.

The sample complexity is bounded by \(\mathcal{O}\left(\frac{N \log N}{\epsilon^2}\right)\).

The algorithm divides arms into ordered bins, sorts these and finally merges them to get a ranking. To create the bins, first \(a = \frac{N}{(\log N)^3}\) anchor arms are selected. These randomly selected anchor arms are ranked using a Rank-3 algorithm, which in our case is Mergesort. Arm comparisons are repeated until the preference is known with sufficient confidence. A bin is created between consecutive anchor arms, two additional bins are added before the first and after the last arm. The remaining arms are then sorted into the bins. The arms in the bins are divided into arms close to the anchor and the others. These other arms are between two anchor arms and not close enough to either one. They are sorted using the Rank-3 algorithm. The last step is merging the anchor arms, the arms close to each anchor arm and the arms between anchor arms in the correct order.

The algorithm achieves a close-to-optimal sample complexity in theory, but there are large hidden constant factors, as can be seen in the example. The bias term \(\epsilon\) influences the sample complexity. Experiments with this implementation show a degradation of the probability of a correct result for values greater than 0.5. For smaller values the sample complexity increases and the failure probability remains stable at \(\frac{1}{N}\).

Refer to the paper [FOPS17] for more details.

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • time_horizon – Optional, the number of arm comparisons to execute.

  • epsilon – The bias term referred to as \(\epsilon\).

  • random_state – Used for random choices in the algorithm.

random_state

Examples

>>> from duelpy.feedback import MatrixFeedback
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1, 0.1, 0.1],
...     [0.9, 0.5, 0.3, 0.2, 0.2],
...     [0.9, 0.7, 0.5, 0.8, 0.9],
...     [0.9, 0.8, 0.2, 0.5, 0.2],
...     [0.9, 0.8, 0.1, 0.8, 0.5]
... ])
>>> random_state = np.random.RandomState(20)
>>> feedback_mechanism = MatrixFeedback(preference_matrix=preference_matrix, random_state=random_state)
>>> rank = BinarySearchRanking(feedback_mechanism, random_state=random_state)
>>> rank.run()
>>> rank.get_ranking()
[2, 4, 3, 1, 0]
>>> rank.wrapped_feedback.duels_conducted
1786021
class Node(lower_bound: int, upper_bound: int, parent: Optional[duelpy.algorithms.binary_search_ranking.BinarySearchRanking] = None)

Utility for implementation of Binary Search Tree.

These nodes represent intervals on an array. The interval is split in the middle and the child nodes are assigned the left and right subinterval, respectively. If the interval contains only one value, no child nodes are created. Refer to Algorithm 9 in [FOPS17] for more details.

low

Points to the left boundary of the interval that the node represents.

high

Points to the right boundary of the interval that the node represents.

mid

Indicates where the interval is split for the child nodes.

parent

Points to the parent node in the tree.

left

Points to the left child node in the tree.

right

Points to the right child node in the tree.

get_children(self) → Tuple

Return the left and right child of current node.

get_parent(self) → Optional[duelpy.algorithms.binary_search_ranking.BinarySearchRanking]

Return the parent of the current node.

loser_arm_dummy
winner_arm_dummy
exploration_finished(self) → bool

Return True if the ranking of arms has been created.

get_ranking(self) → Optional[List[int]]

Return the ranked list of arms as decided by the algorithm.

explore(self) → None

Find the ranked set of arms.

Implements Algorithm 4 (Binary Search Ranking) of [FOPS17].

exploit(self) → None

Run one step of exploitation.

abstract step(self) → None

Run one step of the algorithm.

This corresponds to a logical step of the algorithm and may perform multiple comparisons. What exactly a “logical step” is depends on the algorithm.

The feedback mechanism is wrapped with a BudgetedFeedbackMechanism; conducting duels within a step may therefore raise an exception if a time horizon is specified.

If you want to run an algorithm you should use the run function instead.

Raises

TimeBudgetExceededException – If the step has been terminated early due to the time horizon. The exception class is local to the object. Different instances raise different exceptions. The exception class for this algorithm can be accessed through the exception_class attribute of the wrapped feedback mechanism.

is_finished(self) → bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self) → None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class BordaRanking(feedback_mechanism: duelpy.feedback.FeedbackMechanism, time_horizon: Optional[int] = None, random_state: Optional[numpy.random.RandomState] = None, epsilon: float = 0.05, failure_probability: float = 0.1)

Bases: duelpy.algorithms.interfaces.BordaRankingProducer, duelpy.algorithms.interfaces.PacAlgorithm

Implementation of the Borda ranking algorithm.

The goal is to find an \(\epsilon\)-Borda ranking.

It is assumed that strong stochastic transitivity holds.

The bound on the sample complexity is given as \(\mathcal{O}\left(\frac{2N}{\epsilon^2}\log \frac{2N}{\delta}\right)\).

The Borda ranking algorithm is based on Algorithm 9 in [FHO+17]. It is an (\(\epsilon\), \(\delta\))-PAC algorithm.

The algorithm first computes the estimated Borda score for all the arms in the set with an approximation error of \(\frac{\epsilon}{2}\) and confidence at least \(1 − \frac{\delta}{N}\); the values of these estimated Borda scores are then used to rank the arms.
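
The following toy sketch (not the duelpy implementation) illustrates the idea of estimating Borda scores by sampling duels against uniformly random opponents and ranking the arms by the estimates; the sample size is illustrative rather than the one derived from \(\epsilon\) and \(\delta\).

import numpy as np

# Minimal sketch of Borda score estimation as described above (not the duelpy
# implementation). The Borda score of an arm is its probability of winning
# against a uniformly random opponent; here it is estimated by sampling duels
# from a known preference matrix.

preference_matrix = np.array([
    [0.5, 0.1, 0.1],
    [0.9, 0.5, 0.3],
    [0.9, 0.7, 0.5],
])
arms = list(range(len(preference_matrix)))
random_state = np.random.RandomState(0)
samples_per_arm = 200  # illustrative; the algorithm derives this from epsilon and delta

borda_estimates = []
for arm in arms:
    wins = 0
    for _ in range(samples_per_arm):
        opponent = random_state.choice([a for a in arms if a != arm])
        wins += random_state.random_sample() < preference_matrix[arm, opponent]
    borda_estimates.append(wins / samples_per_arm)

# Rank the arms from best to worst by their estimated Borda scores.
ranking = sorted(arms, key=lambda a: borda_estimates[a], reverse=True)
print(ranking)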

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • time_horizon – The number of steps that the algorithm is supposed to be run.

  • random_state – A numpy random state. Defaults to an unseeded state when not specified.

  • epsilon – The optimality of the winning arm. Corresponds to \(\epsilon\) in [FHO+17]. Default value is 0.05 which has been used in the experiments in [FHO+17].

  • failure_probability – The probability that the result is not correct. Corresponds to \(\delta\) in [FHO+17]. Default value is 0.1 which has been used in the experiments in [FHO+17].

random_state
epsilon
failure_probability

Examples

Define a preference-based multi-armed bandit problem through a preference matrix:

>>> from duelpy.feedback import MatrixFeedback
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1],
...     [0.9, 0.5, 0.3],
...     [0.9, 0.7, 0.5],
... ])
>>> random_state=np.random.RandomState(20)
>>> feedback_mechanism = MatrixFeedback(preference_matrix, random_state=random_state)
>>> test_object = BordaRanking(feedback_mechanism, epsilon=0.05, failure_probability=0.1)
>>> test_object.run()
>>> test_object.get_ranking()
[2, 1, 0]
>>> test_object.wrapped_feedback.duels_conducted
9828
exploration_finished(self) → bool

Determine if the actual algorithm has terminated.

The algorithm terminates if all Borda scores are estimated.

Returns

Whether the algorithm is finished.

Return type

bool

explore(self) → None

Run one step of exploration.

get_ranking(self) → Optional[List[int]]

Return the estimated Borda ranking.

Returns

The estimated ranking.

Return type

Optional[List]

exploit(self) → None

Run one step of exploitation.

abstract step(self) → None

Run one step of the algorithm.

This corresponds to a logical step of the algorithm and may perform multiple comparisons. What exactly a “logical step” is depends on the algorithm.

The feedback mechanism is wrapped with a BudgetedFeedbackMechanism; conducting duels within a step may therefore raise an exception if a time horizon is specified.

If you want to run an algorithm you should use the run function instead.

Raises

TimeBudgetExceededException – If the step has been terminated early due to the time horizon. The exception class is local to the object. Different instances raise different exceptions. The exception class for this algorithm can be accessed through the exception_class attribute of the wrapped feedback mechanism.

is_finished(self) → bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self) → None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class CopelandConfidenceBound(feedback_mechanism: duelpy.feedback.feedback_mechanism.FeedbackMechanism, time_horizon: int, random_state: Optional[numpy.random.RandomState] = None, exploratory_constant: float = 0.501)

Bases: duelpy.algorithms.interfaces.SingleCopelandProducer

Implement the Copeland Confidence Bound (CCB) algorithm.

The goal of the algorithm is to minimize the Copeland regret.

It is assumed that there are no ties between arms, i.e. the probability of any arm winning against another is never \(\frac{1}{2}\).

The bound on the expected regret is given as \(\mathcal{O}\left(\frac{N^2+(C+L_C)N \ln(T)}{\Delta^2}\right)\). \(N\) is the number of arms, \(T\) is the time horizon. \(C\) is the number of Copeland winners and \(L_C\) the number of arms a Copeland winner will lose against in expectation. For any Copeland winner and any non-Copeland winner, \(\Delta\) is the smallest absolute distance to \(\frac{1}{2}\) in the probability of these arms dueling against each other. Note that the paper uses a different definition of Copeland regret; in this library the value is half of that in the paper.

The Copeland Confidence Bound algorithm is based on [ZKWdR15]. The performance of CCB degrades at about \(136\) arms in experiments, for details refer to [ZKWdR15]. It proceeds by conducting duels which are most informative about the precedence of the participating arms in terms of their Copeland scores. The confirmed non-Copeland winners are eliminated based on the results of the prior duels. After conducting sufficient rounds, the set of possible Copeland winners will converge which will result in minimal increment in Copeland regret and thus the goal would be achieved.

CCB runs continuously and in each time step it follows these steps:

1. Optimistic and pessimistic estimates (namely U and L, respectively) of the preference matrix are calculated.

2. A Copeland winner candidate \(a_c\) is chosen using the optimistic estimate U such that it has a chance of being a true Copeland winner. \(a_c\) is chosen from a set of top scorers from U, especially those which are present in a list \(B_t\). \(B_t\) contains the arms which have a higher chance of being a Copeland winner.

3. A suitable opponent \(a_d\) is chosen using the pessimistic estimate L such that it can challenge the hypothesis that \(a_c\) is the true Copeland winner. By definition, U and L define a confidence interval around the original preference matrix. Using these confidence intervals, \(a_d\) is chosen such that a duel between \(a_c\) and \(a_d\) provides maximum information about their precedence over each other in terms of their Copeland scores. The historically strong opponents of \(a_c\) are maintained in a shortlist \(B_t^i\). The arms in this list are preferred while selecting an opponent. Such a list is maintained for every participating arm. The \(B_t^i\) lists for non-Copeland winners will contain a large number of opponents and thus help in their quick elimination from the list of possible winners.

4. Finally, a duel is conducted between \(a_c\) and \(a_d\) and the result is recorded.
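
A condensed sketch of one such round is given below. It is not the duelpy implementation and simplifies the candidate and opponent selection, but it follows the four steps above using win counts and a confidence radius.

import numpy as np

# Condensed sketch of one CCB round following steps 1-4 above. It is not the
# duelpy implementation and simplifies the candidate and opponent selection;
# wins[i, j] counts how often arm i has beaten arm j so far.

def ccb_round(wins, time_step, exploratory_constant=0.501):
    comparisons = wins + wins.T
    with np.errstate(divide="ignore", invalid="ignore"):
        estimate = np.where(comparisons > 0, wins / comparisons, 0.5)
    # Step 1: optimistic (U) and pessimistic (L) estimates of the preference matrix.
    radius = np.sqrt(
        exploratory_constant * np.log(max(time_step, 2)) / np.maximum(comparisons, 1)
    )
    upper = np.minimum(estimate + radius, 1.0)
    lower = np.maximum(estimate - radius, 0.0)
    # Step 2: candidate with the highest optimistic Copeland score.
    optimistic_scores = (upper > 0.5).sum(axis=1) - 1  # exclude the self-comparison
    candidate = int(np.argmax(optimistic_scores))
    # Step 3: opponent whose duel against the candidate is most informative,
    # i.e. an arm that may still beat the candidate (lower bound below 1/2)
    # and whose confidence interval against the candidate is widest.
    informative = np.where(
        lower[candidate, :] < 0.5, upper[candidate, :] - lower[candidate, :], -1.0
    )
    informative[candidate] = -1.0
    opponent = int(np.argmax(informative))
    # Step 4: the pair (candidate, opponent) is dueled and the outcome recorded.
    return candidate, opponent

print(ccb_round(np.zeros((3, 3)), time_step=1))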

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • time_horizon – Number of time steps to execute for.

  • random_state – A numpy random state. Defaults to an unseeded state when not specified.

  • exploratory_constant – A parameter which is used in calculating the upper confidence bounds. The confidence radius grows proportional to the square root of this value. A higher upper confidence bound results in more exploration. Corresponds to \(\alpha\) in [ZKWdR15]. The value of exploratory_constant must be greater than \(0.5\). The default value is 0.501 which has been used in the experiments for calculating the confidence bounds in [ZWDRM14].

copeland_winner_candidates

The arms which have a higher possibility of becoming a Copeland winner. Corresponds to \(B_t\) in [ZKWdR15].

max_allowed_losses

Maximum allowed losses for a Copeland winner. Corresponds to \(L_C\) in [ZKWdR15].

respective_opponents

A dictionary which has every arm as a key. Each value in this dictionary is a list. This list consists of the arms which are strong opponents to the arm present as the key. Corresponds to \(B_t^i\) in [ZKWdR15].

time_step

Number of rounds the algorithm has executed. Corresponds to \(t\) in [ZKWdR15].

Examples

Define a preference-based multi-armed bandit problem through a preference matrix:

>>> from duelpy.feedback import MatrixFeedback
>>> from duelpy.stats.metrics import AverageCopelandRegret
>>> from duelpy.util.feedback_decorators import MetricKeepingFeedbackMechanism
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1],
...     [0.9, 0.5, 0.3],
...     [0.9, 0.7, 0.5]
... ])
>>> arms = list(range(len(preference_matrix)))
>>> random_state = np.random.RandomState(20)
>>> feedback_mechanism = MetricKeepingFeedbackMechanism(
...     MatrixFeedback(preference_matrix, arms, random_state=random_state),
...     metrics={"copeland_regret": AverageCopelandRegret(preference_matrix)}
... )
>>> ccb = CopelandConfidenceBound(feedback_mechanism=feedback_mechanism, exploratory_constant=0.6, time_horizon=100, random_state=random_state)
>>> ccb.run()

The best arm in this case is the last arm (index 2)

>>> np.round(np.sum(feedback_mechanism.results["copeland_regret"]), 2)
47.25
>>> ccb.get_copeland_winner()
2
step(self) → None

Run one round of the algorithm.

get_copeland_winner(self) → Optional[int]

Determine a Copeland winner using CCB algorithm.

Returns

The first Copeland winner in order among all the estimated Copeland winners.

Return type

Optional[int]

exploit(self) → None

Run one step of exploitation.

is_finished(self) → bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self) → None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class CwRmed(feedback_mechanism: duelpy.feedback.feedback_mechanism.FeedbackMechanism, time_horizon: int, random_state: Optional[numpy.random.RandomState] = None, exploratory_constant: float = 3.0, regret_bound_constant: float = 0.01)

Bases: duelpy.algorithms.algorithm.Algorithm

Implement the CW-RMED algorithm.

The goal of the algorithm is to minimize the Copeland regret. It is assumed that there are no ties between arms, i.e. the probability of any arm winning against another is never \(\frac{1}{2}\).

The bound on the expected regret is given as \(\frac{2N + o(N)}{d_{KL}(\frac{1}{2} + \Delta, \frac{1}{2})}\). \(N\) is the number of arms, \(d_{KL}\) is the Bernoulli KL-divergence and \(\Delta\) is the smallest absolute distance to \(\frac{1}{2}\) among the pairwise win probabilities of the arms.

The CW-RMED algorithm is based on []. It uses Bernoulli KL-Divergence between the preference probabilities of arms and \(\frac{1}{2}\). Further, linear optimization is used on these arm pairs to select the optimal pairs for duels.
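
For reference, the Bernoulli KL-divergence used above is \(d_{KL}(p, q) = p \ln\frac{p}{q} + (1-p) \ln\frac{1-p}{1-q}\). A direct Python version for illustration:

import math

# Bernoulli KL-divergence d_KL(p, q) as used in the regret bound above.
# p and q are assumed to lie strictly between 0 and 1 (q in particular).
def bernoulli_kl(p: float, q: float) -> float:
    def term(a: float, b: float) -> float:
        return 0.0 if a == 0 else a * math.log(a / b)
    return term(p, q) + term(1 - p, 1 - q)

# Example: divergence between 1/2 + Delta and 1/2 for Delta = 0.2.
print(bernoulli_kl(0.7, 0.5))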

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • time_horizon – Number of time steps to execute for.

  • random_state – A numpy random state. Defaults to an unseeded state when not specified.

  • exploratory_constant – Corresponds to \(\alpha\) in []. Its value must be greater than or equal to 0. Default value is 3.0 which has been used in the experiments in [].

  • regret_bound_constant – Corresponds to \(\beta\) in []. Its value must be greater than or equal to 0. Default value is 0.01 which has been used in the experiments in [].

wrapped_feedback

The feedback_mechanism parameter with an added decorator. This feedback mechanism will raise an exception if a time horizon is given and a duel would exceed it. The exception is caught in the run function.

current_copeland_winner_pairs

Corresponds to \(L_{NC}\) in [].

dueling_pairs

Corresponds to \(L_C\) in [].

dueling_pairs_for_next_step

Corresponds to \(L_N\) in [].

remaining_pairs_for_dueling

Corresponds to \(L_R\) in [].

Raises

ValueError – Raised when the constants are given negative values.

Examples

Define a preference-based multi-armed bandit problem through a preference matrix:

>>> from duelpy.feedback import MatrixFeedback
>>> from duelpy.stats.metrics import AverageCopelandRegret
>>> from duelpy.util.feedback_decorators import MetricKeepingFeedbackMechanism
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1],
...     [0.9, 0.5, 0.3],
...     [0.9, 0.7, 0.5]
... ])
>>> arms = list(range(len(preference_matrix)))
>>> random_state = np.random.RandomState(20)
>>> feedback_mechanism = MetricKeepingFeedbackMechanism(
...     MatrixFeedback(preference_matrix, arms, random_state=random_state),
...     metrics={"copeland_regret": AverageCopelandRegret(preference_matrix)}
... )
>>> cwrmed = CwRmed(
...     feedback_mechanism=feedback_mechanism,
...     time_horizon=100,
...     random_state=random_state
... )
>>> cwrmed.run()
>>> np.round(np.sum(feedback_mechanism.results["copeland_regret"]), 2)
31.5
>>> cwrmed.wrapped_feedback.duels_conducted
100
step(self) → None

Run one round of the algorithm.

is_finished(self) → bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self) → None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class DoubleThompsonSampling(feedback_mechanism: duelpy.feedback.FeedbackMechanism, time_horizon: int, random_state: numpy.random.RandomState = None, exploratory_constant: float = 0.51)

Bases: duelpy.algorithms.interfaces.SingleCopelandProducer

Implementation of the Double Thompson Sampling algorithm.

The goal of this algorithm is to find the Copeland winner while incurring minimal average Copeland regret. If there exists a Condorcet winner, the Copeland winner is also the Condorcet winner.

No further assumptions about the arms are needed.

It uses the average Copeland regret. For the general Copeland winner setting, D-TS achieves \(\mathcal{O}(K^2 \log T)\), where \(K\) is the number of arms and \(T\) the time horizon. For the Condorcet setting and many practical Copeland settings, D-TS achieves \(\mathcal{O}(K \log T + K^2 \log \log T)\) using a back substitution argument.

The Double Thompson Sampling (D-TS) algorithm from [WL16] covers both the Condorcet and the general Copeland setting. D-TS uses a double sampling structure in which the first and the second candidate are selected according to independently drawn samples from the beta posterior distribution and then dueled. This double sampling structure is well suited to the nature of dueling bandits. Unlike RelativeConfidenceSampling, launching two independent rounds of sampling provides the opportunity to select the same arm in both rounds, which allows comparing a likely winner against itself and significantly reduces regret. Confidence bounds are used to eliminate arms that are unlikely to be the winner when selecting both the first and the second candidate, which avoids suboptimal comparisons. As a result, D-TS is more robust in practice.
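
The double sampling structure can be sketched as follows. This is an illustrative condensation, not the duelpy implementation, and it omits the confidence-bound elimination of unlikely arms.

import numpy as np

# Condensed sketch of the double sampling structure described above. This is
# not the duelpy implementation; it omits the confidence-bound elimination of
# unlikely arms. wins[i, j] counts how often arm i has beaten arm j so far.

def dts_round(wins, random_state):
    # First sampling round: draw a full preference matrix from the Beta
    # posteriors and pick the arm with the highest sampled Copeland score.
    theta_first = random_state.beta(wins + 1, wins.T + 1)
    np.fill_diagonal(theta_first, 0.5)
    first = int(np.argmax((theta_first > 0.5).sum(axis=1)))
    # Second, independent sampling round: draw only the preferences against
    # the first candidate and pick the strongest challenger. Because the
    # first candidate itself keeps the value 0.5, it can be selected again,
    # which allows comparing a likely winner against itself.
    theta_second = random_state.beta(wins[:, first] + 1, wins[first, :] + 1)
    theta_second[first] = 0.5
    second = int(np.argmax(theta_second))
    # The pair (first, second) is dueled and the outcome recorded.
    return first, second

print(dts_round(np.zeros((3, 3)), np.random.RandomState(0)))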

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • time_horizon – This states the number of comparisons to be done before the algorithm terminates.

  • random_state – Used for random choices in the algorithm.

  • exploratory_constant – Optional, the confidence radius grows proportional to the square root of this value. Corresponds to \(\alpha\) in [WL16]. The value of exploratory_constant must be greater than \(0.5\). Default value is 0.51.

wrapped_feedback

The feedback_mechanism parameter with an added decorator. This feedback mechanism will raise an exception if a time horizon is given and a duel would exceed it. The exception is caught in the run function.

exploratory_constant
random_state
time_horizon
preference_estimate

Stores estimates of arm preferences

Examples

Define a preference-based multi-armed bandit problem through a preference matrix:

>>> from duelpy.feedback import MatrixFeedback
>>> from duelpy.stats.metrics import AverageCopelandRegret
>>> from duelpy.util.feedback_decorators import MetricKeepingFeedbackMechanism
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1],
...     [0.9, 0.5, 0.3],
...     [0.9, 0.7, 0.5]
... ])
>>> random_state = np.random.RandomState(20)
>>> feedback_mechanism = MetricKeepingFeedbackMechanism(
...     MatrixFeedback(preference_matrix, random_state=random_state),
...     metrics={"copeland_regret": AverageCopelandRegret(preference_matrix)}
... )
>>> test_object = DoubleThompsonSampling(feedback_mechanism, random_state=random_state, time_horizon=100)
>>> test_object.run()
>>> test_object.get_copeland_winner()
2
>>> np.round(np.sum(feedback_mechanism.results["copeland_regret"]), 2)
17.5
get_copeland_winner(self) → Optional[int]

Compute single Copeland winner as per the estimated preference matrix.

Returns

Copeland winner from preference estimate.

Return type

Optional[int]

step(self) → None

Run one round of an algorithm.

exploit(self) → None

Run one step of exploitation.

is_finished(self) → bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self) → None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class DoubleThompsonSamplingPlus(feedback_mechanism: duelpy.feedback.FeedbackMechanism, time_horizon: int, random_state: numpy.random.RandomState = None, exploratory_constant: float = 0.51)

Bases: duelpy.algorithms.double_thompson_sampling.DoubleThompsonSampling

Implementation of the extended version of D-TS.

The goal of this algorithm is to find the Copeland winner while incurring minimal average Copeland regret. If there exists a Condorcet winner, the Copeland winner is also the Condorcet winner.

It is assumed that the Copeland winner exists.

It uses the average Copeland regret. For the general Copeland bandit, D-TS+ achieves \(\mathcal{O}(N^2 \log T)\). For the Condorcet dueling bandit and many practical Copeland dueling bandits, D-TS+ achieves \(\mathcal{O}(N \log T + N^2 \log \log T)\) using a back substitution argument. \(N\) is the number of arms.

As presented in [WL16], the D-TS+ algorithm only changes the tie-breaking criterion used when selecting the first candidate. During this selection, all potential champions are identified as potential Copeland winners based on an upper estimate of the preference probability. To remove unlikely winners, the preference probabilities between the arms are sampled from a beta distribution. For every potential arm, a one-vs-all regret based on the sampled preferences is then calculated using the KL divergence, and the arm with minimal one-vs-all regret is selected as the first candidate. The one-vs-all regret of a potential arm is defined as the sum, over all other arms, of the Copeland regret of the potential arm against that arm divided by the KL divergence between the sampled preference of the potential arm over the other arm and \(0.5\).

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • time_horizon – This states the number of comparisons to be done before the algorithm terminates.

  • random_state – Used for random choices in the algorithm.

  • exploratory_constant – Optional, the confidence radius grows proportional to the square root of this value. Corresponds to \(\alpha\) in [WL16]. The value of exploratory_constant must be greater than \(0.5\). Default value is 0.51.

wrapped_feedback

The feedback_mechanism parameter with an added decorator. This feedback mechanism will raise an exception if a time horizon is given and a duel would exceed it. The exception is caught in the run function.

exploratory_constant
random_state
time_horizon
preference_estimate

Stores estimates of arm preferences

Examples

Define a preference-based multi-armed bandit problem through a preference matrix:

>>> from duelpy.feedback import MatrixFeedback
>>> from duelpy.stats.metrics import AverageCopelandRegret
>>> from duelpy.util.feedback_decorators import MetricKeepingFeedbackMechanism
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1],
...     [0.9, 0.5, 0.3],
...     [0.9, 0.7, 0.5]
... ])
>>> random_state = np.random.RandomState(20)
>>> feedback_mechanism = MetricKeepingFeedbackMechanism(
...     MatrixFeedback(preference_matrix, random_state=random_state),
...     metrics={"copeland_regret": AverageCopelandRegret(preference_matrix)}
... )
>>> test_object = DoubleThompsonSamplingPlus(feedback_mechanism, exploratory_constant=0.51, random_state=random_state, time_horizon=100)
>>> test_object.run()
>>> test_object.get_copeland_winner()
2
>>> np.round(np.sum(feedback_mechanism.results["copeland_regret"]), 2)
52.5
get_copeland_winner(self) → Optional[int]

Compute single Copeland winner as per the estimated preference matrix.

Returns

Copeland winner from preference estimate.

Return type

Optional[int]

step(self) → None

Run one round of an algorithm.

exploit(self) → None

Run one step of exploitation.

is_finished(self) → bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self) → None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class VerificationBasedCondorcet(feedback_mechanism: duelpy.feedback.feedback_mechanism.FeedbackMechanism, time_horizon: Optional[int] = None, failure_probability: float = 0.1, explore_failure_probability: Optional[float] = None, random_state: Optional[numpy.random.RandomState] = None)

Bases: duelpy.algorithms.interfaces.CondorcetProducer, duelpy.algorithms.interfaces.PacAlgorithm

Condorcet variant of the verification based algorithm.

This algorithm finds the Condorcet winner with a given failure probability.

It assumes a Condorcet winner exists.

The sample complexity bound of the algorithm is too involved to state here; it is given as Theorem 5 in [Kar16].

The algorithm executes multiple rounds of exploration followed by verification. The exploration step finds a possible Condorcet winner, which the verification step tests. If the verification fails, the process is repeated. The two parameters trade off failure probability with sample complexity. A lower failure_probability (\(\delta\)) means the verification takes longer, whereas a higher explore_failure_probability (\(\kappa\)) means more explore-verify rounds may be executed.
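
The structure of these explore-verify rounds can be sketched schematically. The sketch only wires together the nested CondorcetExplorer and CondorcetVerifier helpers documented below via their documented methods; the factory arguments are hypothetical and this is not the actual duelpy control flow.

# Schematic sketch of the repeated explore-then-verify rounds described above.
# make_explorer and make_verifier are hypothetical factories; the method names
# correspond to the nested helper classes documented below.

def explore_and_verify(make_explorer, make_verifier):
    while True:
        explorer = make_explorer()                 # one exploration round
        explorer.run()
        candidate = explorer.get_best_arm()        # possible Condorcet winner
        verifier = make_verifier(candidate, explorer.get_adversaries())
        verifier.run()                             # try to verify the candidate
        if verifier.has_succeeded():
            return candidate                       # verified Condorcet winner
        # verification failed: start a new exploration-verification round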

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • time_horizon – Optional, the number of time steps to execute for. If not provided, the algorithm terminates once a Condorcet winner is found.

  • failure_probability – The probability of returning an incorrect result. Referred to as \(\delta\) in the paper.

  • explore_failure_probability – Determines the probability of failure in the explore step. Referred to as \(\kappa\) in the paper. If not set, the default \(\min\left\{\frac{1}{3},\log\left(\frac{1}{\delta}\right)\right\}\) is used, as described in Corollary 4 of the paper.

  • random_state – Optional, used for random tie breaking. If not provided, a random state is created.

failure_probability
explore_failure_probability
round

The current exploration-verification round index.

Examples

Define a preference-based multi-armed bandit problem through a preference matrix:

>>> from duelpy.feedback import MatrixFeedback
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1],
...     [0.9, 0.5, 0.3],
...     [0.9, 0.7, 0.5]
... ])
>>> arms = list(range(len(preference_matrix)))
>>> random_state = np.random.RandomState(20)
>>> feedback_mechanism = MatrixFeedback(preference_matrix, arms, random_state)
>>> vbc = VerificationBasedCondorcet(feedback_mechanism=feedback_mechanism, failure_probability=0.1, explore_failure_probability=0.5)
>>> vbc.run()

The best arm in this case is the last arm (index 2)

>>> vbc.wrapped_feedback.duels_conducted
3254
>>> vbc.get_condorcet_winner()
2
class CondorcetExplorer(feedback_mechanism: duelpy.feedback.feedback_mechanism.FeedbackMechanism, failure_probability: float, random_state: Optional[numpy.random.RandomState])

Bases: duelpy.algorithms.algorithm.Algorithm

Implementation of an explorer for the Condorcet winner setting.

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • failure_probability – The probability of returning an incorrect result.

  • random_state – Used for random tie breaking.

failure_probability
random_state
pair_is_active

Pairs of arms currently under investigation.

preference_estimate

Estimates for arm preferences.

reset(self) → None

Reset the exploration.

step(self) → None

Advance the algorithm by one step.

is_finished(self) → bool

Determine whether the exploration is completed.

get_best_arm(self) → Optional[int]

Return the best arm, if it has been determined.

Returns

The index of the best arm or None, if not finished.

Return type

Optional[int]

get_adversaries(self) → Optional[List[int]]

Return the adversary for each arm.

Returns

A list containing the adversary indices assigned to the arms, None if not finished.

Return type

Optional[List[int]]

run(self) → None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class CondorcetVerifier(feedback_mechanism: duelpy.feedback.feedback_mechanism.FeedbackMechanism, best_arm: int, adversaries: List[int], failure_probability: float)

Bases: duelpy.algorithms.algorithm.Algorithm

Implementation of a verifier for the Condorcet winner setting.

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • best_arm – The best arm as computed by the exploration step.

  • adversaries – An adversary arm for each arm (the adversary of the best_arm is ignored).

  • failure_probability – The probability of returning an incorrect result.

best_arm
adversaries
failure_probability
pair_is_active

Pairs of arms currently under investigation.

preference_estimate

Estimates for arm preferences.

query_pairs(self) → None

Duel every active pair of arms once.

The order of the arms does not matter; even if both orderings of a pair are present, the pair should only be compared once.

step(self) → None

Advance the algorithm by one step.

is_finished(self) → bool

Determine whether the verification is completed.

has_succeeded(self) → bool

Determine if the verification succeeded, if it completed.

Returns

True for success.

Return type

bool

run(self) → None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

explore(self) → None

Advance the algorithm by one step.

This executes one explore step of the explore-then-exploit approach. This should not be confused with the interleaved explore and verify phases of the algorithm.

exploration_finished(self) → bool

Determine whether the verification is completed.

get_condorcet_winner(self) → Optional[int]

Get the arm with the highest probability of being the first in a ranking of the arms.

Returns

The best arm, None if has not been calculated yet.

Return type

Optional[int]

exploit(self) → None

Run one step of exploitation.

abstract step(self) → None

Run one step of the algorithm.

This corresponds to a logical step of the algorithm and may perform multiple comparisons. What exactly a “logical step” is depends on the algorithm.

The feedback mechanism is wrapped with a BudgetedFeedbackMechanism; conducting duels within a step may therefore raise an exception if a time horizon is specified.

If you want to run an algorithm you should use the run function instead.

Raises

TimeBudgetExceededException – If the step has been terminated early due to the time horizon. The exception class is local to the object. Different instances raise different exceptions. The exception class for this algorithm can be accessed through the exception_class attribute of the wrapped feedback mechanism.

is_finished(self) → bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self) → None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class InterleavedFiltering(feedback_mechanism: duelpy.feedback.FeedbackMechanism, time_horizon: int, random_state: numpy.random.RandomState = None)

Bases: duelpy.algorithms.interfaces.CondorcetProducer, duelpy.algorithms.interfaces.PacAlgorithm

Implements the Interleaved Filtering algorithm.

This algorithm finds the Condorcet winner.

A total order over arms, strong stochastic transitivity and the stochastic triangle inequality are assumed.

If the Condorcet winner is not eliminated (elimination only happens with low probability), the expected regret is bounded by \(\mathcal{O}\left(\frac{N}{\epsilon_\ast} \log(T)\right)\). \(\epsilon_\ast\) refers to the probability of the best arm winning against the second best arm minus \(\frac{1}{2}\).

The algorithm is explained in [YBKJ12].

Exploration:

Interleaved Filtering follows a sequential elimination approach in the exploration phase and thereby finds the best arm with a probability of at least \(1-\frac{1}{T}\), where \(T\) is the time horizon. In each time step, the algorithm selects a candidate arm and compares it with all the other arms in a one-versus-all manner. If the algorithm selects an arm \(a\) (candidate arm), then it compares all the other arms with \(a\). If there exists any arm \(b\) such that the upper confidence bound of \(a\) beating \(b\) is less than \(\frac{1}{2}\), then arm \(a\) is eliminated and arm \(b\) becomes the candidate arm and is compared to all other active arms. A pruning technique eliminates an arm \(b\) if the lower confidence bound of \(a\) beating \(b\) is greater than \(\frac{1}{2}\), since with high probability \(b\) cannot be the best arm. After the exploration, the candidate arm and the total number of comparisons are given as output.

Exploitation:

If the total number of comparisons is less than the given time horizon, the algorithm enters the exploitation phase. In the exploitation phase, only the best arm from the exploration phase is pulled and compared to itself, assuming that the exploration found the best arm.
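
The confidence-bound test used during exploration can be illustrated with a small sketch (not the duelpy implementation); the counts and the failure probability are example values.

import math

# Illustrative sketch of the confidence-bound test used in the exploration
# phase described above (not the duelpy implementation). The counts and the
# failure probability delta are example values.

def confidence_bounds(wins_of_a_over_b, comparisons, delta):
    estimate = wins_of_a_over_b / comparisons
    radius = math.sqrt(math.log(1 / delta) / comparisons)  # Hoeffding-style radius
    return estimate - radius, estimate + radius

lower, upper = confidence_bounds(wins_of_a_over_b=20, comparisons=100, delta=0.01)
if upper < 0.5:
    print("candidate a is eliminated; b becomes the new candidate")
elif lower > 0.5:
    print("arm b is pruned; it cannot be the best arm with high probability")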

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • time_horizon – For how many time steps the algorithm should run; must be greater than or equal to the number of arms.

  • random_state – A numpy random state. Defaults to an unseeded state when not specified.

wrapped_feedback

The feedback_mechanism parameter with an added decorator. This feedback mechanism will raise an exception if a time horizon is given and a duel would exceed it. The exception is caught in the run function.

failure_probability

Allowed failure-probability (\(\delta\)), i.e. probability that the actual value lies outside of the computed confidence interval. Derived from the Hoeffding bound.

candidate_arm

Randomly selected arm (corresponds to \(\hat{b}\) in [YBKJ12]) from the list of arms.

arms_without_candidate

The remaining set of arms (corresponds to \(W\) in [YBKJ12]) after removing the candidate arm.

preference_estimate

Estimation of a preference matrix based on samples.

time_horizon

Examples

Define a preference-based multi-armed bandit problem through a preference matrix:

>>> from duelpy.feedback import MatrixFeedback
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1],
...     [0.9, 0.5, 0.3],
...     [0.9, 0.7, 0.5],
... ])
>>> random_state=np.random.RandomState(3)
>>> feedback_mechanism = MatrixFeedback(preference_matrix, random_state=random_state)
>>> time_horizon = 1500
>>> interleaved_filtering = InterleavedFiltering(feedback_mechanism, time_horizon, random_state=random_state)
>>> interleaved_filtering.run()
>>> condorcet_winner = interleaved_filtering.get_condorcet_winner()
>>> condorcet_winner
2
get_condorcet_winner(self)Optional[int]

Return the estimated Condorcet winner, assuming the algorithm has already run.

Returns

The condorcet winner in the set of arms given to the algorithm.

Return type

candidate_arm

explore(self)None

Execute one round of exploration.

exploration_finished(self)bool

Determine whether the exploration phase is finished.

If no time horizon is provided, this coincides with is_finished. Once this function returns True, the algorithm will have finished computing its estimated Condorcet winner.

exploit(self)None

Run one step of exploitation.

abstract step(self)None

Run one step of the algorithm.

This corresponds to a logical step of the algorithm and may perform multiple comparisons. What exactly a “logical step” is depends on the algorithm.

The feedback mechanism is wrapped with a BudgetedFeedbackMechanism therefore conducting duels within a step may raise an exception if a time horizon is specified.

If you want to run an algorithm you should use the run function instead.

Raises

TimeBudgetExceededException – If the step has been terminated early due to the time horizon. The exception class is local to the object. Different instances raise different exceptions. The exception class for this algorithm can be accessed through the exception_class attribute of the wrapped feedback mechanism.

is_finished(self)bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self)None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class KLDivergenceBasedPAC(feedback_mechanism: duelpy.feedback.feedback_mechanism.FeedbackMechanism, time_horizon: Optional[int] = None, random_state: Optional[numpy.random.RandomState] = None, epsilon: float = 0.51, failure_probability: float = 0.05, preference_estimate: Optional[duelpy.stats.PreferenceEstimate] = None)

Bases: duelpy.algorithms.interfaces.SingleCopelandProducer, duelpy.algorithms.interfaces.PacAlgorithm

Implement the KL-divergence based PAC algorithm.

The goal of the algorithm is to find a Copeland winner in a PAC setting i.e. the algorithm finds an \(\epsilon\)-Copeland winner with probability at least \(1 - \delta\).

It is assumed that there are no ties between arms, i.e. the probability of any arm winning against another arm is never \(\frac{1}{2}\).

The bound on the expected regret is given as \(\mathcal{O}\left(\frac{1 - \mu_i}{(\Delta_i^\epsilon)^2} \ln\left(\frac{N}{\delta\Delta_i\epsilon}\right)\right)\). \(N\) is the number of arms, \(\epsilon\) is the error parameter and \(\delta\) is the failure probability. \(\mu_i\) is the expected reward of arm \(a_i\). Let \(a_1\) be the arm which generates the maximum reward \(\mu_1\). \(\Delta_i = \max(\mathit{cpld}(a_1)-\mathit{cpld}(a_i), \frac{1}{N-1})\) where \(\mathit{cpld}(a_i)\) is the Copeland score of arm \(a_i\). \(\Delta_i^\epsilon = \max(\Delta_i, \epsilon(1 - \mathit{cpld}(a_1)))\).

This was originally introduced as a component of the ScalableCopelandBandits algorithm described in [ZKWdR15]. This implementation is based on Algorithm 2 (which further uses Algorithm 4) from the same paper. This algorithm uses KL-Divergence in the process of finding an approximate Copeland winner. An additional condition is used to terminate the exploration phase. This additional condition checks whether there is only one Copeland winner candidate left. If yes, then the exploration phase is stopped. This is done to ensure that the KL divergence based algorithm begins with the exploitation phase (i.e. dueling a PAC-Copeland winner against itself) earlier in some cases. Otherwise, it might continue to duel the one remaining candidate against random opponents for a while, which could incur a higher regret in the primary algorithm ScalableCopelandBandits.

The algorithm finds a Copeland winner based on the smallest and greatest probability distribution that has a low KL-Divergence (less than or equal to the value of \(\ln\left(\frac{4tN}{\delta}\right) + 2 \ln \ln(t)\) in Algorithm 4 in the paper) to the estimated preference probabilities.
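For illustration only, the Bernoulli KL divergence and the confidence threshold quoted above can be written down as follows; the function names are hypothetical and not part of duelpy:

>>> import math
>>> def bernoulli_kl(p, q):
...     # KL divergence between Bernoulli(p) and Bernoulli(q); terms with p in {0, 1} vanish
...     result = 0.0
...     if p > 0:
...         result += p * math.log(p / q)
...     if p < 1:
...         result += (1 - p) * math.log((1 - p) / (1 - q))
...     return result
>>> def exploration_threshold(time_step, num_arms, failure_probability):
...     # right-hand side of the confidence condition from Algorithm 4 in [ZKWdR15]
...     return math.log(4 * time_step * num_arms / failure_probability) + 2 * math.log(math.log(time_step))
>>> round(bernoulli_kl(0.7, 0.5), 3)
0.082
>>> round(exploration_threshold(100, 3, 0.05), 1)
13.1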

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • time_horizon – Number of time steps to execute for. Corresponds to \(T\) in [ZKWdR15]. Defaults to None when not specified.

  • epsilon – The optimality of the winning arm. Corresponds to \(\epsilon\) in [ZKWdR15].

  • failure_probability – Upper bound on the probability of failure. Corresponds to \(\delta\) in [ZKWdR15].

  • random_state – A numpy random state. Defaults to an unseeded state when not specified.

  • preference_estimate – A PreferenceEstimate object is needed if this algorithm is used as a subroutine and the result is required to be stored in further rounds. The confidence radius of the given preference estimate will be overridden. Pass None if the algorithm should start from a new preference estimate. Defaults to None.

Examples

Define a preference-based multi-armed bandit problem through a preference matrix:

>>> from duelpy.feedback import MatrixFeedback
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1],
...     [0.9, 0.5, 0.3],
...     [0.9, 0.7, 0.5]
... ])
>>> arms = list(range(len(preference_matrix)))
>>> random_state = np.random.RandomState(20)
>>> feedback_mechanism = MatrixFeedback(preference_matrix, arms, random_state)
>>> kldpac = KLDivergenceBasedPAC(feedback_mechanism=feedback_mechanism,
...     epsilon=0,failure_probability=0.05,random_state=random_state)
>>> kldpac.run()

The best arm in this case is the last arm (index 2)

>>> kldpac.wrapped_feedback.duels_conducted
89
>>> kldpac.get_copeland_winner()
2
get_copeland_winner(self)Optional[int]

Get the Copeland winner.

Returns

The index of the estimated Copeland winner. None, if a winner could not be determined.

Return type

Optional[int]

exploration_finished(self)bool

Determine whether exploration is finished.

This termination condition is based on the condition that is used in Algorithm 4 in [ZKWdR15]. In addition to that, it is also checked whether there exists only one Copeland winner candidate in order to end the exploration phase and start with the exploitation phase as per the explore-then-exploit principle.

Returns

True if the exploration should stop/has been stopped, False otherwise.

Return type

bool

explore(self)None

Explore the given set of arms to obtain a Copeland winner.

exploit(self)None

Run one step of exploitation.

abstract step(self)None

Run one step of the algorithm.

This corresponds to a logical step of the algorithm and may perform multiple comparisons. What exactly a “logical step” is depends on the algorithm.

The feedback mechanism is wrapped with a BudgetedFeedbackMechanism therefore conducting duels within a step may raise an exception if a time horizon is specified.

If you want to run an algorithm you should use the run function instead.

Raises

TimeBudgetExceededException – If the step has been terminated early due to the time horizon. The exception class is local to the object. Different instances raise different exceptions. The exception class for this algorithm can be accessed through the exception_class attribute of the wrapped feedback mechanism.

is_finished(self)bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self)None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class KnockoutTournament(feedback_mechanism: duelpy.feedback.FeedbackMechanism, time_horizon: Optional[int] = None, epsilon: float = 0.05, failure_probability: float = 0.1, stochasticity: float = 0.6)

Bases: duelpy.algorithms.interfaces.CondorcetProducer, duelpy.algorithms.interfaces.PacAlgorithm

Implementation of the knockout tournament algorithm.

The goal of this algorithm is to find the \(\epsilon\)-Condorcet winner while minimizing the number of comparisons.

The algorithm assumes a total order over the existing arms and that strong stochastic transitivity, stochastic triangle inequality and relaxed stochastic transitivity hold.

The amount of pairwise comparisons made by the algorithm is bound by \(\mathcal{O}\left(\frac{N}{\epsilon^2}\left(1+\log\frac{1}{\delta}\right)\right)\), where \(N\) is the number of arms, \(\epsilon\) the maximal deviation from the solution and \(\delta\) is the error probability.

The algorithm was originally introduced in [FOPS17]. It is an \(\epsilon\)-\(\delta\)-PAC algorithm. It takes the set of arms as input and compares them in rounds; at the end of each round, the number of remaining arms is halved. The winner of a pairing is decided based on the allowed sub-optimality \(\epsilon\) and a confidence interval derived from the failure probability. In each round, the algorithm randomly pairs up the remaining arms, and the winners proceed to the next round. For example, with four arms [A, B, C, D], the algorithm might group [A, B] as the first pair and [C, D] as the second pair. It then determines the winner of each pair, and those winners move on to the next round.
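The pairing step can be sketched as follows. This is an illustration rather than the duelpy implementation; the duel_winner callable stands in for the repeated-comparison subroutine that determines the winner of a pair:

>>> import numpy as np
>>> def knockout_round(arms, duel_winner, random_state):
...     # Randomly pair up the remaining arms and keep only the winner of each pair.
...     arms = list(arms)
...     random_state.shuffle(arms)
...     winners = [duel_winner(first, second) for first, second in zip(arms[::2], arms[1::2])]
...     if len(arms) % 2 == 1:  # an unpaired arm advances automatically
...         winners.append(arms[-1])
...     return winners
>>> winners = knockout_round(["A", "B", "C", "D"], duel_winner=min, random_state=np.random.RandomState(1))
>>> len(winners), "A" in winners  # "A" wins any duel under min, so it always advances
(2, True)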

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • time_horizon – The number of steps the algorithm is supposed to run for.

  • epsilon – The optimality of the winning arm. Corresponds to \(\epsilon\) in [FOPS17]. Default value is 0.05 which has been used in the experiments in [FOPS17].

  • failure_probability – The probability that the result is not an \(\epsilon\)-Condorcet winner. Corresponds to \(\delta\) in [FOPS17]. Default value is 0.1 which has been used in the experiments in [FOPS17].

  • stochasticity – The assumed stochastic transitivity parameter. Corresponds to \(\gamma\) in [FOPS17]. Default value is 0.6 which has been used in the experiments in [FOPS17].

wrapped_feedback

The feedback_mechanism parameter with an added decorator. This feedback mechanism will raise an exception if a time horizon is given and a duel would exceed it. The exception is caught in the run function.

tournament_arms

The arms that are still in the tournament.

epsilon
failure_probability
stochasticity
time_step

Number of rounds the algorithm has executed.

Examples

Define a preference-based multi-armed bandit problem through a preference matrix:

>>> from duelpy.feedback import MatrixFeedback
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1],
...     [0.9, 0.5, 0.3],
...     [0.9, 0.7, 0.5],
... ])
>>> feedback_mechanism = MatrixFeedback(preference_matrix, random_state=np.random.RandomState(100))
>>> knockout_tournament = KnockoutTournament(feedback_mechanism, epsilon=0.1, failure_probability=0.3, stochasticity=0.6, time_horizon=300)
>>> knockout_tournament.run()
>>> best_arm = knockout_tournament.get_condorcet_winner()
>>> best_arm
2
>>> knockout_tournament.wrapped_feedback.duels_conducted
300

In this example the \(\epsilon\)-Condorcet winner is the arm with index 2.

explore(self)None

Execute one round of the algorithm.

Arms are paired into groups, and each pair of arms is compared repeatedly.

exploration_finished(self)bool

Determine if the exploration is finished.

The execution is finished when the time horizon is reached or when no time horizon was given and the Condorcet winner has been found.

Returns

Whether the algorithm is finished.

Return type

bool

get_condorcet_winner(self)Optional[int]

Get the estimated Condorcet winner if it is ready.

Returns

The index of the winning arm

Return type

int

exploit(self)None

Run one step of exploitation.

abstract step(self)None

Run one step of the algorithm.

This corresponds to a logical step of the algorithm and may perform multiple comparisons. What exactly a “logical step” is depends on the algorithm.

The feedback mechanism is wrapped with a BudgetedFeedbackMechanism therefore conducting duels within a step may raise an exception if a time horizon is specified.

If you want to run an algorithm you should use the run function instead.

Raises

TimeBudgetExceededException – If the step has been terminated early due to the time horizon. The exception class is local to the object. Different instances raise different exceptions. The exception class for this algorithm can be accessed through the exception_class attribute of the wrapped feedback mechanism.

is_finished(self)bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self)None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class MallowsMPI(feedback_mechanism: duelpy.feedback.FeedbackMechanism, time_horizon: Optional[int] = None, random_state: Optional[numpy.random.RandomState] = None, failure_probability: float = 0.05)

Bases: duelpy.algorithms.interfaces.CondorcetProducer, duelpy.algorithms.interfaces.PacAlgorithm

Implementation of the Mallows Most Preferred Item algorithm.

This algorithm finds the Condorcet winner with a given error probability.

It is assumed that the arms are sampled from a Mallows distribution, which is stricter than the total order assumption. See [BFHS14] for details on this distribution.

The number of pairwise arm comparisons is bounded by \(\mathcal{O}\left(\frac{N}{\rho^2}\log\frac{N}{\delta\rho}\right)\), where \(N\) is the number of arms and \(\delta\) is the given error probability. The parameter \(\rho\) depends on the Mallows distribution parameter \(\phi\) as follows: \(\rho=\frac{1-\phi}{1+\phi}\).

This algorithm is part of the (\(\epsilon\), \(\delta\))-PAC class of algorithms, with \(\epsilon = 0\). The Condorcet winner is determined as the arm ranked first with the highest probability in the Mallows distribution. The algorithm proceeds by selecting a random arm and comparing it against another arm until one of them can be considered worse than the other with sufficient confidence. The worse arm is discarded and the winner is compared against a new randomly chosen arm. This continues until only one arm is left, which is then returned. See [BFHS14] for more details.
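The elimination loop can be sketched as follows. This is an illustration, not the duelpy implementation; duel_until_confident stands in for the repeated comparisons with the Mallows-based stopping rule:

>>> import numpy as np
>>> def most_preferred(arms, duel_until_confident, random_state):
...     # Keep a current candidate and challenge it with the remaining arms one at a time.
...     remaining = list(arms)
...     random_state.shuffle(remaining)
...     candidate = remaining.pop()
...     while remaining:
...         candidate = duel_until_confident(candidate, remaining.pop())
...     return candidate
>>> most_preferred([0, 1, 2, 3], duel_until_confident=max, random_state=np.random.RandomState(5))
3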

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment

  • time_horizon – Optional, the maximum amount of arm comparisons to execute. This may be exceeded, but will always be reached.

  • random_state – Optional, used for random choices in the algorithm.

  • failure_probability – An upper bound on the acceptable probability to fail, also called \(\delta\) in [BFHS14].

wrapped_feedback

The feedback_mechanism parameter with an added decorator. This feedback mechanism will raise an exception if a time horizon is given and a duel would exceed it. The exception is caught in the run function.

failure_probability
random_state
preference_estimate

Estimates for arm preferences.

Examples

Find the Condorcet winner in this example:

>>> from duelpy.feedback import MatrixFeedback
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1],
...     [0.9, 0.5, 0.3],
...     [0.9, 0.7, 0.5],
... ])
>>> random_state = np.random.RandomState(3)
>>> feedback_mechanism = MatrixFeedback(preference_matrix,
...                                       random_state=random_state)
>>> mallows = MallowsMPI(feedback_mechanism, random_state=random_state, failure_probability=0.9)
>>> mallows.run()
>>> comparisons = mallows.wrapped_feedback.duels_conducted
>>> arm = mallows.get_condorcet_winner()
>>> arm, comparisons
(2, 65)
explore(self)None

Explore arms by advancing the sorting algorithm.

get_condorcet_winner(self)Optional[int]

Get the arm with the highest probability of being the first in a ranking of the arms.

Returns

The best arm, None if has not been calculated yet.

Return type

Optional[int]

exploration_finished(self)bool

Determine whether the best arm has been found.

exploit(self)None

Run one step of exploitation.

abstract step(self)None

Run one step of the algorithm.

This corresponds to a logical step of the algorithm and may perform multiple comparisons. What exactly a “logical step” is depends on the algorithm.

The feedback mechanism is wrapped with a BudgetedFeedbackMechanism therefore conducting duels within a step may raise an exception if a time horizon is specified.

If you want to run an algorithm you should use the run function instead.

Raises

TimeBudgetExceededException – If the step has been terminated early due to the time horizon. The exception class is local to the object. Different instances raise different exceptions. The exception class for this algorithm can be accessed through the exception_class attribute of the wrapped feedback mechanism.

is_finished(self)bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self)None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class MallowsMPR(feedback_mechanism: duelpy.feedback.FeedbackMechanism, time_horizon: Optional[int] = None, random_state: Optional[numpy.random.RandomState] = None, failure_probability: float = 0.05, sorting_algorithm: Optional[Type[duelpy.util.sorting.SortingAlgorithm]] = None)

Bases: duelpy.algorithms.interfaces.CopelandRankingProducer, duelpy.algorithms.interfaces.PacAlgorithm

Implementation of Mallows Most Probable Ranking Algorithm.

This algorithm computes a Copeland ranking with a given error probability.

It is assumed that the arms are sampled from a Mallows distribution, which is stricter than the total order assumption. See [BFHS14] for details on this distribution.

The number of pairwise arm comparisons is bounded by \(\mathcal{O}\left(\frac{N \log_2(N)}{\rho^2}\log\frac{N \log_2(N)}{\delta\rho}\right)\), where \(N\) is the number of arms and \(\delta\) is the given error probability. The parameter \(\rho\) depends on the Mallows distribution parameter \(\phi\) as follows: \(\rho=\frac{1-\phi}{1+\phi}\).

This algorithm recursively builds a Copeland ranking over the arms by sorting them using either Mergesort or Quicksort. Arms are compared repeatedly until sufficient confidence is obtained; this confidence is based on the Mallows distribution.

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment

  • time_horizon – Optional, the maximum amount of arm comparisons to execute. This may be exceeded, but will always be reached.

  • random_state – Used for the random pivot selection in the Quicksort mode.

  • failure_probability – An upper bound on the acceptable probability to fail, called \(\delta\) in [BFHS14].

  • sorting_algorithm – Determines which sorting algorithm should be used, Mergesort or Quicksort.

wrapped_feedback

The feedback_mechanism parameter with an added decorator. This feedback mechanism will raise an exception if a time horizon is given and a duel would exceed it. The exception is caught in the run function.

random_state
failure_probability
sorting_algorithm
preference_estimate

Estimates for arm preferences.

Examples

Find the Condorcet winner in this example:

>>> from duelpy.feedback import MatrixFeedback
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1, 0.1, 0.1],
...     [0.9, 0.5, 0.3, 0.2, 0.2],
...     [0.9, 0.7, 0.5, 0.8, 0.9],
...     [0.9, 0.8, 0.2, 0.5, 0.2],
...     [0.9, 0.8, 0.1, 0.8, 0.5]
... ])
>>> random_state = np.random.RandomState(3)
>>> feedback_mechanism = MatrixFeedback(preference_matrix,
...                                       random_state=random_state)
>>> mallows = MallowsMPR(feedback_mechanism, random_state=random_state, failure_probability=0.9)
>>> mallows.run()
>>> comparisons = mallows.wrapped_feedback.duels_conducted
>>> ranking = mallows.get_ranking()
>>> ranking, comparisons
([2, 4, 3, 1, 0], 571)
explore(self)None

Explore arms by advancing the sorting algorithm.

exploration_finished(self)bool

Determine whether the ranking has been found.

get_ranking(self)Optional[List[int]]

Get the computed ranking.

Returns

The ranking, None if it has not been calculated yet.

Return type

Optional[List[int]]

exploit(self)None

Run one step of exploitation.

abstract step(self)None

Run one step of the algorithm.

This corresponds to a logical step of the algorithm and may perform multiple comparisons. What exactly a “logical step” is depends on the algorithm.

The feedback mechanism is wrapped with a BudgetedFeedbackMechanism therefore conducting duels within a step may raise an exception if a time horizon is specified.

If you want to run an algorithm you should use the run function instead.

Raises

TimeBudgetExceededException – If the step has been terminated early due to the time horizon. The exception class is local to the object. Different instances raise different exceptions. The exception class for this algorithm can be accessed through the exception_class attribute of the wrapped feedback mechanism.

is_finished(self)bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self)None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class MergeRUCB(feedback_mechanism: duelpy.feedback.FeedbackMechanism, partition_size: int = 4, failure_probability: float = 0.01, exploratory_constant: float = 1.01, time_horizon: Optional[int] = None, random_state: Optional[numpy.random.RandomState] = None)

Bases: duelpy.algorithms.interfaces.CondorcetProducer, duelpy.algorithms.interfaces.PacAlgorithm

Implementation of the Merge Relative Upper Confidence Bound algorithm.

The goal of the algorithm is to find the Condorcet winner while incurring minimum regret with minimum comparisons between the arms.

It is assumed that the Condorcet winner exists.

The regret is bounded by \(\mathcal{O}\left(N \log T\right)\) where \(N\) is the number of arms and \(T\) is the time horizon.

The algorithm is described in the paper [MZdR15]. Dueling bandit algorithms have to learn about the preference relation through pairwise comparisons and therefore often have a worst-case sample complexity that scales with the square of the number of arms. MergeRUCB avoids this with a divide-and-conquer strategy: it divides the set of arms into multiple small batches, “solves” these smaller problems, and then merges the results. The arms are partitioned into small batches based on the partition size \(P\), which should be at least 4. The arms within each small batch are dueled and the weak arms are dropped. After pruning, the batches are sorted and merged so that the size of each new batch lies between \(\frac{P}{2}\) and \(\frac{3P}{2}\).
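The divide step can be illustrated as follows; this sketch is not the duelpy implementation and only shows how the arms could be split into batches of roughly partition_size arms before the dueling and merging phases:

>>> def partition_into_batches(arms, partition_size):
...     # Split the arms into consecutive batches of (at most) partition_size arms each.
...     return [arms[i:i + partition_size] for i in range(0, len(arms), partition_size)]
>>> partition_into_batches(list(range(10)), partition_size=4)
[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

The algorithm then duels within each batch, prunes weak arms and merges small batches so that the batch sizes stay between \(\frac{P}{2}\) and \(\frac{3P}{2}\).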

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • exploratory_constant – The confidence radius grows proportional to the square root of this value.

  • partition_size – Initial size of the batches.

  • failure_probability – Probability of failure.

  • random_state – Optional, used for random choices in the algorithm.

wrapped_feedback

The feedback_mechanism parameter with an added decorator. This feedback mechanism will raise an exception if a time horizon is given and a duel would exceed it. The exception is caught in the run function.

partition_size
random_state
preference_estimate

Stores estimates of arm preferences

Examples

Define a preference-based multi-armed bandit problem through a preference matrix:

>>> from duelpy.feedback import MatrixFeedback
>>> from duelpy.stats.metrics import AverageRegret
>>> from duelpy.util.feedback_decorators import MetricKeepingFeedbackMechanism
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1, 0.1, 0.1],
...     [0.9, 0.5, 0.3, 0.2, 0.2],
...     [0.9, 0.7, 0.5, 0.8, 0.9],
...     [0.9, 0.8, 0.2, 0.5, 0.2],
...     [0.9, 0.8, 0.1, 0.8, 0.5]
... ])
>>> random_state=np.random.RandomState(43)
>>> feedback_mechanism = MetricKeepingFeedbackMechanism(
...     MatrixFeedback(preference_matrix, random_state=random_state),
...     metrics={"average_regret": AverageRegret(preference_matrix)}
... )
>>> test_object = MergeRUCB(
...  feedback_mechanism=feedback_mechanism,
...  exploratory_constant=1.01,
...  random_state=random_state,
...  failure_probability=0.01)
>>> test_object.run()
>>> test_object.get_condorcet_winner()
2
>>> test_object.wrapped_feedback.duels_conducted
677
>>> np.round(np.sum(feedback_mechanism.results["average_regret"]), 2)
145.3
exploration_finished(self)bool

Determine whether the exploration phase is finished.

The exploration is finished once only a single batch with a single arm remains. This termination behavior is not explicitly specified in Algorithm 1 of [MZdR15] but it is mentioned in the corresponding explanation (Section 4).

Returns

True if the exploration phase is finished.

Return type

bool

explore(self)None

Do one step of exploration.

get_condorcet_winner(self)Optional[int]

Determine a Condorcet winner using the MergeRUCB algorithm.

Returns

The index of a Condorcet winner, if existent, among the given arms.

Return type

Optional[int]

exploit(self)None

Run one step of exploitation.

abstract step(self)None

Run one step of the algorithm.

This corresponds to a logical step of the algorithm and may perform multiple comparisons. What exactly a “logical step” is depends on the algorithm.

The feedback mechanism is wrapped with a BudgetedFeedbackMechanism therefore conducting duels within a step may raise an exception if a time horizon is specified.

If you want to run an algorithm you should use the run function instead.

Raises

TimeBudgetExceededException – If the step has been terminated early due to the time horizon. The exception class is local to the object. Different instances raise different exceptions. The exception class for this algorithm can be accessed through the exception_class attribute of the wrapped feedback mechanism.

is_finished(self)bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self)None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class Multisort(feedback_mechanism: duelpy.feedback.FeedbackMechanism, time_horizon: int, random_state: Optional[numpy.random.RandomState] = None)

Bases: duelpy.algorithms.interfaces.CopelandRankingProducer

Implements the Multisort algorithm.

The goal of the algorithm is to find a ranking using Copeland aggregation on a set of rankings returned by the QuickSort algorithm.

It is assumed that the arms are distributed according to a Bradley-Terry distribution with parameter \(\theta\). This parameter is assumed to be sampled via a Poisson point process with given rate \(\lambda\).

Theorem 2 in section 3.1 in [MG17] states that all but a vanishing fraction of the items are correctly ranked using \(\mathcal{O}\left(\lambda^2 N\log^6 N\right)\) comparisons, where \(N\) refers to the number of arms and \(\lambda\) is the Poisson point process rate.

This algorithm recursively builds a Copeland ranking over the arms by sorting them with QuickSort, using a random pivot element in each time step. QuickSort returns a partial ranking of the pairwise comparisons and terminates, with high probability, after sampling \(\mathcal{O}(n\log n)\) comparisons. The Copeland scores are aggregated over the time horizon \(T\), and an aggregated Copeland ranking is produced from these scores. Multisort is neither a PAC (sample-complexity minimizing) algorithm nor a regret-minimizing algorithm. Instead, it tries to produce the best possible result within the given time horizon. This differs from the PAC setting, since it requires a time horizon; the probability of failure and the accuracy of the result are implicitly set by this time horizon. It differs from the regret-minimizing setting since it never exploits its gathered knowledge: it always “explores” and tries to find a more accurate result, as long as the time horizon allows and regardless of the regret incurred during exploration.

See [MG17] for details.
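The Copeland aggregation step can be sketched as follows; this is an illustration with hypothetical helper names, not the duelpy implementation. An arm scores a point against every arm it beats in a majority of the collected rankings, and the final ranking sorts by these scores:

>>> from itertools import combinations
>>> def copeland_aggregation(rankings, arms):
...     # position of every arm in every collected ranking
...     positions = [{arm: ranking.index(arm) for arm in arms} for ranking in rankings]
...     scores = {arm: 0 for arm in arms}
...     for first, second in combinations(arms, 2):
...         # pairwise majority: in how many rankings is `first` placed above `second`?
...         wins_first = sum(pos[first] < pos[second] for pos in positions)
...         if wins_first > len(rankings) / 2:
...             scores[first] += 1
...         elif wins_first < len(rankings) / 2:
...             scores[second] += 1
...     return sorted(arms, key=lambda arm: scores[arm], reverse=True)
>>> copeland_aggregation([[2, 4, 3, 1, 0], [2, 3, 4, 1, 0], [4, 2, 3, 0, 1]], arms=[0, 1, 2, 3, 4])
[2, 4, 3, 1, 0]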

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment

  • time_horizon – The maximum amount of arm comparisons to execute. This may be exceeded, but will always be reached.

  • random_state – Used for the random pivot selection in the Quicksort mode.

wrapped_feedback

The feedback_mechanism parameter with an added decorator. This feedback mechanism will raise an exception if a time horizon is given and a duel would exceed it. The exception is caught in the run function.

random_state

Examples

>>> from duelpy.feedback import MatrixFeedback
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1, 0.1, 0.1],
...     [0.9, 0.5, 0.3, 0.2, 0.2],
...     [0.9, 0.7, 0.5, 0.8, 0.9],
...     [0.9, 0.8, 0.2, 0.5, 0.2],
...     [0.9, 0.8, 0.1, 0.8, 0.5]
... ])
>>> random_state = np.random.RandomState(3)
>>> feedback_mechanism = MatrixFeedback(preference_matrix, random_state=random_state)
>>> multisort = Multisort(feedback_mechanism=feedback_mechanism, time_horizon=1000, random_state=random_state)
>>> multisort.run()
>>> comparisons = multisort.wrapped_feedback.duels_conducted
>>> ranking = multisort.get_ranking()
>>> ranking, comparisons
([2, 4, 3, 1, 0], 1000)
explore(self)None

Explore arms by advancing the sorting algorithm.

step(self)None

Execute one step of the algorithm.

get_ranking(self)Optional[List[int]]

Get the Copeland aggregation ranking.

Returns

The ranking, None if it has not been calculated yet.

Return type

Optional[List[int]]

exploit(self)None

Run one step of exploitation.

is_finished(self)bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self)None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class OptMax(feedback_mechanism: duelpy.feedback.feedback_mechanism.FeedbackMechanism, failure_probability: float = 0.1, time_horizon: Optional[int] = None, epsilon_range: float = 0.05, random_state: numpy.random.RandomState = RandomState())

Bases: duelpy.algorithms.interfaces.SingleCopelandProducer, duelpy.algorithms.interfaces.PacAlgorithm

Implement the OptMax algorithm.

The goal of this algorithm is to find an \(\epsilon\)-maximum arm among the given arms.

It assumes moderate stochastic transitivity.

As per Theorem 8 in the paper [FJO+18], the OptMax algorithm takes \(\mathcal{O}\left(\frac{N}{\epsilon^2} \log\left(\frac{1}{\delta}\right)\right)\) comparisons to find an \(\epsilon\)-maximum arm. \(N\) is the number of arms.

This algorithm computes the \(\epsilon\)-maximum arm by choosing one of the methods among _pick_anchor_for_lower_range or _pick_anchor_for_medium_range or _pick_anchor_for_higher_range. These methods are selected based on the failure_probability \(\delta\), which is calculated using the number of arms \(N\) given to the algorithm. Here \(\epsilon = \epsilon_u - \epsilon_l\) and \(\epsilon_u\) and \(\epsilon_l\) are upper and lower bias respectively.

The algorithm as presented in [FJO+18] either finds an arm which is a \(\frac{2\epsilon}{3}\)-maximum arm, or uses Sequential Elimination to find an arm which is \(\epsilon\)-maximum against the other arms.

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • failure_probability – Refer to \(\delta\) in [FJO+18].

  • time_horizon – Sets a limit to the number of comparisons that the algorithm will make.

  • epsilon_range – Refer to \(\epsilon\) in [FJO+18]

  • random_state – A numpy random state. Defaults to an unseeded state when not specified.

Raises

ValueError – Raised when the wrong set of arms is given to the algorithm.

Examples

Define a preference-based multi-armed bandit problem through a preference matrix:

>>> from duelpy.feedback import MatrixFeedback
>>> import numpy as np
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1],
...     [0.9, 0.5, 0.5],
...     [0.9, 0.5, 0.5],
... ])
>>> preferred_arms=[1,2]
>>> random_state_user = np.random.RandomState()
>>> feedback_mechanism = MatrixFeedback(preference_matrix, random_state=random_state_user)
>>> opt_max = OptMax(feedback_mechanism=feedback_mechanism, failure_probability=0.1, random_state=random_state_user)
>>> opt_max.run()
>>> best_arm = opt_max.get_copeland_winner()
>>> best_arm in preferred_arms
True
exploration_finished(self)bool

Determine whether algorithm has completed exploration.

Returns

Exploration finished or not.

Return type

bool

explore(self)None

Run one step of exploration.

Exploration is divided into 3 parts. For more details, refer to Algorithm 4 of [FJO+18].

get_copeland_winner(self)Optional[int]

Return the arm chosen by the algorithm as Copeland winner.

Returns

  • None – If the algorithm has not concluded.

  • int – If the algorithm has found the winner.

exploit(self)None

Run one step of exploitation.

abstract step(self)None

Run one step of the algorithm.

This corresponds to a logical step of the algorithm and may perform multiple comparisons. What exactly a “logical step” is depends on the algorithm.

The feedback mechanism is wrapped with a BudgetedFeedbackMechanism therefore conducting duels within a step may raise an exception if a time horizon is specified.

If you want to run an algorithm you should use the run function instead.

Raises

TimeBudgetExceededException – If the step has been terminated early due to the time horizon. The exception class is local to the object. Different instances raise different exceptions. The exception class for this algorithm can be accessed through the exception_class attribute of the wrapped feedback mechanism.

is_finished(self)bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self)None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class PlackettLuceAMPR(feedback_mechanism: duelpy.feedback.FeedbackMechanism, time_horizon: Optional[int] = None, random_state: Optional[numpy.random.RandomState] = None, failure_probability: float = 0.05, epsilon: float = 0.1)

Bases: duelpy.algorithms.interfaces.CopelandRankingProducer, duelpy.algorithms.interfaces.PacAlgorithm

Implementation of the Plackett-Luce Approximate Most Probable Ranking algorithm, which computes a ranking over the arms.

This algorithm assumes the arms are sampled from the Plackett-Luce distribution. This distribution assigns utilities to arms, from which win probabilities can be inferred. For more information, see [SBFPH15].

To compute a ranking, the algorithm repeats the following steps. First, the arms are divided into connected components, that is, groups whose confidence intervals around the estimated ranking positions overlap. Arms in the same group cannot be ordered confidently; the best guess is that they are of equal rank. The order of arms in different groups is known, so each group can be analyzed in isolation. The arms in each group are compared by executing the Budgeted QuickSort algorithm, which leads to shrinking confidence intervals. Two conditions allow the algorithm to terminate: either all groups contain only one arm, at which point the ranking is known, or some groups exist whose arms are \(\epsilon\)-close to each other. In the second case, these arms are assumed to be of equal rank and the ties are broken arbitrarily.
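The grouping step can be sketched as follows; the representation of rank intervals as (lower, upper) pairs and the helper name are assumptions of this illustration, not the duelpy API:

>>> def connected_components(rank_intervals):
...     # Group arms whose rank confidence intervals overlap transitively; arms within a
...     # group cannot be ordered confidently yet, arms in different groups already can.
...     ordered = sorted(rank_intervals, key=lambda item: item[1][0])
...     groups = [[ordered[0][0]]]
...     reachable_upper = ordered[0][1][1]
...     for arm, (lower, upper) in ordered[1:]:
...         if lower <= reachable_upper:  # intervals overlap: same component
...             groups[-1].append(arm)
...         else:
...             groups.append([arm])
...         reachable_upper = max(reachable_upper, upper)
...     return groups
>>> connected_components({"a": (1, 2), "b": (2, 4), "c": (5, 5)}.items())
[['a', 'b'], ['c']]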

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment

  • time_horizon – Optional, the maximum amount of arm comparisons to execute. This may be exceeded, but will always be reached.

  • random_state – Used for the random pivot selection in the Quicksort mode.

  • failure_probability – An upper bound on the acceptable probability to fail, called \(\delta\) in [SBFPH15].

  • epsilon – Acceptable difference to optimum, also called \(\epsilon\) in [SBFPH15].

random_state
preference_estimate

Estimates for arm preferences.

Examples

Find the Ranking in this example:

>>> from duelpy.feedback import MatrixFeedback
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1, 0.1, 0.1],
...     [0.9, 0.5, 0.3, 0.2, 0.2],
...     [0.9, 0.7, 0.5, 0.8, 0.9],
...     [0.9, 0.8, 0.2, 0.5, 0.2],
...     [0.9, 0.8, 0.1, 0.8, 0.5]
... ])
>>> random_state = np.random.RandomState(3)
>>> feedback_mechanism = MatrixFeedback(preference_matrix,
...                                       random_state=random_state)
>>> pl = PlackettLuceAMPR(feedback_mechanism, random_state=random_state, failure_probability=0.9, epsilon=0.1)
>>> pl.run()
>>> comparisons = pl.wrapped_feedback.duels_conducted
>>> ranking = pl.get_ranking()
>>> ranking, comparisons
([2, 4, 3, 1, 0], 1097)
class Bounds(lower_bound: int, upper_bound: int)

Helper class to keep track of lower and upper rank bounds.

union(self, other: duelpy.algorithms.plackett_luce.PlackettLuceAMPR)None

Combine two bounds objects by union.

intersection_test(self, other: duelpy.algorithms.plackett_luce.PlackettLuceAMPR)bool

Check whether two bounds objects intersect.

class Component(arms: List[int], bounds: duelpy.algorithms.plackett_luce.PlackettLuceAMPR)

Helper class to keep track of arm components, i.e. arm subgroups.

merge(self, other: duelpy.algorithms.plackett_luce.PlackettLuceAMPR)None

Merge two Component objects and update the resulting bounds.

explore(self)None

Explore arms by advancing the sorting algorithm.

step(self)None

Execute one step of the algorithm.

get_ranking(self)Optional[List[int]]

Get the computed ranking.

Returns

The ranking, None if it has not been calculated yet.

Return type

Optional[List[int]]

exploration_finished(self)bool

Determine whether the best arm has been found.

exploit(self)None

Run one step of exploitation.

is_finished(self)bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self)None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class PlackettLucePACItem(feedback_mechanism: duelpy.feedback.FeedbackMechanism, time_horizon: Optional[int] = None, random_state: Optional[numpy.random.RandomState] = None, failure_probability: float = 0.05, epsilon: float = 0.1)

Bases: duelpy.algorithms.interfaces.AllApproximateCondorcetProducer, duelpy.algorithms.interfaces.PacAlgorithm

Implementation of the Plackett-Luce PAC-Item algorithm.

This algorithm finds for a given confidence all arms that are \(\epsilon\)-close to the Condorcet winner.

It assumes arms are distributed according to the Plackett-Luce distribution, which assigns a utility to each arm. The utilities of two arms determine the probability of either arm winning against the other. For details on this distribution see [SBFPH15].

The sample complexity of the algorithm is bounded by \(\mathcal{O}\left(\max_{i\neq i^\ast}\frac{1}{\Delta_i^2} \log\left(\frac{N}{\Delta_i \delta}\right)\right)\). Here, \(N\) is the number of arms and \(\Delta_i=\frac{\max\{\epsilon,p_{i^\ast,i}-\frac{1}{2}\}}{2}\), where \(p_{i^\ast,i}\) is the probability of the best arm winning against arm \(i\).

The algorithm repeatedly sorts arms with a comparison budget-constrained QuickSort algorithm and eliminates inferior arms.

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment

  • time_horizon – Optional, the maximum amount of arm comparisons to execute. This may be exceeded, but will always be reached.

  • random_state – Used for the random pivot selection in the Quicksort mode.

  • failure_probability – An upper bound on the acceptable probability to fail, called \(\delta\) in [SBFPH15].

  • epsilon – Acceptable difference to optimum, also called \(\epsilon\) in [SBFPH15].

random_state
preference_estimate

Estimates for arm preferences.

Examples

Find the Condorcet winner in this example:

>>> from duelpy.feedback import MatrixFeedback
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1, 0.1, 0.1],
...     [0.9, 0.5, 0.3, 0.2, 0.2],
...     [0.9, 0.7, 0.5, 0.8, 0.9],
...     [0.9, 0.8, 0.2, 0.5, 0.2],
...     [0.9, 0.8, 0.1, 0.8, 0.5]
... ])
>>> random_state = np.random.RandomState(3)
>>> feedback_mechanism = MatrixFeedback(preference_matrix,
...                                       random_state=random_state)
>>> plackett_luce = PlackettLucePACItem(feedback_mechanism, random_state=random_state, failure_probability=0.1, epsilon=0.01)
>>> plackett_luce.run()
>>> comparisons = plackett_luce.wrapped_feedback.duels_conducted
>>> arm = plackett_luce.get_approximate_condorcet_winners()
>>> arm, comparisons
([2], 476)
explore(self)None

Execute one exploration step.

exploit(self)None

Exploit knowledge by uniformly randomly selecting two of the epsilon-delta Condorcet winners.

exploration_finished(self)bool

Determine whether the best arm has been found.

get_approximate_condorcet_winners(self)Optional[List[int]]

Get the arms that are estimated to be \(\epsilon\)-close to the Condorcet winner.

Returns

The approximate Condorcet winners, or None if they have not been calculated yet.

Return type

Optional[List[int]]

abstract step(self)None

Run one step of the algorithm.

This corresponds to a logical step of the algorithm and may perform multiple comparisons. What exactly a “logical step” is depends on the algorithm.

The feedback mechanism is wrapped with a BudgetedFeedbackMechanism therefore conducting duels within a step may raise an exception if a time horizon is specified.

If you want to run an algorithm you should use the run function instead.

Raises

TimeBudgetExceededException – If the step has been terminated early due to the time horizon. The exception class is local to the object. Different instances raise different exceptions. The exception class for this algorithm can be accessed through the exception_class attribute of the wrapped feedback mechanism.

is_finished(self)bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self)None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class RelativeConfidenceSampling(feedback_mechanism: duelpy.feedback.feedback_mechanism.FeedbackMechanism, time_horizon: int, random_state: Optional[numpy.random.RandomState] = None, exploratory_constant: float = 0.501)

Bases: duelpy.algorithms.interfaces.CondorcetProducer

Implementation of the Relative Confidence Sampling algorithm.

The goal of this algorithm is to find a Condorcet winner while incurring minimal regret.

It is assumed that a Condorcet winner exists.

No regret or sample complexity bounds were established in the source paper [ZWDRM14].

The algorithm proceeds by conducting duels among the candidate arms and eliminating the sub-optimal arms based on the results of the prior duels. After conducting sufficient rounds, the algorithm would always choose the Condorcet winner to conduct a duel with itself. This would result in no more regret and thus the goal would be achieved.

Relative Confidence Sampling works continuously in the following 3 steps:

1. A simulated tournament is conducted among all the arms to obtain a champion. The tournament is based on sampling from a beta distribution. The beta distribution is parameterized on the results of the prior duels among the competing arms. The Condorcet winner of this simulated tournament is selected as the current champion. If no Condorcet winner exists, the arm that has been selected as the champion the least number of times is the new champion.

2. A challenger which has the highest chance of winning against the current champion is obtained using the upper confidence bound. The upper confidence bound is based on the probability estimates of winning against the champion. These probability estimates can be derived for every arm using the results of the prior duels against the champion.

3. A duel is conducted between the champion and the challenger and the result is recorded. (The first two steps are sketched below this list.)
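The following is an illustration of the first two steps rather than the duelpy implementation; it assumes that the duel outcomes are stored in a matrix wins, where wins[i, j] counts the duels arm i has won against arm j, and uses a confidence radius of the form \(\sqrt{\alpha \ln t / n}\) with the exploratory constant \(\alpha\):

>>> import numpy as np
>>> def sample_champion(wins, random_state):
...     # Step 1: sample a plausible preference matrix from Beta posteriors over the duel
...     # outcomes and look for its Condorcet winner.
...     num_arms = wins.shape[0]
...     sampled = np.full((num_arms, num_arms), 0.5)
...     for i in range(num_arms):
...         for j in range(i + 1, num_arms):
...             sampled[i, j] = random_state.beta(wins[i, j] + 1, wins[j, i] + 1)
...             sampled[j, i] = 1 - sampled[i, j]
...     for arm in range(num_arms):
...         if all(sampled[arm, other] > 0.5 for other in range(num_arms) if other != arm):
...             return arm
...     return None  # the algorithm then falls back to the least frequently chosen champion
>>> def ucb_challenger(wins, champion, time_step, alpha=0.501):
...     # Step 2: pick the arm with the largest upper confidence bound on its probability
...     # of beating the current champion.
...     best_arm, best_bound = None, -np.inf
...     for arm in range(wins.shape[0]):
...         if arm == champion:
...             continue
...         duels = wins[arm, champion] + wins[champion, arm]
...         mean = wins[arm, champion] / duels if duels > 0 else 0.5
...         bound = mean + np.sqrt(alpha * np.log(time_step) / duels) if duels > 0 else np.inf
...         if bound > best_bound:
...             best_arm, best_bound = arm, bound
...     return best_arm
>>> wins = np.array([[0, 2, 1], [8, 0, 3], [9, 7, 0]])
>>> ucb_challenger(wins, champion=2, time_step=30)
1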

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • random_state – A numpy random state. Defaults to an unseeded state when not specified.

  • time_horizon – How many comparisons the algorithm should do. This does not impact the decision of the algorithm, only how many times step executes. May be None to indicate an unknown or infinite time horizon.

  • exploratory_constant – A parameter which is used in calculating the upper confidence bounds. The confidence radius grows proportional to the square root of this value. A higher upper confidence bound results in more exploration. Corresponds to \(\alpha\) in [ZWDRM14]. The value of exploratory_constant must be greater than \(0.5\). Default value is 0.501 which has been used in the experiments related to RCS in [ZWDRM14].

wrapped_feedback

The feedback_mechanism parameter with an added decorator. This feedback mechanism will raise an exception if a time horizon is given and a duel would exceed it. The exception is caught in the run function.

exploratory_constant
random_state
time_step

Number of rounds the algorithm has executed.

Examples

Define a preference-based multi-armed bandit problem through a preference matrix:

>>> from duelpy.feedback import MatrixFeedback
>>> from duelpy.stats.metrics import WeakRegret
>>> from duelpy.stats.preference_matrix import PreferenceMatrix
>>> from duelpy.util.feedback_decorators import MetricKeepingFeedbackMechanism
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1],
...     [0.9, 0.5, 0.3],
...     [0.9, 0.7, 0.5]
... ])
>>> arms = list(range(len(preference_matrix)))
>>> random_state = np.random.RandomState(20)
>>> feedback_mechanism = MetricKeepingFeedbackMechanism(
...     MatrixFeedback(preference_matrix, random_state=random_state),
...     metrics={"weak_regret": WeakRegret(preference_matrix)}
... )
>>> rcs = RelativeConfidenceSampling(feedback_mechanism=feedback_mechanism, time_horizon=100, exploratory_constant=0.6, random_state=random_state)
>>> rcs.run()

The best arm in this case is the last arm (index 2)

>>> np.round(np.sum(feedback_mechanism.results["weak_regret"]), 2)
0.8
>>> rcs.get_condorcet_winner()
2
step(self)None

Run one round of the algorithm.

get_condorcet_winner(self)Optional[int]

Determine a Condorcet winner using the RCS algorithm.

Returns

The index of a Condorcet winner, if existent, among the given arms.

Return type

Optional[int]

exploit(self)None

Run one step of exploitation.

is_finished(self)bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self)None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class RelativeUCB(feedback_mechanism: duelpy.feedback.FeedbackMechanism, time_horizon: Optional[int] = None, exploratory_constant: float = 0.51, random_state: Optional[numpy.random.RandomState] = None)

Bases: duelpy.algorithms.interfaces.CondorcetProducer

Implementation of the Relative Upper Confidence Bound algorithm.

The goal of this algorithm is to find the Condorcet winner while incurring minimal regret.

It is assumed that a Condorcet winner exists.

The regret bounds of this algorithm can be found in [ZWMR14].

The algorithm is presented in [ZWMR14]. RUCB is an extension of the Upper Confidence Bound (UCB) algorithm for regular multi-armed bandits. It is motivated by learning from the relative feedback rather than real valued feedback between two arms. It works for both finite as well as for infinite time horizons. The major goals of this algorithm are to minimize cumulative regret over time for the K-armed dueling bandit problem and also return a Condorcet winner.

In each time-step RUCB executes three sub-parts sequentially:

  • Initially, all arms are considered potential champions. All arms are compared pairwise in an optimistic fashion using upper confidence bounds. If the upper confidence bound of an arm against any other arm is less than \(0.5\), then that “loser” is removed from the pool of potential champions. This process continues, and once only one arm is left in the pool, that arm is designated the hypothesized best arm. There is always at most one hypothesized best arm. The hypothesized best arm \(B\) is demoted from its status as soon as it loses against another arm. From the remaining potential champions, a champion arm \(arm_c\) is chosen in one of two ways: if \(B\) is not present, an arm is sampled uniformly at random; if \(B\) is present, it is picked with probability \(\frac{1}{2}\) and the remaining arms are chosen with equal probability.

  • Regular UCB is performed using the potential champion \(arm_c\) as a benchmark: the challenger arm \(arm_d\) (distinct from \(arm_c\)) with the maximal upper confidence bound against \(arm_c\) is selected.

  • The potential champion \(arm_c\) and the challenger arm \(arm_d\) are then compared, the winner is determined and the win count is updated. Finally, the Condorcet winner is returned as the arm with the maximal win count. (The confidence-bound computation is sketched below this list.)
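The confidence-bound computation underlying all three sub-parts can be illustrated as follows; this sketch is not the duelpy implementation and assumes the duel outcomes are stored in a matrix wins, where wins[i, j] counts the duels arm i has won against arm j:

>>> import numpy as np
>>> def upper_confidence_bounds(wins, time_step, alpha=0.51):
...     # Optimistic estimates of the pairwise win probabilities.
...     duels = wins + wins.T
...     with np.errstate(divide="ignore", invalid="ignore"):
...         means = np.where(duels > 0, wins / duels, 0.5)
...         bounds = means + np.sqrt(alpha * np.log(time_step) / duels)
...     bounds[duels == 0] = 1.0  # untried pairs are maximally optimistic
...     np.fill_diagonal(bounds, 0.5)
...     return bounds
>>> wins = np.array([[0, 20, 10], [80, 0, 30], [90, 70, 0]])
>>> bounds = upper_confidence_bounds(wins, time_step=300)
>>> [arm for arm in range(3) if all(bounds[arm] >= 0.5)]  # arms that may still be the champion
[2]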

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • time_horizon – How many comparisons the algorithm should do. This does not impact the decision of the algorithm, only how many steps run executes. May be None to indicate an unknown or infinite time horizon.

  • exploratory_constant – Optional, The confidence radius grows proportional to the square root of this value. Corresponds to \(\alpha\) in [ZWMR14]. The value of exploratory_constant must be greater than \(0.5\). The default value is 0.51.

  • random_state – Optional, used for random choices in the algorithm.

wrapped_feedback

The feedback_mechanism parameter with an added decorator. This feedback mechanism will raise an exception if a time horizon is given and a duel would exceed it. The exception is caught in the run function.

preference_estimate

Estimation of a preference matrix based on samples.

exploratory_constant
random_state

Examples

Define a preference-based multi-armed bandit problem through a preference matrix:

>>> from duelpy.feedback import MatrixFeedback
>>> from duelpy.stats.metrics import AverageRegret
>>> from duelpy.util.feedback_decorators import MetricKeepingFeedbackMechanism
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1],
...     [0.9, 0.5, 0.3],
...     [0.9, 0.7, 0.5]
... ])
>>> random_state = np.random.RandomState(43)
>>> feedback_mechanism = MetricKeepingFeedbackMechanism(
...     MatrixFeedback(preference_matrix, random_state=random_state),
...     metrics={"average_regret": AverageRegret(preference_matrix)}
... )
>>> test_object = RelativeUCB(
...     feedback_mechanism=feedback_mechanism,
...     time_horizon=100,
...     exploratory_constant=0.6,
...     random_state=random_state,
... )
>>> test_object.run()
>>> test_object.get_condorcet_winner()
2
>>> np.round(np.sum(feedback_mechanism.results["average_regret"]), 2)
19.2
get_condorcet_winner(self)Optional[int]

Determine a Condorcet winner using the RUCB algorithm.

Returns

The index of a Condorcet winner, if existent, among the given arms.

Return type

Optional[int]

step(self)None

Run one round of the algorithm.

exploit(self)None

Run one step of exploitation.

is_finished(self)bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self)None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class Rmed1(feedback_mechanism: duelpy.feedback.FeedbackMechanism, time_horizon: int, divergence_tolerance: Optional[Callable[[int], numpy.float]] = None, random_state: Optional[numpy.random.RandomState] = None)

Bases: duelpy.algorithms.algorithm.Algorithm

Implementation of Relative Minimum Empirical Divergence 1 algorithm.

The goal of this algorithm is to minimize regret through the relative comparison of the arms.

This algorithm assumes there exists a Condorcet winner.

The expected regret is upper bounded by \(O(\sum_{i\neq i^*}\frac{\log T}{KL(q_{i^*,i},1/2)})+O(K^{2+\epsilon})\), where \(KL(p, q)\) is the Kullback-Leibler divergence of Bernoulli random variables with parameters \(p\) and \(q\), \(\epsilon>0\) is an exploratory constant (a parameter of the algorithm) and \(q_{i^*,i}\) is the preference probability of the best arm \(i^*\) over arm \(i\). The asymptotically optimal lower regret bound for the K-armed dueling bandit problem is \(\Omega(\sum_{i\neq i^*}\frac{\log T}{KL(q_{i^*,i},1/2)})\). The RMED algorithms, presented in [KHKN15], are based on the empirical divergence, where the empirical divergence of arm \(a_i\) at time \(t\) is \(I_{a_i}(t)=\sum_{a_j:q_{i,j}^t \leq 1/2} n_{i,j}^t KL(q_{i, j}^t,1/2)\). Here \(n_{i,j}^t\) is the number of comparisons of \(a_i\) and \(a_j\) up to time \(t\), and \(KL(q_{i,j}^t, 1/2)\) is the Kullback-Leibler divergence between a Bernoulli distribution with the estimated win probability \(q_{i,j}^t\) and one with parameter \(1/2\). Initially, all unique pairs of arms are compared for exploration. Each unique pair is compared \(L\) times, where \(L=1\) for RMED1. At the end of this initial exploration phase the time step equals \(t=\frac{L K(K-1)}{2}\), where \(K\) is the number of arms. In each subsequent step, the set of remaining arms is reduced to obtain a set of likely Condorcet winners, or a single Condorcet winner.

The second phase maintains three sets of arms: arm_current_loop_set (\(L_C\)) tracks the arms in the current loop, arm_remaining_loop_set (\(L_R\)) contains the arms remaining after the loop, and arm_condorcet_candidate_set (\(L_N\)) contains the Condorcet candidate arms of each loop. In each step, the loop continues until all arms have been selected from \(L_C\). Initially, all arms are present in \(L_C\) and \(L_R\), while \(L_N\) is empty. In each iteration, a likely winner arm is selected from \(L_C\), and a challenger arm is selected for it. For the selection of the challenger, an arm is drawn from the opponent set, i.e. the set of arms whose estimated preference probability against the likely winner is greater than \(0.5\). If the opponent set is empty, the best arm is selected as the challenger; otherwise, the arm with the minimum preference estimate is selected. The best arm is the arm with the minimum empirical divergence among all arms at the current time step. The likely winner and the challenger are then dueled, and the likely winner is removed from \(L_R\). It is then verified whether the likely winner is a potential Condorcet winner; if so, it is added to the Condorcet candidate set \(L_N\) and the loop continues. After the end of the loop, the active arms (\(L_C\) and \(L_R\)) are set to the Condorcet candidates of the current loop (\(L_N\)) while the candidate set is reset for the next iteration. If the time horizon has not yet been reached, the algorithm takes the next step with the updated values of \(L_C\), \(L_R\) and \(L_N\).

The K-armed dueling bandit problem has a regret lower bound of \(\Omega(K \log T)\). In the paper [KHKN15], the algorithm has been shown to reach this lower bound under the Condorcet assumption, even up to the constant factor. RMED is the first asymptotically optimal algorithm for the dueling bandit problem.

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • time_horizon – The number of comparisons to be made before the algorithm terminates.

  • random_state – Optional, used for random choices in the algorithm.

  • divergence_tolerance – Optional, any non-negative function \(f(K)\) that is independent of the timestep, where \(K\) is the number of arms. It is used to determine the size of the allowed “empirical divergence” gap between the currently estimated best arm and the remaining candidate arms. The “tolerance” or “empirical divergence” gap does not only depend on this function, but also on the current time step. For more details refer to equation 4 in section 3.1 of [KHKN15]. The default value \(f(K)=0.3 K^{1.01}\) is taken from the experiments in the paper.

preference_estimate

Estimation of a preference matrix based on samples.

feedback_mechanism
random_state

Examples

Define a preference-based multi-armed bandit problem through a preference matrix:

>>> from duelpy.feedback import MatrixFeedback
>>> from duelpy.stats.metrics import AverageRegret
>>> from duelpy.util.feedback_decorators import MetricKeepingFeedbackMechanism
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1],
...     [0.9, 0.5, 0.3],
...     [0.9, 0.7, 0.5]
... ])
>>> random_state = np.random.RandomState(43)
>>> feedback_mechanism = MetricKeepingFeedbackMechanism(
...     MatrixFeedback(preference_matrix, random_state=random_state),
...     metrics={"average_regret": AverageRegret(preference_matrix)}
... )
>>> divergence_tolerance = lambda num_arms: 0.3 * np.power(num_arms, 1.01)
>>> test_object = Rmed1(
...     feedback_mechanism,
...     divergence_tolerance=divergence_tolerance,
...     random_state=random_state,
...     time_horizon=100
... )
>>> test_object.run()
>>> np.round(np.sum(feedback_mechanism.results["average_regret"]), 2)
20.8
step(self)None

Run one step of the algorithm, after the initial pairs of arms have been drawn for comparison.

is_finished(self)bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self)None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class Rmed2(feedback_mechanism: duelpy.feedback.FeedbackMechanism, time_horizon: int, divergence_tolerance: Optional[Callable[[int], numpy.float]] = None, random_state: Optional[numpy.random.RandomState] = None, exploratory_constant: float = 3)

Bases: duelpy.algorithms.rmed.Rmed1

Implementation of Relative Minimum Empirical Divergence 2 algorithm.

The goal of this algorithm is to minimize regret through the relative comparison of the arms.

This algorithm assumes there exists a Condorcet winner. The preference matrix is estimated through a series of relative comparisons.

The expected regret is upper bounded by \(O(\sum_{i\neq i^*}\frac{\log T}{KL(q_{i^*,i},1/2)}) + O(K^{2+\epsilon})\), where \(KL(p,q)\) is the Kullback-Leibler divergence of Bernoulli random variables with parameters \(p\) and \(q\), \(\epsilon>0\) is the exploratory constant, a parameter of the algorithm, and \(q_{i^*,i}\) is the preference probability of the best arm (\(i^*\)) compared to any other arm (\(i\)). The asymptotically optimal regret lower bound for the K-armed dueling bandit problem is \(\Omega(\sum_{i\neq i^*}\frac{\log T}{KL(q_{i^*,i}, 1/2)})\).

The Rmed2 algorithm, presented in the paper [KHKN15], works similarly to the Rmed1 algorithm. It differs slightly in the explore phase: initially, each unique pair of arms is dueled \(L\) times, where \(L=\lceil \alpha \log{\log{T}} \rceil\) and \(T\) is the time horizon. Each algorithm step consists of an explore and an exploit phase. In the explore phase, a pair of arms is drawn again as long as its number of duels is at most \(\alpha \log{\log{t}}\), where \(t\) is the current time step. The exploit phase is similar to Rmed1; the only difference is the selection of the challenger arm. The concept of divergence plus is introduced: the challenger is the arm from the arm pool which has the minimum divergence plus with the selected likely winner arm. If this arm meets the condition to be a challenger, it becomes the comparison target; otherwise the challenger is selected as in the Rmed1 algorithm.

To match the asymptotic regret lower bound up to constant factors, Rmed1 is modified into the Rmed2 algorithm. Rmed2 uses the same subroutine as Rmed1, but differs in how the comparison target for the first chosen arm is selected. Rmed2 tries to select the arm that is most likely to be the Condorcet winner in most of the algorithm’s rounds/steps. It also explores in order to reduce regret, but the regret may increase if it fails to estimate the real Condorcet winner.
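The number of initial duels per arm pair, \(L=\lceil \alpha \log{\log{T}} \rceil\), can be computed with a small helper like the one sketched below; the guard for very short horizons (where \(\log\log T\) is not positive) is an added assumption for illustration only.

import math

def initial_duels_per_pair(exploratory_constant: float, time_horizon: int) -> int:
    """Return L, the number of times each unique arm pair is dueled initially in Rmed2."""
    # Guard against very short horizons where log(log(T)) would be non-positive (assumption).
    if time_horizon <= math.e:
        return 1
    return max(1, math.ceil(exploratory_constant * math.log(math.log(time_horizon))))

# With the default exploratory_constant of 3 and a horizon of 100 duels:
print(initial_duels_per_pair(3, 100))  # ceil(3 * log(log(100))) == 5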

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • time_horizon – The number of comparisons to be made before the algorithm terminates.

  • exploratory_constant – Optional, the exploratory constant is related to the number of times each arm pair is compared. Corresponds to \(\alpha\) in the paper. The value of exploratory_constant must be greater than \(0.5\). The default value is 3.

  • random_state – Optional, used for random choices in the algorithm.

  • divergence_tolerance – Optional, any non-negative function \(f(K)\) that is independent of the timestep, where \(K\) is the number of arms. It is used to determine the size of the allowed “empirical divergence” gap between the currently estimated best arm and the remaining candidate arms. The “tolerance” or “empirical divergence” gap does not only depend on this function, but also on the current time step. For more details refer to equation 4 in section 3.1 of [KHKN15]. The default value \(f(K)=0.3 K^{1.01}\) is taken from the experiments in the paper.

preference_estimate

Estimation of a preference matrix based on samples.

feedback_mechanism
exploratory_constant
random_state

Examples

Define a preference-based multi-armed bandit problem through a preference matrix:

>>> from duelpy.feedback import MatrixFeedback
>>> from duelpy.stats.metrics import AverageRegret
>>> from duelpy.util.feedback_decorators import MetricKeepingFeedbackMechanism
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1],
...     [0.9, 0.5, 0.3],
...     [0.9, 0.7, 0.5]
... ])
>>> random_state = np.random.RandomState(43)
>>> feedback_mechanism = MetricKeepingFeedbackMechanism(
...     MatrixFeedback(preference_matrix, random_state=random_state),
...     metrics={"average_regret": AverageRegret(preference_matrix)}
... )
>>> divergence_tolerance = lambda num_arms: 0.3 * np.power(num_arms, 1.01)
>>> test_object = Rmed2(
...     feedback_mechanism,
...     divergence_tolerance=divergence_tolerance,
...     random_state=random_state,
...     time_horizon=100
... )
>>> test_object.run()
>>> np.round(np.sum(feedback_mechanism.results["average_regret"]), 2)
21.4
empirical_divergence_plus(self, arm_i: int, arm_j: int)float

Calculate the empirical divergence plus of an arm with respect to a reference arm.

Parameters
  • arm_i – The reference arm.

  • arm_j – The arm whose empirical divergence is to be calculated.

Returns

The empirical divergence plus.

Return type

float

step(self)None

Run one step of the algorithm, after the initial pairs of arms have been drawn for comparison.

is_finished(self)bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self)None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class Rmed2FH(feedback_mechanism: duelpy.feedback.FeedbackMechanism, time_horizon: int, divergence_tolerance: Optional[Callable[[int], numpy.float]] = None, random_state: Optional[numpy.random.RandomState] = None, exploratory_constant: float = 3)

Bases: duelpy.algorithms.rmed.Rmed2

Implementation of Relative Minimum Empirical Divergence 2 Fixed Horizon algorithm.

The goal of this algorithm is to minimize regret through the relative comparison of the arms.

This algorithm assumes there exists a Condorcet winner. The preference matrix is estimated through a series of relative comparisons.

The expected regret is upper bounded by \(O(\sum_{i\neq i^*}\frac{\log T}{KL(q_{i^*,i},1/2)})+O(K^{2+\epsilon})\), where \(KL(p,q)\) is the Kullback-Leibler divergence of Bernoulli random variables with parameters \(p\) and \(q\), \(\epsilon>0\) is the exploratory constant, a parameter of the algorithm, and \(q_{i^*,i}\) is the preference probability of the best arm (\(i^*\)) compared to any other arm (\(i\)). The asymptotically optimal regret lower bound for the K-armed dueling bandit problem is \(\Omega(\sum_{i\neq i^*}\frac{\log T}{KL(q_{i^*,i},1/2)})\).

Rmed2FH (RMED2 Fixed Horizon) is the static version of RMED2, presented in the paper [KHKN15]. RMED2FH is shown to be an asymptotically optimal algorithm under the Condorcet assumption. It works similarly to the Rmed2 algorithm but differs slightly in the explore phase: initially, each unique pair of arms is dueled \(L\) times with \(L=1\), as in Rmed1, and each subsequent algorithm step consists only of an exploit phase. The exploit phase is similar to Rmed1; the only difference is the selection of the challenger arm, which is fixed for each likely winner arm. For each likely winner arm, the challenger is the arm from the arm pool which has the minimum divergence plus with the likely winner arm, and it is fixed after the initial explore phase. If this arm meets the condition to be a challenger, it becomes the comparison target; otherwise the challenger is selected as in the Rmed1 algorithm.

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • time_horizon – The number of comparisons to be made before the algorithm terminates.

  • exploratory_constant – Optional, the exploratory constant is related to the number of times each arm pair is compared. Corresponds to \(\alpha\) in the paper. The value of exploratory_constant must be greater than \(0.5\). The default value is 3.

  • random_state – Optional, used for random choices in the algorithm.

  • divergence_tolerance – Optional, any non-negative function \(f(K)\) that is independent of the timestep, where \(K\) is the number of arms. It is used to determine the size of the allowed “empirical divergence” gap between the currently estimated best arm and the remaining candidate arms. The “tolerance” or “empirical divergence” gap does not only depend on this function, but also on the current time step. For more details refer to equation 4 in section 3.1 of [KHKN15]. The default value \(f(K)=0.3 K^{1.01}\) is taken from the experiments in the paper.

preference_estimate

Estimation of a preference matrix based on samples.

feedback_mechanism
exploratory_constant
random_state

Examples

Define a preference-based multi-armed bandit problem through a preference matrix:

>>> from duelpy.feedback import MatrixFeedback
>>> from duelpy.stats.metrics import AverageRegret
>>> from duelpy.util.feedback_decorators import MetricKeepingFeedbackMechanism
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1],
...     [0.9, 0.5, 0.3],
...     [0.9, 0.7, 0.5]
... ])
>>> random_state = np.random.RandomState(43)
>>> feedback_mechanism = MetricKeepingFeedbackMechanism(
...     MatrixFeedback(preference_matrix, random_state=random_state),
...     metrics={"average_regret": AverageRegret(preference_matrix)}
... )
>>> divergence_tolerance= lambda num_arms: 0.3 * np.power(num_arms, 1.01)
>>> test_object = Rmed2FH(
...     feedback_mechanism,
...     divergence_tolerance=divergence_tolerance,
...     random_state=random_state,
...     time_horizon=100
... )
>>> test_object.run()
>>> np.round(np.sum(feedback_mechanism.results["average_regret"]), 2)
21.8
empirical_divergence_plus(self, arm_i: int, arm_j: int)float

Calculate the empirical divergence plus of an arm with respect to a reference arm.

Parameters
  • arm_i – The reference arm.

  • arm_j – The arm whose empirical divergence is to be calculated.

Returns

The empirical divergence plus.

Return type

float

step(self)None

Run one step of the algorithm, after the initial pairs of arms have been drawn for comparison.

is_finished(self)bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self)None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class Savage(feedback_mechanism: duelpy.feedback.FeedbackMechanism, failure_probability: float = 0.1, time_horizon: Optional[int] = None)

Bases: duelpy.algorithms.interfaces.SingleCopelandProducer, duelpy.algorithms.interfaces.PacAlgorithm

Determine the PAC-best arm with the SAVAGE algorithm.

This algorithm makes no assumptions about the environment.

The sample complexity is bounded by \(\sum_{i=1}^N \mathcal{O}\left(\frac{\log\left(\frac{NT}{\delta}\right)}{\Delta_i^2}\right)\) if the time horizon \(T\) is finite and \(\sum_{i=1}^N \mathcal{O}\left(\frac{\log\left(\frac{N}{\delta\Delta_i}\right)}{\Delta_i^2}\right)\) otherwise.

SAVAGE is a general algorithm that can infer some information about an environment from samples. It works by repeatedly sampling possible environments (in the case of PB-MAB, an environment is specified by a preference matrix) and eliminating

  • those environment candidates that fall outside of the current confidence interval (for example the sets of preference matrices that would make our previous samples too unlikely) and

  • those environment variables (preference matrix entries) that are no longer relevant on the current environment candidates (for example the arms that cannot be the Copeland winner). See Figure 1 in [UCFN13] for an illustration. In this case \(\mu\) is the preference matrix while \(x_1\) and \(x_2\) are two entries of the matrix (without loss of generality it is sufficient to estimate the upper-right triangle of the matrix). If we already know that arm i is strictly better than arm j, it is no longer necessary to keep comparing this pair and we can stop trying to improve our estimate of \(q_{ij}\).

The environment parameters in the PB-MAB case are the entries of the upper triangle of the preference matrix. The goal is to design a sequence of pairwise experiments (duels, i.e. samples of random variables) to find the best arm according to a ranking procedure. This setting is called voting bandits since a pairwise election criterion is used to find the best arm: beating more arms for the Copeland winner, or a better expectation for the Borda winner.
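As an illustration of the elimination idea (not the exact confidence bound used in [UCFN13]), one can attach a Hoeffding-style confidence interval to each preference matrix entry and stop sampling a pair once the interval no longer contains 1/2. The names wins, num_duels and num_parameters below are hypothetical.

import math

def confidence_interval(wins: int, num_duels: int, failure_probability: float,
                        num_parameters: int) -> tuple:
    """Two-sided confidence interval for one preference probability (Hoeffding-style sketch)."""
    mean = wins / num_duels
    # Union bound over all estimated entries; the paper derives its radius differently.
    radius = math.sqrt(math.log(2 * num_parameters / failure_probability) / (2 * num_duels))
    return max(0.0, mean - radius), min(1.0, mean + radius)

# A pair is "decided" once 0.5 lies outside the interval: the direction of the
# preference is then known and further duels for this pair can be skipped.
lower, upper = confidence_interval(wins=41, num_duels=50, failure_probability=0.1, num_parameters=3)
print(lower > 0.5 or upper < 0.5)  # True for this sample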

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment

  • failure_probability – Upper bound on the probability of failure (the \(\delta\) in \((\epsilon, \delta)\)-PAC).

  • time_horizon – The number of steps that the algorithm is supposed to be run. Specify None for an infinite time horizon.

wrapped_feedback

The feedback_mechanism parameter with an added decorator. This feedback mechanism will raise an exception if a time horizon is given and a duel would exceed it. The exception is caught in the run function.

failure_probability
preference_estimate

The current estimate of the preference matrix.

Examples

Define a preference-based multi-armed bandit problem through a preference matrix:

>>> from duelpy.feedback import MatrixFeedback
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1],
...     [0.9, 0.5, 0.3],
...     [0.9, 0.7, 0.5],
... ])
>>> feedback_mechanism = MatrixFeedback(preference_matrix, random_state=np.random.RandomState(42))

Obviously, the last arm (index 2) is expected to win against the most other arms. That makes it the Copeland winner, as SAVAGE is correctly able to determine:

>>> algorithm = Savage(feedback_mechanism)
>>> algorithm.run()
>>> algorithm.get_copeland_winner()
2
copeland_independence_test(self, arm_pair: Tuple[int, int])bool

Test if the result of a duel can still influence our estimate of the Copeland winner.

This corresponds to the “IndepTest” in the paper.

Parameters

arm_pair – The pair of arms in question.

Returns

False if more information about the arm pair is still needed. True if the Copeland estimation does not depend on further information.

Return type

bool

explore(self)None

Run one step of exploration.

exploration_finished(self)bool

Determine whether the exploration phase is finished.

If no time horizon is provided, this coincides with is_finished. Once this function returns True, the algorithm will have finished computing a PAC Copeland winner.

get_copeland_winner(self)Optional[int]

Find a Copeland winner with the SAVAGE algorithm.

Note that only the correctness of the returned Copeland winner is covered by the failure probability; if several arms are candidates, the probability that all of them are actually Copeland winners is lower.

Returns

The index of a \(\delta\)-PAC best (Copeland) arm, or None if the algorithm has not concluded. The \(\delta\) failure probability refers to the returned arm individually.

Return type

Optional[int]

exploit(self)None

Run one step of exploitation.

abstract step(self)None

Run one step of the algorithm.

This corresponds to a logical step of the algorithm and may perform multiple comparisons. What exactly a “logical step” is depends on the algorithm.

The feedback mechanism is wrapped with a BudgetedFeedbackMechanism therefore conducting duels within a step may raise an exception if a time horizon is specified.

If you want to run an algorithm you should use the run function instead.

Raises

TimeBudgetExceededException – If the step has been terminated early due to the time horizon. The exception class is local to the object. Different instances raise different exceptions. The exception class for this algorithm can be accessed through the exception_class attribute of the wrapped feedback mechanism.

is_finished(self)bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self)None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class ScalableCopelandBandits(feedback_mechanism: duelpy.feedback.feedback_mechanism.FeedbackMechanism, time_horizon: int, random_state: Optional[numpy.random.RandomState] = None)

Bases: duelpy.algorithms.algorithm.Algorithm

An implementation of the Scalable Copeland Bandits (SCB) algorithm.

The goal of the algorithm is to minimize the Copeland regret.

It is assumed that there are no ties between arms, i.e. the probability of any arm winning against another is never \(\frac{1}{2}\).

The bound on the expected regret is given as \(\mathcal{O}\left(\frac{N(L_C+\ln(N)) \ln(T)}{\Delta_\min^2}\right)\). \(N\) is the number of arms, \(T\) is the time horizon. \(C\) is the number of Copeland winners and \(L_C\) is the number of arms against which a Copeland winner will lose in expectation. For any Copeland winner and any non-Copeland winner, \(\Delta\) is the smallest absolute distance to \(\frac{1}{2}\) in the probability of these arms dueling against each other. \(\Delta_\min\) is the smallest \(\Delta\) value for the given set of arms such that \(\Delta_\min \ne 0\). Note that the paper uses a different definition of Copeland regret; in this library the value is half of that in the paper.

The Scalable Copeland Bandits algorithm is based on [ZKWdR15]. SCB performs well for a large number of arms (\(500\) or more); refer to [ZKWdR15] for more details. It proceeds by conducting duels which are most informative about the precedence of the participating arms in terms of their Copeland scores. The confirmed non-Copeland winners are eliminated based on the results of the prior duels. After conducting sufficient rounds, the set of possible Copeland winners will converge which will result in minimal increment in Copeland regret and thus the goal would be achieved.

This algorithm uses a KL-divergence based PAC algorithm as a subroutine to determine a Copeland winner. The subroutine is based on Algorithm 2 and Algorithm 4 in [ZKWdR15]. An additional termination condition is used in this implementation: it stops the exploration phase of the subroutine when there is only one Copeland winner candidate left. This condition is introduced to ensure that the KL-divergence based PAC algorithm starts the exploitation phase (i.e. dueling a PAC-Copeland winner against itself) earlier in some cases. Otherwise, the algorithm might continue to duel the one remaining candidate against random opponents, which could incur a higher regret.

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • time_horizon – Number of time steps to execute for.

  • random_state – A numpy random state. Defaults to an unseeded state when not specified.

rounds

Number of rounds the algorithm has executed. Corresponds to \(r\) in [ZKWdR15].

preference_estimate

The current estimate of the preference matrix.

time_budget

The number of comparisons allowed in each round. This quantity is dependent on the current round, hence varies with each round. Corresponds to \(T\) in [ZKWdR15].

copeland_winner

The estimated Copeland winner.

Examples

Define a preference-based multi-armed bandit problem through a preference matrix:

>>> from duelpy.feedback import MatrixFeedback
>>> from duelpy.stats.metrics import AverageCopelandRegret
>>> from duelpy.util.feedback_decorators import MetricKeepingFeedbackMechanism
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1],
...     [0.9, 0.5, 0.3],
...     [0.9, 0.7, 0.5]
... ])
>>> arms = list(range(len(preference_matrix)))
>>> random_state = np.random.RandomState(20)
>>> feedback_mechanism = MetricKeepingFeedbackMechanism(
...     MatrixFeedback(preference_matrix, arms, random_state=random_state),
...     metrics={"copeland_regret": AverageCopelandRegret(preference_matrix)})
>>> scb = ScalableCopelandBandits(feedback_mechanism=feedback_mechanism, time_horizon=1000,
...     random_state=random_state)
>>> scb.run()
>>> np.round(np.sum(feedback_mechanism.results["copeland_regret"]), 2)
76.0
>>> scb.wrapped_feedback.duels_conducted
1000
step(self)None

Run one round of the algorithm.

is_finished(self)bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self)None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class SequentialElimination(feedback_mechanism: duelpy.feedback.FeedbackMechanism, random_state: numpy.random.RandomState = None, time_horizon: Optional[int] = None, failure_probability: float = 0.1, epsilon_lower: float = 0.0, epsilon_upper: float = 0.05, arms_subset: Optional[List] = None, anchor_arm: Optional[int] = None)

Bases: duelpy.algorithms.interfaces.SingleCopelandProducer, duelpy.algorithms.interfaces.PacAlgorithm

Implement the Sequential Elimination algorithm.

The goal of this algorithm is to find an \(\epsilon\)-maximum arm.

The algorithm computes a PAC estimation of the Copeland winner. An arm is \(\epsilon\)-maximum (where \(\epsilon = \epsilon_u-\epsilon_l\)), if it is preferable to other arms with probability at least \(0.5-\epsilon\).

If the anchor arm provided to the algorithm is a good anchor element, then there are only \(m\) arms to which the anchor is not \(\epsilon_l\) preferable. All other arms will be eliminated, and among these \(m\) arms there can be at most \(m\) changes of the anchor element. Thus there can be at most \(m\) rounds, and the total number of comparison rounds is bounded by \(\mathcal{O}(N + m^2)\), where \(N\) is the number of arms.

This PAC algorithm therefore restricts the expensive comparisons to the at most \(m\) arms to which the anchor is not \(\epsilon_l\) preferable; the remaining \(N-m\) arms, to which the anchor is \(\epsilon_l\) preferable, are eliminated with a comparison complexity of \(\mathcal{O}(N)\).

Refer to the paper [FHO+17].
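The anchor-based elimination can be illustrated with the simplified sketch below. It replaces the confidence-based comparison schedule of [FHO+17] with a fixed number of simulated duels per pair, so it only shows the control flow, not the actual algorithm.

import numpy as np

def sequential_elimination_sketch(preference_matrix: np.ndarray, anchor_arm: int,
                                  epsilon_lower: float = 0.0, duels_per_pair: int = 200,
                                  random_state: np.random.RandomState = None) -> int:
    """Scan the arms once; an arm that beats the current anchor replaces it."""
    rng = random_state if random_state is not None else np.random.RandomState()
    for arm in range(len(preference_matrix)):
        if arm == anchor_arm:
            continue
        # Estimate P(arm beats anchor) from repeated simulated duels.
        arm_wins = rng.binomial(duels_per_pair, preference_matrix[arm, anchor_arm])
        estimate = arm_wins / duels_per_pair
        if estimate > 0.5 + epsilon_lower:
            # The anchor is not epsilon_l preferable to this arm, so the arm becomes the new anchor.
            anchor_arm = arm
        # Otherwise the arm is eliminated and never considered again.
    return anchor_arm

preference_matrix = np.array([[0.5, 0.1, 0.1], [0.9, 0.5, 0.3], [0.9, 0.7, 0.5]])
print(sequential_elimination_sketch(preference_matrix, anchor_arm=0,
                                    random_state=np.random.RandomState(42)))  # almost surely 2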

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • time_horizon – The number of steps that the algorithm is supposed to be run. Specify None for an infinite time horizon.

  • failure_probability – Determines the number of iterations for which each pair of arms is compared. Corresponds to \(\delta\) in [FHO+17]. The default value is 0.1, as given in section 6.

  • epsilon_lower – Default value is 0.0. Refer to section 3.1.1 in [FHO+17].

  • epsilon_upper – Corresponds to \(\epsilon\) in [FHO+17]; the default value is 0.05. Refer to section 3.1.1 in [FHO+17].

  • arms_subset – The subset of the arms fetched from feedback_mechanism that the algorithm should operate on, for example as provided by another algorithm that uses this one as a subroutine.

  • anchor_arm – If none is provided, it is selected randomly from the list of arms provided to the algorithm. Otherwise, it represents the anchor arm extracted from feedback_mechanism.get_arms(). A good anchor element is an arm for which every other arm (being \(\epsilon_l\) preferable) is deemed worse and gets eliminated.

wrapped_feedback

The feedback_mechanism parameter with an added decorator. This feedback mechanism will raise an exception if a time horizon is given and a duel would exceed it. The exception is caught in the run function.

failure_probability
preference_estimate
Raises

ValueError – Raised when the value of upper epsilon is not greater than lower epsilon.

Examples

Define a preference-based multi-armed bandit problem through a preference matrix:

>>> from duelpy.feedback import MatrixFeedback
>>> import numpy as np
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1],
...     [0.9, 0.5, 0.5],
...     [0.9, 0.5, 0.5],
... ])
>>> feedback_mechanism = MatrixFeedback(preference_matrix, random_state=np.random.RandomState())
>>> sequential_elimination = SequentialElimination(feedback_mechanism, random_state=np.random.RandomState())
>>> sequential_elimination.run()
>>> preferred_arms = [1,2]
>>> best_arm = sequential_elimination.get_copeland_winner()
>>> best_arm in preferred_arms
True
explore(self)None

Compare the current anchor arm against a randomly selected arm.

The anchor arm is updated with the arm beating the current anchor arm and the new anchor arm is compared against the remaining arms step by step. Refer to section 3.1.1 in paper [FHO+17].

exploration_finished(self)bool

Determine whether the exploration phase is finished.

If no time horizon is provided, this coincides with is_finished. Once this function returns True, the algorithm will have finished computing a PAC Copeland winner.

get_copeland_winner(self)Optional[int]

Return the Copeland winner arm selected by the algorithm.

Returns

  • None – If the algorithm has not concluded.

  • int – If the algorithm has found the winner.

exploit(self)None

Run one step of exploitation.

abstract step(self)None

Run one step of the algorithm.

This corresponds to a logical step of the algorithm and may perform multiple comparisons. What exactly a “logical step” is depends on the algorithm.

The feedback mechanism is wrapped with a BudgetedFeedbackMechanism therefore conducting duels within a step may raise an exception if a time horizon is specified.

If you want to run an algorithm you should use the run function instead.

Raises

TimeBudgetExceededException – If the step has been terminated early due to the time horizon. The exception class is local to the object. Different instances raise different exceptions. The exception class for this algorithm can be accessed through the exception_class attribute of the wrapped feedback mechanism.

is_finished(self)bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self)None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class SingleEliminationTop1Select(feedback_mechanism: duelpy.feedback.FeedbackMechanism, time_horizon: Optional[int] = None, preference_separation: Optional[float] = None, duels_per_comparison: Optional[int] = None, arms_subset: Optional[List[int]] = None, preference_estimate: Optional[duelpy.stats.preference_estimate.PreferenceEstimate] = None, epsilon: float = 0.01)

Bases: duelpy.algorithms.interfaces.CondorcetProducer, duelpy.algorithms.interfaces.PacAlgorithm

The Top-1 Selection part of Single-Elimination Tournament.

The goal of this algorithm is to find the top (Rank = 1) arm while minimizing the sample complexity.

A total order over arms and strong stochastic transitivity are assumed.

The number of pairwise comparisons made by the algorithm is bounded by \(O\left(\frac{ N \log\log N}{\Delta_{1}}\right)\), where \(N\) is the number of arms and \(\Delta_{1}\) is the preference separation.

The algorithm was originally introduced in [MSE17]. The tournament proceeds in layers: in every layer the arms are paired randomly, one arm of each pair is selected by means of pairwise comparisons between the two arms, and the other arm is eliminated. Since the outcome of a single duel is random, each comparison is repeated \(m\) times (\(m\) in [MSE17]), which yields an empirical estimate of the duel outcome. With a sufficiently large number of binary comparisons (a larger \(m\) in [MSE17]), the algorithm returns the top-1 arm.
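The layered elimination can be sketched as follows, assuming the duels are simulated directly from a known preference matrix; duels_per_comparison plays the role of \(m\). This is an illustration of the tournament structure, not the library implementation.

import numpy as np

def single_elimination_top1(preference_matrix: np.ndarray, duels_per_comparison: int = 100,
                            random_state: np.random.RandomState = None) -> int:
    rng = random_state if random_state is not None else np.random.RandomState()
    remaining = list(range(len(preference_matrix)))
    while len(remaining) > 1:
        rng.shuffle(remaining)  # pair the surviving arms randomly in each layer
        next_layer = []
        if len(remaining) % 2 == 1:
            next_layer.append(remaining.pop())  # an odd arm advances without a duel
        for first, second in zip(remaining[0::2], remaining[1::2]):
            # Repeat the duel m times and keep the arm that wins the majority.
            wins = rng.binomial(duels_per_comparison, preference_matrix[first, second])
            next_layer.append(first if wins > duels_per_comparison / 2 else second)
        remaining = next_layer
    return remaining[0]

preference_matrix = np.array([[0.5, 0.1, 0.1], [0.9, 0.5, 0.3], [0.9, 0.7, 0.5]])
print(single_elimination_top1(preference_matrix,
                              random_state=np.random.RandomState(3)))  # almost surely 2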

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • preference_separation – The assumed preference separation between the best arm and all other arms. Assumed to be 0.01 if neither preference_separation nor duels_per_comparison is specified. Corresponds to \(\Delta_{1,S}\) in [MSE17].

  • duels_per_comparison – Number of times each comparison is repeated before coming to a conclusion. If this is not specified, an optimal value that guarantees the assumptions in the paper is computed from preference_separation. See [MSE17] for more details. Corresponds to \(m\) in [MSE17].

  • epsilon\(\epsilon\) in \((\epsilon, \delta)\)-PAC algorithms, given by the user.

  • arms_subset – The set of arms given to the algorithm by other algorithms; otherwise the arms from feedback_mechanism are used.

  • preference_estimate – A PreferenceEstimate object is needed if this algorithm is used as a subroutine and the result is required to be stored in further rounds. The default value is None.

wrapped_feedback

The feedback_mechanism parameter with an added decorator. This feedback mechanism will raise an exception if a time horizon is given and a duel would exceed it. The exception is caught in the run function.

preference_estimate

Estimation of a preference matrix based on samples.

condorcet_winner

Top-1 arm in the given set of arms.

duels_per_comparison

Examples

Define a preference-based multi-armed bandit problem through a preference matrix:

>>> from duelpy.feedback import MatrixFeedback
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1],
...     [0.9, 0.5, 0.3],
...     [0.9, 0.7, 0.5],
... ])
>>> random_state=np.random.RandomState(3)
>>> feedback_mechanism = MatrixFeedback(preference_matrix, random_state=random_state)
>>> single_elimination_tournament = SingleEliminationTop1Select(feedback_mechanism)
>>> single_elimination_tournament.run()
>>> condorcet_winner = single_elimination_tournament.get_condorcet_winner()
>>> condorcet_winner
2
explore(self)None

Run one step of exploration.

exploration_finished(self)bool

Determine whether the exploration phase is finished.

Once this function returns True, the algorithm will have finished computing a PAC Condorcet winner.

get_condorcet_winner(self)Optional[int]

Return the estimated PAC-Condorcet winner.

Returns

The Condorcet winner in the set of arms given to the algorithm.

Return type

int

exploit(self)None

Run one step of exploitation.

abstract step(self)None

Run one step of the algorithm.

This corresponds to a logical step of the algorithm and may perform multiple comparisons. What exactly a “logical step” is depends on the algorithm.

The feedback mechanism is wrapped with a BudgetedFeedbackMechanism therefore conducting duels within a step may raise an exception if a time horizon is specified.

If you want to run an algorithm you should use the run function instead.

Raises

TimeBudgetExceededException – If the step has been terminated early due to the time horizon. The exception class is local to the object. Different instances raise different exceptions. The exception class for this algorithm can be accessed through the exception_class attribute of the wrapped feedback mechanism.

is_finished(self)bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self)None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class SingleEliminationTopKSorting(feedback_mechanism: duelpy.feedback.FeedbackMechanism, k_top_ranked: Optional[int] = None, time_horizon: Optional[int] = None, preference_separation: Optional[float] = None, duels_per_comparison: Optional[int] = None, epsilon: float = 0.01)

Bases: duelpy.algorithms.interfaces.PartialRankingProducer, duelpy.algorithms.interfaces.PacAlgorithm

Implements the top-k sorting algorithm in the Single-Elimination Tournament.

The goal of this algorithm is to find the top-k arms while minimizing the sample complexity.

The algorithm assumes a total order over the arms.

The algorithm has sample complexity of \(\mathcal{O}\left(\frac{(N+k \log k) \max \{\log k, \log \log N\}}{\Delta_{k}}\right)\) where \(\Delta_{k}=\min _{i \in[k]} \min _{j: j \geq i} \Delta_{i, j}^{2}\) in the case of top-k ranking and \(\Delta_{k}=\Delta_{k, k+1}^{2}\) in the case of top-k identification. \(N\) is the number of arms.

The algorithm divides the set of arms into \(k\) sub-groups, each of size \(\frac{N}{k}\). From every sub-group a top arm is selected using the top-1 selection algorithm, and the winners are collected in a short list. A (max-) heap is built from the short list, so that the top arm is the root element of the heap. The top arm is then removed from the short list. To find the second best arm, the home sub-group from which the previous top arm was taken is accessed again, its best remaining arm is identified and added to the short list, and the heap is updated. This process of identifying and removing is repeated \(k - 1\) times, until all top-k arms are identified. See Algorithm 2 in [MSE17] for more details.
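A highly simplified sketch of this structure is shown below. For brevity it uses a noiseless comparison (reading the preference matrix directly) and a linear scan over the short list instead of the repeated duels and the heap used in [MSE17]; the helper names are ours.

import numpy as np

def beats(preference_matrix: np.ndarray, first: int, second: int) -> bool:
    """Noiseless comparison: read the true preference instead of dueling repeatedly."""
    return preference_matrix[first, second] > 0.5

def group_winner(preference_matrix: np.ndarray, group: list) -> int:
    winner = group[0]
    for arm in group[1:]:
        if beats(preference_matrix, arm, winner):
            winner = arm
    return winner

def top_k_sorting_sketch(preference_matrix: np.ndarray, k: int = 2) -> list:
    arms = list(range(len(preference_matrix)))
    # Split the arms into k sub-groups of roughly equal size.
    remaining = {index: arms[index::k] for index in range(k)}
    # Short list: the winner of every sub-group.
    short_list = {}
    for index, group in remaining.items():
        winner = group_winner(preference_matrix, group)
        group.remove(winner)
        short_list[index] = winner
    ranking = []
    for _ in range(k):
        # Take the best arm on the short list (the heap root in the paper);
        # a linear scan is valid here because a total order over the arms is assumed.
        best_index = next(iter(short_list))
        for index in short_list:
            if beats(preference_matrix, short_list[index], short_list[best_index]):
                best_index = index
        ranking.append(short_list.pop(best_index))
        # Refill the short list from the home sub-group of the removed arm.
        if remaining[best_index]:
            winner = group_winner(preference_matrix, remaining[best_index])
            remaining[best_index].remove(winner)
            short_list[best_index] = winner
    return ranking

preference_matrix = np.array([[0.5, 0.1, 0.1], [0.9, 0.5, 0.3], [0.9, 0.7, 0.5]])
print(top_k_sorting_sketch(preference_matrix, k=2))  # [2, 1]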

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • preference_separation – The assumed preference separation between the best arm and all other arms. Assumed to be 0.01 if neither preference_separation nor duels_per_comparison is specified. Corresponds to \(\Delta_{1,S}\) in [MSE17].

  • duels_per_comparison – Number of times each comparison is repeated before coming to a conclusion. If this is not specified, an optimal value that guarantees the assumptions in the paper is computed from preference_separation. See [MSE17] for more details. Corresponds to \(m\) in [MSE17].

  • k_top_ranked – The desired number of top arms in the given set of arms. If this is not specified it is taken as 2.

  • epsilon\(\epsilon\) in \((\epsilon, \delta)\)-PAC algorithms, given by the user.

wrapped_feedback

The feedback_mechanism parameter with an added decorator. This feedback mechanism will raise an exception if a time horizon is given and a duel would exceed it. The exception is caught in the run function.

preference_estimate

Estimation of a preference matrix based on samples.

top_k_arms

List of top k arms given by the algorithm.

sub_groups

Set of arms divided into \(k\) sub-groups each of size \(\frac{N}{k}\).

sub_group_index

Index of the sub group.

top_1_selection_class

Storing an instance of SingleEliminationTop1Select class.

short_list

From every sub-group a top arm is selected and a short list of all the winners is created.

heap

Storing an instance of Heap class.

algorithm_stage

The stage of the algorithm.

rank_index

Present rank index.

heap_updated

Check whether the heap is updated or not.

duels_per_comparison

Examples

Define a preference-based multi-armed bandit problem through a preference matrix:

>>> from duelpy.feedback import MatrixFeedback
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1],
...     [0.9, 0.5, 0.3],
...     [0.9, 0.7, 0.5],
... ])
>>> random_state=np.random.RandomState(100)
>>> feedback_mechanism = MatrixFeedback(preference_matrix, random_state=random_state)
>>> single_elimination_tournament = SingleEliminationTopKSorting(feedback_mechanism)
>>> single_elimination_tournament.run()
>>> top_k_arms = single_elimination_tournament.get_partial_ranking()
>>> top_k_arms
[2, 1]
explore(self)None

Run exploration phase of the algorithm.

exploration_finished(self)bool

Determine whether the exploration phase is finished.

Once this function returns True, the algorithm will have finished computing a PAC Copeland winner.

get_partial_ranking(self)Optional[List[int]]

Return the top-k ranking computed by the algorithm.

Returns

The top-k arms in the set of arms given to the algorithm, in ranked order.

Return type

list

exploit(self)None

Run one step of exploitation.

abstract step(self)None

Run one step of the algorithm.

This corresponds to a logical step of the algorithm and may perform multiple comparisons. What exactly a “logical step” is depends on the algorithm.

The feedback mechanism is wrapped with a BudgetedFeedbackMechanism therefore conducting duels within a step may raise an exception if a time horizon is specified.

If you want to run an algorithm you should use the run function instead.

Raises

TimeBudgetExceededException – If the step has been terminated early due to the time horizon. The exception class is local to the object. Different instances raise different exceptions. The exception class for this algorithm can be accessed through the exception_class attribute of the wrapped feedback mechanism.

is_finished(self)bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self)None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class SuccessiveElimination(feedback_mechanism: duelpy.feedback.FeedbackMechanism, time_horizon: Optional[int] = None, random_state: numpy.random.mtrand.RandomState = None, sparsity_level: Optional[int] = None, failure_probability: float = 0.1, time_gate: int = 0)

Bases: duelpy.algorithms.interfaces.BordaProducer, duelpy.algorithms.interfaces.PacAlgorithm

Implements the Successive Elimination with Comparison Sparsity algorithm.

The goal of the algorithm is to find a \(\delta\)-PAC Borda winner.

The algorithm has a sample complexity of \(\mathcal{O}(n \log(n))\) and succeeds with probability at least \(1-\delta\). Here, \(n\) is the total number of arms and \(\delta\) is the failure probability.

It is assumed that a unique Borda winner exists and that there are no ties between arms, i.e. the probability of any arm winning against another is never \(\frac{1}{2}\).

The algorithm presented in [JKDN15] works on the exploration problem of finding the Borda winner from noisy comparisons in a sparsity model. A sparsity model assumes a small set of top candidates that are similar to each other, and a large set of irrelevant candidates that would always lose in a pairwise comparison with one of the top candidates. The algorithm exploits this sparsity model to find the Borda winner using fewer samples than standard algorithms. The algorithm implements the successive elimination strategy given in [EDMM06] with the Borda reduction and an additional elimination criterion that exploits sparsity. The algorithm maintains an active set of arms (\(A_t\) in [JKDN15]) of potential Borda winners. At each round \(t\), the algorithm chooses an arm uniformly at random (\(I_t\) in [JKDN15]) from the set and compares it with all the arms in the active set. A parameter time_gate (\(T_0\) in [JKDN15]) is specified to guarantee that all arms with sufficiently large Borda score gaps \(s_1 - s_i\) are eliminated by round \(T_0\). This condition is fulfilled by condition 2 in [JKDN15]. Once \(t>T_0\), i.e., round > time_gate, condition 1 also becomes active and the algorithm starts removing the arms with large partial Borda gaps, exploiting the assumption that the top arms can be distinguished by comparisons with a sparse set of other arms.

The algorithm terminates when only one arm remains.
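One round of the successive-elimination strategy with the Borda reduction can be sketched as follows. The confidence radius used here is a generic Hoeffding-style choice rather than the exact \(C_t\) from [JKDN15], the sparsity-based condition 1 is omitted, and the duel callable is a hypothetical stand-in for the feedback mechanism.

import math
import random

def secs_round(active_arms: list, borda_wins: dict, rounds_played: int,
               failure_probability: float, duel) -> tuple:
    """Play one round: duel every active arm against a random opponent, then eliminate."""
    opponent = random.choice(active_arms)  # corresponds to I_t in the paper
    for arm in active_arms:
        if arm != opponent and duel(arm, opponent):
            borda_wins[arm] += 1
    rounds_played += 1
    # Generic Hoeffding-style confidence radius shared by all Borda estimates (illustrative).
    radius = math.sqrt(math.log(4 * rounds_played ** 2 * len(borda_wins) / failure_probability)
                       / (2 * rounds_played))
    estimates = {arm: borda_wins[arm] / rounds_played for arm in active_arms}
    best_lower_bound = max(estimates[arm] - radius for arm in active_arms)
    # Keep only arms whose upper confidence bound still reaches the best lower bound
    # (this mirrors condition 2 in [JKDN15]).
    still_active = [arm for arm in active_arms if estimates[arm] + radius >= best_lower_bound]
    return still_active, rounds_played

# Example usage with simulated duels from a known preference matrix.
preference_matrix = [[0.5, 0.1, 0.1], [0.9, 0.5, 0.3], [0.9, 0.7, 0.5]]
simulated_duel = lambda i, j: random.random() < preference_matrix[i][j]
borda_wins = {arm: 0 for arm in [0, 1, 2]}
active_arms, rounds_played = [0, 1, 2], 0
while len(active_arms) > 1 and rounds_played < 10000:
    active_arms, rounds_played = secs_round(active_arms, borda_wins, rounds_played,
                                            0.1, simulated_duel)
print(active_arms)  # typically [2], the Borda winner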

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • time_horizon – The total number of rounds.

  • random_state – A numpy random state. Defaults to an unseeded state when not specified.

  • sparsity_level – Refers to \(k\) in Algorithm 1 in [JKDN15]. The assumed size of the sparsity set. This is the set of similar and nearly-optimal arms which are easy to differentiate from all other arms. Should be a value between \(1\) and \(n - 2\), where \(n\) is the size of the set of arms. Defaults to 3 if at least \(5\) arms are available, \(n - 2\) if only \(3\) or \(4\) arms are available. This corresponds to \(k\) in [JKDN15].

  • failure_probability – Refer to \(\delta\) in [JKDN15].

  • time_gate – It is specified to guarantee that all arms with sufficiently large Borda gaps are eliminated when the number of rounds becomes greater than time_gate. Theorem 2 in [JKDN15] specifies the requirement of time_gate parameter. Corresponds to \(T_0\) in [JKDN15]. The paper [JKDN15] suggests \(T_0 = 0\) based on their experiments in section 5.

wrapped_feedback

The feedback_mechanism parameter with an added decorator. This feedback mechanism will raise an exception if a time horizon is given and a duel would exceed it. The exception is caught in the run function.

random_state
time_horizon
failure_probability
time_gate
round
sparsity_level
borda_scores_array

An array which stores the Borda score of each arm.

confidence_factor

Refer to \(C_t\) in [JKDN15].

current_working_set

The set with possible Borda winners from the set \(N\). Corresponds to \(A_t\) in Algorithm 1 in [JKDN15].

Raises

ValueError – Raised when the given sparsity level is invalid or the number of arms is too small to choose a default value. See the documentation of the sparsity_level parameter for details.

Examples

Define a preference-based multi-armed bandit problem through a preference matrix:

>>> from duelpy.feedback import MatrixFeedback
>>> preference_matrix = np.array([
... [0.5, 0.1, 0.1, 0.1, 0.1],
... [0.9, 0.5, 0.3, 0.2, 0.2],
... [0.9, 0.7, 0.5, 0.8, 0.9],
... [0.9, 0.8, 0.2, 0.5, 0.2],
... [0.9, 0.8, 0.1, 0.8, 0.5]
... ])
>>> random_state = np.random.RandomState(43)
>>> feedback_mechanism = MatrixFeedback(preference_matrix=preference_matrix, random_state=random_state)
>>> secs = SuccessiveElimination(feedback_mechanism=feedback_mechanism, random_state=random_state, time_horizon=1000, time_gate = 50, failure_probability=0.1)
>>> secs.run()
>>> borda_winner = secs.get_borda_winner()
>>> comparisons = secs.wrapped_feedback.duels_conducted
>>> borda_winner, comparisons
(2, 1000)
explore(self)None

Run one step of exploration.

exploration_finished(self)bool

Determine whether the exploration phase is finished.

Returns

Whether exploration is finished.

Return type

bool

get_borda_winner(self)Optional[int]

Return the computed Borda winner if it is ready.

This will only return a result when step has been called a sufficient amount of times. If this is a PAC algorithm, the result might be approximate.

Returns

The arm with the highest Borda score.

Return type

int

exploit(self)None

Run one step of exploitation.

abstract step(self)None

Run one step of the algorithm.

This corresponds to a logical step of the algorithm and may perform multiple comparisons. What exactly a “logical step” is depends on the algorithm.

The feedback mechanism is wrapped with a BudgetedFeedbackMechanism therefore conducting duels within a step may raise an exception if a time horizon is specified.

If you want to run an algorithm you should use the run function instead.

Raises

TimeBudgetExceededException – If the step has been terminated early due to the time horizon. The exception class is local to the object. Different instances raise different exceptions. The exception class for this algorithm can be accessed through the exception_class attribute of the wrapped feedback mechanism.

is_finished(self)bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self)None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class WinnerStaysStrongRegret(feedback_mechanism: duelpy.feedback.FeedbackMechanism, time_horizon: Optional[int] = None, exploitation_factor: float = 2, random_state: numpy.random.RandomState = np.random.RandomState())

Bases: duelpy.algorithms.interfaces.CondorcetProducer

Implements the strong regret version of the Winner Stays algorithm.

The goal of this algorithm is to find the Condorcet winner while minimizing the strong regret suffered in the process.

The algorithm assumes at the very least that a Condorcet winner exists, but the expected regret improves if a total order over the arms exists.

The incurred strong regret depends on the number of duels made \(T\) and on the number of arms \(N\): \(\mathcal{O}(N^2 + N \log(T))\). If a total order over the arms exists, this is improved to \(\mathcal{O}(N \log(T) + N \log(N))\).

This algorithm is based on the weak regret version. It interleaves the weak regret Winner Stays algorithm with exponentially increasing periods of pure exploitation (pulling the currently believed-to-be-best arm twice). As soon as we have found the best arm, the strong regret in the exploitation phase will be \(0\). Since the duration is exponentially increasing, this leads to a strong regret of \(0\) per round in the limit. For details see [CF17].
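The round structure can be illustrated with a small helper: in round \(r\) the currently best arm is exploited for an exponentially growing number of duels. The schedule below (\(\lfloor \beta^r \rfloor\) exploitation duels in round \(r\)) is a plausible reading of the doubling described above, not necessarily the exact constants used in [CF17].

def exploitation_schedule(exploitation_factor: float, num_rounds: int) -> list:
    """Number of pure-exploitation duels per round for the strong regret variant (sketch)."""
    return [int(exploitation_factor ** round_index) for round_index in range(1, num_rounds + 1)]

# With the default exploitation_factor of 2 the round lengths double:
print(exploitation_schedule(2, 5))  # [2, 4, 8, 16, 32]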

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • exploitation_factor – Determines the length of rounds, i.e. how often the best arm should be pulled in each round. It should be larger than \(1\). The default is 2, which results in a doubling of the round length. This parameter is called \(\beta\) in [CF17].

  • time_horizon – How many comparisons the algorithm should do. This does not impact the decisions of the algorithm, only how many times step executes. May be None to indicate an unknown or infinite time horizon.

  • random_state – Optional, used for random choices in the algorithm.

wrapped_feedback

The feedback_mechanism parameter with an added decorator. This feedback mechanism will raise an exception if a time horizon is given and a duel would exceed it. The exception is caught in the run function.

exploitation_factor
random_state

Examples

Define a preference-based multi-armed bandit problem through a preference matrix:

>>> from duelpy.feedback import MatrixFeedback
>>> from duelpy.stats.metrics import StrongRegret
>>> from duelpy.util.feedback_decorators import MetricKeepingFeedbackMechanism
>>> preference_matrix = np.array([
...     [0.5, 0.1, 0.1],
...     [0.9, 0.5, 0.3],
...     [0.9, 0.7, 0.5],
... ])
>>> random_state = np.random.RandomState(1)
>>> feedback_mechanism = MetricKeepingFeedbackMechanism(
...     MatrixFeedback(preference_matrix, random_state=random_state),
...     metrics={"strong_regret": StrongRegret(preference_matrix)}
... )
>>> ws_wr = WinnerStaysStrongRegret(feedback_mechanism, random_state=random_state)
>>> for t in range(100):
...     ws_wr.step()
>>> np.round(np.sum(feedback_mechanism.results["strong_regret"]), 2)
1.8
step(self)None

Execute one iteration of the algorithm.

Step through one iteration of the algorithm. In the first iteration of each round, the weak regret Winner Stays algorithm is consulted; afterwards the currently best arm is chosen.

get_condorcet_winner(self)int

Get the index of the arm currently believed to be the Condorcet winner.

Returns

The Condorcet winner

Return type

int

exploit(self)None

Run one step of exploitation.

is_finished(self)bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self)None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

class WinnerStaysWeakRegret(feedback_mechanism: duelpy.feedback.FeedbackMechanism, time_horizon: Optional[int] = None, random_state: numpy.random.RandomState = None)

Bases: duelpy.algorithms.interfaces.CondorcetProducer

Implements the weak regret version of the Winner Stays algorithm.

The goal of this algorithm is to find the Condorcet winner while minimizing the weak regret suffered in the process.

The algorithm assumes at the very least that a Condorcet winner exists, but the expected regret improves if a total order over the arms exists.

The incurred weak regret is constant in time and only depends on the number of arms \(N\): \(\mathcal{O}(N^2)\). If a total order over the arms exists, this is improved to \(\mathcal{O}(N \log(N))\).

The algorithm is described in [CF17]. It is tournament-based and stores the difference between won and lost duels for each arm. The next arms are selected from the set of arms with the highest difference. If one of the arms from the previous round is still in this argmax set, it is chosen again; otherwise, and in the first round, the arms are chosen uniformly at random from the set. The two chosen arms are guaranteed to not be identical.
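The selection rule can be sketched as follows (an illustration only; tie-breaking details may differ from the library implementation). Here win_deltas maps each arm to its wins-minus-losses count and previous_arms holds the pair played in the last round.

import random

def select_pair(win_deltas: dict, previous_arms: tuple) -> tuple:
    """Pick the next two (distinct) arms from the arms with the highest win-loss difference."""
    best_delta = max(win_deltas.values())
    argmax_set = [arm for arm, delta in win_deltas.items() if delta == best_delta]
    # Prefer an arm from the previous round if it is still in the argmax set.
    first = next((arm for arm in previous_arms if arm in argmax_set), None)
    if first is None:
        first = random.choice(argmax_set)
    # The second arm is drawn from the remaining arms with the highest difference.
    remaining = {arm: delta for arm, delta in win_deltas.items() if arm != first}
    second_best_delta = max(remaining.values())
    second = random.choice([arm for arm, delta in remaining.items() if delta == second_best_delta])
    return first, second

# First round: no history, all differences are zero, so two distinct arms are picked at random.
print(select_pair({0: 0, 1: 0, 2: 0}, previous_arms=()))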

Parameters
  • feedback_mechanism – A FeedbackMechanism object describing the environment.

  • time_horizon – How many comparisons the algorithm should make. This does not impact the decisions of the algorithm, only how many times step executes. May be None to indicate an unknown or infinite time horizon.

  • random_state – Optional, used for random choices in the algorithm.

wrapped_feedback

The feedback_mechanism parameter with an added decorator. This feedback mechanism will raise an exception if a time horizon is given and a duel would exceed it. The exception is caught in the run function.

win_deltas

Stores the difference between won and lost rounds for each arm. Corresponds to the \(C(t,i)\) values in [CF17].

random_state

Examples

Define a preference-based multi-armed bandit problem through a preference matrix:

>>> from duelpy.feedback import MatrixFeedback
>>> from duelpy.stats.metrics import WeakRegret
>>> from duelpy.util.feedback_decorators import MetricKeepingFeedbackMechanism
>>> preference_matrix = np.array([
...     [0.5, 0.4, 0.4],
...     [0.6, 0.5, 0.3],
...     [0.6, 0.7, 0.5],
... ])
>>> random_state = np.random.RandomState(3)
>>> feedback_mechanism = MetricKeepingFeedbackMechanism(
...     MatrixFeedback(preference_matrix, random_state=random_state),
...     metrics={"weak_regret": WeakRegret(preference_matrix)}
... )
>>> ws_wr = WinnerStaysWeakRegret(feedback_mechanism, random_state=random_state)
>>> for t in range(100):
...    ws_wr.step()
>>> np.round(np.sum(feedback_mechanism.results["weak_regret"]), 2)
0.6
step(self)None

Execute one round of the algorithm.

Run through one round of the algorithm. First, the arms with the largest wins-losses difference are selected. Then the feedback is used to update the win-loss statistic.

get_condorcet_winner(self)int

Get the index of the arm currently believed to be the Condorcet winner.

Returns

The Condorcet winner

Return type

int

exploit(self)None

Run one step of exploitation.

is_finished(self)bool

Determine if the algorithm is finished.

This may be based on a time horizon or a different algorithm-specific termination criterion if time_horizon is None.

run(self)None

Run the algorithm until completion.

Completion is determined through the is_finished function. Refer to its documentation for more details.

regret_minimizing_algorithms
other_algorithms
interfaces
__all__