Abstract: |
We consider a restless multi-armed bandit in which each arm can be in one of two states. When an arm is sampled, the state of the arm is not available to the sampler. Instead, a binary signal with a known randomness that depends on the state of the arm is made available. No signal is displayed if the arm is not sampled. An arm-dependent reward is accrued from each sampling. In each time step, each arm changes state according to known transition probabilities which in turn depend on whether the arm is sampled or not sampled. Since the state of the arm is never visible and has to be inferred from the current belief and a possible binary signal, we call this the hidden Markov bandit. Our interest is in a policy to select the arm(s) in each time step that maximises the infinite horizon discounted reward. Specifically, we seek to use Whittle’s index in selecting the arms. We first analyse the single-armed bandit and show that it admits an approximate threshold-type optimal policy when the ‘no-sample’ action is subsidised. Next, we show that the single-armed bandit also satisfies an approximate-indexability property. Numerical examples support the analytical results.