Inria Lille, SequeL Team
Thursday, December 17th, 2020
Given a collection of $K$ unknown measurement distributions $\nu_1, \cdots, \nu_K$ with respective means $\mu_1, \cdots, \mu_K$, the goal is to identify the arm with the largest mean:
$$\mu^\star = \color{red}{\max}_i \mu_i$$
$$I^\star = \color{red}{\operatorname{arg\,max}}_i \mu_i$$
Formally, a BAI algorithm is composed of:
- a sampling rule choosing the arm $I_n$ to sample at round $n$;
- a stopping rule deciding after which round $\tau_\delta$ to stop sampling;
- a recommendation rule outputting a guess $J_n$ for the best arm.
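As a picture of how the three rules above interact, here is a minimal generic skeleton of a fixed-confidence BAI run (a sketch only; `pull`, `sampling_rule`, `stopping_rule` and `recommend` are placeholder names, not from the slides):

```python
def bai_loop(pull, sampling_rule, stopping_rule, recommend, max_rounds=10**6):
    """Generic fixed-confidence BAI skeleton: sample, observe, stop, recommend.

    pull(i) returns one observation Y_{n,i} from arm i; the three rule
    arguments are placeholders for the components listed above.
    """
    history = []  # list of (arm, observation) pairs
    for n in range(1, max_rounds + 1):
        arm = sampling_rule(history)      # choose I_n
        history.append((arm, pull(arm)))  # observe Y_{n, I_n}
        if stopping_rule(history):        # stopping time tau_delta reached
            break
    return recommend(history)             # output the guess J_n
```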
We are interested in Top-Two Thompson Sampling (Russo, 2016)
A prior distribution $\Pi_1$ over a parameter set $\Theta$, with $\mu\in\Theta$.
The observations $(Y_{1,I_1},\cdots,Y_{n-1,I_{n-1}})$ are used to update it to the posterior $\Pi_n$, whose density is
\[ \pi_n(\mu') = \frac{\pi_1(\mu')L_{n-1}(\mu')}{\int_{\Theta}\pi_1(\mu'')L_{n-1}(\mu'')d\mu''} \] where \[ L_{n-1}(\mu') = \prod_{l=1}^{n-1}p(Y_{l,I_l}|\mu'_{I_l})\,. \]
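For concreteness, a minimal sketch of one TTTS sampling step (Russo, 2016) for Bernoulli arms with uniform Beta(1,1) priors (variable names are ours): with probability $\beta$ the posterior leader is played; otherwise the posterior is re-sampled until a different arm tops the draw, and that challenger is played.

```python
import numpy as np

def ttts_step(s, f, beta, rng):
    """One TTTS arm selection for Bernoulli arms with Beta(1,1) priors.

    s, f: per-arm success/failure counts, so arm i has posterior
    Beta(1 + s[i], 1 + f[i]).
    """
    s, f = np.asarray(s), np.asarray(f)
    leader = int(np.argmax(rng.beta(s + 1, f + 1)))  # first posterior draw
    if rng.random() < beta:
        return leader
    challenger = leader
    while challenger == leader:  # re-sample until the top arm changes
        challenger = int(np.argmax(rng.beta(s + 1, f + 1)))
    return challenger
```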
What we know about TTTS... (Posterior convergence)
Theorem (Russo, 2016) Under TTTS and under the previous boundedness assumptions, it holds a.s. \[ \lim_{n\rightarrow{\infty}} -\frac{1}{n}\log(1-\alpha_{n,I^\star}) = \color{red}{\Gamma_{\beta}^\star}, \] where \begin{equation*} \quad \alpha_{n,i} \triangleq \Pi_{n}(\theta_i > \max_{j\neq i}\theta_j). \end{equation*}
What we know about TTTS... (Complexity)
Definition Let $\Sigma_K = \{\omega : \sum_{k=1}^K \omega_k = 1, \omega_k \geq 0\}$ and define for all $i\neq I^\star$ \[ C_i(\omega,\omega') \triangleq \min_{x\in\mathcal{I}} \omega\, d(\mu_{I^\star};x) + \omega'\, d(\mu_i;x), \] where $\mathcal{I}$ is the set of possible mean parameters and $d(\mu,\mu')$ is the KL-divergence between the distributions with means $\mu$ and $\mu'$. We define \begin{equation*} \Gamma_{\beta}^\star \triangleq \max_{\substack{\omega \in \Sigma_K\\\omega_{I^\star}=\beta}}\min_{i\neq I^\star} C_i(\omega_{I^\star},\omega_i). \end{equation*}
In particular, for Gaussian bandits... \[ \Gamma_{\beta}^\star = \max_{\omega:\omega_{I^\star}=\beta}\min_{i\neq I^\star} \frac{(\mu_{I^\star}-\mu_i)^2}{2\sigma^2(1/\omega_i+1/\beta)}. \]
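As a sanity check, $\Gamma_\beta^\star$ can be evaluated numerically in the Gaussian case by maximizing the inner minimum over the constrained simplex. A sketch using scipy (the helper name `gamma_beta_star` is ours):

```python
import numpy as np
from scipy.optimize import minimize

def gamma_beta_star(mu, beta, sigma=1.0):
    """Numerically evaluate Gamma_beta^* for a Gaussian bandit instance."""
    i_star = int(np.argmax(mu))
    others = [i for i in range(len(mu)) if i != i_star]

    def neg_worst_cost(w):  # maximize min_i C_i  <=>  minimize its negative
        return -min((mu[i_star] - mu[i]) ** 2
                    / (2 * sigma**2 * (1 / w[k] + 1 / beta))
                    for k, i in enumerate(others))

    w0 = np.full(len(others), (1 - beta) / len(others))
    res = minimize(neg_worst_cost, w0,
                   bounds=[(1e-9, 1 - beta)] * len(others),
                   constraints=[{"type": "eq",
                                 "fun": lambda w: w.sum() - (1 - beta)}])
    return -res.fun

# e.g. gamma_beta_star(np.array([0.5, 0.4, 0.3]), beta=0.5)
```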
What we want to know about TTTS...
Posterior convergence
Theorem 1 (Shang et al., 2020) Under TTTS, for Gaussian bandits with Gaussian priors, it holds a.s. \[ \lim_{n\rightarrow{\infty}} -\frac{1}{n}\log(1-\alpha_{n,I^\star}) = \color{red}{\Gamma_{\beta}^\star}. \]
Theorem 2 (Shang et al., 2020) Under TTTS, for Bernoulli bandits and uniform priors, it holds a.s. \[ \lim_{n\rightarrow{\infty}} -\frac{1}{n}\log(1-\alpha_{n,I^\star}) = \color{red}{\Gamma_{\beta}^\star}. \]
Sample complexity
Theorem 3 (Shang et al., 2020) The TTTS sampling rule coupled with the Chernoff stopping rule form a $\delta$-correct BAI strategy. If all arm means are distinct, \[ \limsup_{\delta\rightarrow{0}} \frac{\expectedvalue[\tau_{\delta}]}{\log(1/\delta)} \leq \frac{1}{\color{red}{\Gamma_{\beta}^\star}}. \]
Recall (Lower bound): \[\liminf_{\delta \rightarrow 0}\frac{\expectedvalue[\tau_\delta]}{\log(1/\delta)} \geq \frac{1}{\color{red}{\Gamma^\star_\beta}}.\]
$\delta$-correctness
Stopping rule: \begin{equation} \tau_\delta^{\text{Ch.}} \triangleq \inf \left\lbrace n \in \mathbb{N} : \max_{i \in \mathcal{A}} \min_{j \in \mathcal{A} \setminus \{i\} } \color{lightskyblue}{W_{n}(i,j)} > d_{n,\delta} \right\rbrace. \end{equation}
Transportation cost: $\mu_{n,i,j} \triangleq (T_{n,i}\mu_{n,i} +T_{n,j}\mu_{n,j})/(T_{n,i}+T_{n,j})$, then we define \begin{equation} \color{lightskyblue}{W_n(i,j)} \triangleq \left\{ \begin{array}{ll} 0 & \text{if } \mu_{n,j} \geq \mu_{n,i},\\ W_{n,i,j}+W_{n,j,i} & \text{otherwise}, \end{array}\right. \end{equation} where $W_{n,i,j} \triangleq T_{n,i} d\left(\mu_{n,i},\mu_{n,i,j}\right)$ for any $i,j$.
In particular, for Gaussian bandits: \[ \color{lightskyblue}{W_n(i,j)} = \dfrac{(\mu_{n,i}-\mu_{n,j})^2}{2\sigma^2(1/T_{n,i}+1/T_{n,j})}\mathbb{1}\{\mu_{n,j}<\mu_{n,i}\}. \]
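A minimal sketch of this transportation cost and the resulting Chernoff stopping test in the Gaussian case (function names are ours):

```python
import numpy as np

def transport_cost(mu_hat, counts, i, j, sigma=1.0):
    """Gaussian W_n(i, j) from the display above; 0 unless mu_hat[j] < mu_hat[i]."""
    if mu_hat[j] >= mu_hat[i]:
        return 0.0
    return (mu_hat[i] - mu_hat[j]) ** 2 \
        / (2 * sigma**2 * (1 / counts[i] + 1 / counts[j]))

def chernoff_should_stop(mu_hat, counts, threshold, sigma=1.0):
    """Stop once some arm i beats every j != i by more than d_{n, delta}."""
    K = len(mu_hat)
    stat = max(min(transport_cost(mu_hat, counts, i, j, sigma)
                   for j in range(K) if j != i)
               for i in range(K))
    return stat > threshold
```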
Theorem (Shang et al., 2020) The TTTS sampling rule coupled with the Chernoff stopping rule with a threshold $d_{n,\delta} \simeq \log(1/\delta) + c\log(\log(n))$ and the recommendation rule $J_n = \operatorname{arg\,max}_i \mu_{n,i}$ form a $\delta$-correct BAI strategy.
Bayesian stopping rule: \[\tau_{\delta} \triangleq \inf \left\{ n\in\mathbb{N}:\max_{i\in\mathcal{A}} \color{limegreen}{\alpha_{n,i}} \geq c_{n,\delta} \right\}\,.\]
Sufficient condition for $\beta$-optimality
Lemma (Shang et al. 2020) Let $\delta,\beta\in (0,1)$. Any sampling rule satisfying $\expectedvalue[\color{lightskyblue}{T_{\beta}^\epsilon}] < \infty$ for all $\epsilon > 0$, once coupled with the Chernoff stopping rule, satisfies \[ \limsup_{\delta\rightarrow{0}} \frac{\expectedvalue[\tau_{\delta}]}{\log(1/\delta)} \leq \frac{1}{\Gamma_{\beta}^\star}. \]
$\color{lightskyblue}{T_{\beta}^\epsilon} \triangleq \inf \left\{ N\in\mathbb{N}: \max_{i\in\mathcal{A}} \vert T_{n,i}/n-\omega_i^\beta \vert \leq \epsilon, \forall n \geq N \right\}$, where $\omega^\beta$ is the $\beta$-optimal allocation achieving $\Gamma_{\beta}^\star$.
Core theorem
Theorem (Shang et al. 2020) Under TTTS, $\expectedvalue[\color{lightskyblue}{T_{\beta}^\epsilon}] < +\infty$.
The proof is inspired by Qin et al. (2017), but introduces major technical novelties; in particular, it is considerably more intricate due to the randomized selection of the two candidate arms.
TTTS
T3C (Top-Two Transportation Cost)
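In Shang et al. (2020), T3C keeps the TTTS leader draw but replaces the costly re-sampling of the challenger: the challenger is instead the arm minimizing the transportation cost to the leader, $\operatorname{arg\,min}_{j\neq I^{(1)}} W_n(I^{(1)},j)$. A sketch for the Gaussian case, reusing `transport_cost` from the earlier snippet (ties are broken by `argmin` here for brevity; the paper breaks them at random):

```python
import numpy as np

def t3c_step(mu_hat, counts, beta, rng, sigma=1.0):
    """One T3C arm selection for Gaussian arms (posterior N(mu_hat, sigma^2/T))."""
    mu_hat, counts = np.asarray(mu_hat), np.asarray(counts, dtype=float)
    theta = rng.normal(mu_hat, sigma / np.sqrt(counts))  # posterior draw
    leader = int(np.argmax(theta))
    if rng.random() < beta:
        return leader
    costs = [transport_cost(mu_hat, counts, leader, j, sigma)
             if j != leader else float("inf")
             for j in range(len(mu_hat))]
    return int(np.argmin(costs))  # cheapest-to-confuse challenger
```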
Sample complexity
Theorem (Shang et al. 2020) The T3C sampling rule coupled with the Chernoff stopping rule form a $\delta$-correct BAI strategy. If all arm means are distinct, \[ \limsup_{\delta\rightarrow{0}} \frac{\expectedvalue[\tau_{\delta}]}{\log(1/\delta)} \leq \frac{1}{\color{red}{\Gamma_{\beta}^\star}}. \]
Time consumption
Sampling rule | T3C | TTTS | Uniform |
---|---|---|---|
Exec. time (s) | $1.6 \times 10^{-5}$ | $2.3 \times 10^{-4}$ | $6.0 \times 10^{-6}$ |
Average stopping time (Bernoulli)
Average stopping time (Gaussian)
We tackle hyper-parameter tuning for supervised learning tasks:
We see the problem as BAI in a stochastic infinitely-armed bandit: arm means are drawn from some reservoir distribution $\nu_0$
Goal: output an arm with mean close to $\mu^\star$
HPO as a BAI problem
BAI | HPO |
---|---|
query $\nu_0$ | pick a new configuration $\lambda$ |
sample an arm | train the classifier |
reward | cross-validation loss |
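To make this correspondence concrete, a hypothetical sketch for the SVM case (the reservoir $\nu_0$, here a log-uniform draw over the hyper-parameter box, and the function names are illustrative; we return the CV score, while the slides phrase the feedback as a cross-validation loss, which only flips the sign):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def query_reservoir():
    """Querying nu_0 = picking a new configuration lambda."""
    return {"C": 10 ** rng.uniform(-5, 5), "gamma": 10 ** rng.uniform(-5, 5)}

def pull_arm(config, X, y):
    """Sampling an arm = training the classifier; the observed reward is
    a noisy cross-validation evaluation of that configuration."""
    return cross_val_score(SVC(**config), X, y, cv=3).mean()
```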
Dynamic Top-Two Thompson Sampling
In summary: in each round, query a new arm endowed with a Beta(1,1) prior, without sampling it, and run TTTS on the new set of arms
Order statistic trick
With $\mathcal{L}_{t-1}$ the list of arms that have actually been sampled before time $t$, we run TTTS on the set $\mathcal{L}_{t-1} \cup \{\mu_0\}$, where $\mu_0$ is a pseudo-arm with posterior Beta($t-|\mathcal{L}_{t-1}|, 1$).
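A minimal sketch of the leader draw with this trick for Bernoulli rewards (names are ours): the pseudo-arm stands in for all configurations queried so far but never sampled, and winning the draw means a brand-new configuration should be queried.

```python
import numpy as np

def dynamic_ttts_leader(s, f, t, rng):
    """Leader draw for Dynamic TTTS with the order-statistic trick.

    s, f: success/failure counts of the arms in L_{t-1} (Beta(1+s, 1+f)
    posteriors); the pseudo-arm gets Beta(t - len(s), 1). Returns an arm
    index, or -1 when the pseudo-arm wins (query a new configuration).
    """
    theta = rng.beta(np.asarray(s) + 1, np.asarray(f) + 1)
    pseudo = rng.beta(max(t - len(s), 1), 1)  # t - |L_{t-1}| >= 1 assumed
    if len(theta) == 0 or pseudo > theta.max():
        return -1
    return int(np.argmax(theta))
```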
SiRI (Carpentier and Valko, 2015) as a baseline for infinite-armed bandits
Hyperband (Li et al., 2017) as a baseline for HPO
Setting
Classifier | Hyper-parameter | Type | Bounds
---|---|---|---
SVM | C | $\mathbb{R}^+$ | $\left[ 10^{-5}, 10^{5} \right]$
 | $\gamma$ | $\mathbb{R}^+$ | $\left[ 10^{-5}, 10^{5} \right]$
MLP | hidden layer size | Integer | $\left[ 5, 50 \right]$
 | $\alpha$ | $\mathbb{R}^+$ | $\left[ 0, 0.9 \right]$
 | learning rate init | $\mathbb{R}^+$ | $\left[ 10^{-5}, 10^{-1} \right]$
Some results (I)
Some results (II)
In the linear case: