Combinatorial Pure Exploration with Partial or Full-Bandit Linear Feedback
In this paper, we propose the novel model of combinatorial pure exploration with partial linear feedback (CPE-PL). In CPE-PL, given a combinatorial action space X ⊆ {0,1}^d, in each round a learner chooses one action x ∈ X to play, obtains a random (possibly nonlinear) reward related to x and an unknown latent vector θ ∈ R^d, and observes a partial linear feedback M_x (θ + η), where η is a zero-mean noise vector and M_x is a transformation matrix for x. The objective is to identify the optimal action with the maximum expected reward using as few rounds as possible. We also study an important subproblem of CPE-PL, i.e., combinatorial pure exploration with full-bandit feedback (CPE-BL), in which the learner observes full-bandit feedback (i.e., M_x = x^⊤) and gains linear expected reward x^⊤ θ after each play. In this paper, we first propose a polynomial-time algorithmic framework for the general CPE-PL problem with a novel sample complexity analysis. Then, we propose an adaptive algorithm dedicated to the subproblem CPE-BL with better sample complexity. Our work provides a novel polynomial-time solution that simultaneously addresses limited feedback, general reward functions, and combinatorial action spaces including matroids, matchings, and s-t paths.
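The observation model above can be illustrated with a minimal simulation sketch. This is not the authors' algorithm, only a hypothetical environment showing what the learner observes in CPE-PL, and how the full-bandit case CPE-BL (M_x = x^⊤) reduces the observation to a single noisy scalar x^⊤(θ + η). The dimension, θ values, and noise scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
theta = rng.normal(size=d)  # unknown latent vector (hypothetical values)

def partial_linear_feedback(M_x, theta, noise_scale=0.1):
    """Return one observation M_x (theta + eta), with eta zero-mean Gaussian noise."""
    eta = rng.normal(scale=noise_scale, size=theta.shape)
    return M_x @ (theta + eta)

# Full-bandit special case (CPE-BL): M_x = x^T is a 1 x d matrix,
# so the learner observes a single noisy scalar x^T (theta + eta),
# whose expectation is the linear reward x^T theta.
x = np.array([1, 0, 1, 1, 0], dtype=float)  # one combinatorial action in {0,1}^d
M_x = x.reshape(1, -1)
obs = partial_linear_feedback(M_x, theta)
assert obs.shape == (1,)
```

Averaging many such observations for a fixed x concentrates around x^⊤ θ, which is what an elimination-style pure-exploration algorithm exploits to identify the optimal action.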