PAnDR: Fast Adaptation to New Environments from Offline Experiences via Decoupling Policy and Environment Representations

by Tong Sang, et al.

Deep Reinforcement Learning (DRL) has been a promising solution to many complex decision-making problems. Nevertheless, its notorious weakness in generalizing across environments prevents widespread application of DRL agents in real-world scenarios. Although advances have been made recently, most prior works assume sufficient online interaction with the training environments, which can be costly in practice. To this end, we focus on an offline-training-online-adaptation setting, in which the agent first learns from offline experiences collected in environments with different dynamics and then performs online policy adaptation in environments with new dynamics. In this paper, we propose Policy Adaptation with Decoupled Representations (PAnDR) for fast policy adaptation. In the offline training phase, the environment representation and policy representation are learned through contrastive learning and policy recovery, respectively. The representations are further refined by mutual-information optimization to make them more decoupled and complete. With the learned representations, a Policy-Dynamics Value Function (PDVF) (Raileanu et al., 2020) network is trained to approximate the values of different combinations of policies and environments. In the online adaptation phase, with the environment context inferred from a few experiences collected in the new environment, the policy is optimized by gradient ascent with respect to the PDVF. Our experiments show that PAnDR outperforms existing algorithms on several representative policy adaptation problems.
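The online adaptation step described above can be sketched in a few lines: given a frozen value network over (environment, policy) embeddings, adaptation amounts to gradient ascent on the policy embedding. The sketch below is illustrative only; the network architecture, embedding dimensions, and all names (`PDVF`, `adapt_policy_embedding`) are assumptions for the example, not the paper's implementation.

```python
import torch

class PDVF(torch.nn.Module):
    """Toy value network V(z_env, z_pi) over learned embeddings (illustrative)."""
    def __init__(self, env_dim=8, pi_dim=8, hidden=32):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(env_dim + pi_dim, hidden),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden, 1),
        )

    def forward(self, z_env, z_pi):
        return self.net(torch.cat([z_env, z_pi], dim=-1))

def adapt_policy_embedding(pdvf, z_env, z_pi_init, steps=50, lr=0.1):
    """Online adaptation: gradient ascent on the policy embedding w.r.t. the value."""
    z_pi = z_pi_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([z_pi], lr=lr)
    for _ in range(steps):
        loss = -pdvf(z_env, z_pi).mean()  # negate: ascend on the value estimate
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z_pi.detach()

torch.manual_seed(0)
pdvf = PDVF()
z_env = torch.randn(8)   # environment context inferred from a few new-env transitions
z_pi0 = torch.randn(8)   # initial policy embedding
z_pi = adapt_policy_embedding(pdvf, z_env, z_pi0)
v0 = pdvf(z_env, z_pi0).item()
v1 = pdvf(z_env, z_pi).item()
print(v0, v1)
```

In this sketch the PDVF weights stay fixed during adaptation and only the policy embedding is updated, mirroring the paper's setting where the value function is trained offline and only a lightweight search is performed online.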




