Reinforcement learning (RL) is concerned with learning a sequential decision-making policy from the interactions between an agent and its environment Sutton2018. More precisely, training is driven by the rewards that the agent receives from the environment as a consequence of its actions. Within this context, the objective is to identify the actions maximising the discounted sum of rewards, also named return. Multiple sound approaches build on the RL paradigm, and key milestones have been achieved over years of research. Nevertheless, these RL algorithms mostly rely on the expectation of the return, not on its complete probability distribution. For instance, the popular Q-learning methodology is based on the modelling of the Q function, which can in fact be seen as an estimation of the expected return Watkins1992.
While focusing exclusively on the expectation of the return has already proven to be perfectly sound for numerous applications, this approach exhibits clear limitations for other decision-making problems. Indeed, some areas of application may also require properly mitigating the risk associated with the actions taken Paduraru2021. One could for instance mention the healthcare Gottesman2019 and finance Theate2020 sectors, but also robotics in general Thananjeyan2021, and especially autonomous driving Zhu2022. Such a requirement for risk management may hold not only for the decision-making policy, but potentially also for the exploration policy during the learning process. In addition, properly taking the risk into consideration may be particularly valuable in environments characterised by substantial uncertainty.
The present research work proposes taking advantage of the distributional RL approach, belonging to the Q-learning category, in order to learn risk-sensitive policies. Basically, a distributional RL algorithm targets the complete probability distribution of the random return rather than only its expectation Bellemare2017. This methodology presents key advantages. Firstly, it enables the learning of a richer representation of the environment, which may lead to an increase in the decision-making policy performance. Secondly, the distributional RL approach helps improve the explainability of the decision-making process, which is key in machine learning to avoid black-box models. Lastly and most importantly for this research work, it makes it possible to conveniently derive both decision-making policies and exploration strategies that are sensitive to the risk.
The core idea promoted by this research work is the use of the risk-based utility function U as a replacement for the popular Q function for action selection. In fact, it may be seen as an extension of the Q function that takes the risk into consideration, the risk being assumed to be represented by the worst returns achievable by a policy. Therefore, the U function is to be derived from the complete probability distribution of the random return Z, which is learnt by any distributional RL algorithm. The single modification to that RL algorithm required to learn risk-sensitive policies is to employ the utility function U rather than the expected return Q for both exploration and decision-making. This makes the presented approach a very practical and interpretable RL solution for achieving risk-sensitive decision-making.
2 Literature review
The core objective of the classical RL approach is to learn optimal decision-making policies without any concern for risk or safety Sutton2018. The resulting policies are said to be risk-neutral. Nevertheless, numerous real-world applications require taking the risk into consideration in order to ensure safer decision-making Paduraru2021. Two main approaches can be identified for achieving safe RL. Firstly, the optimality criterion can be modified so that a safety factor is included. Secondly, the exploration process can be altered based on a risk metric Garcia2015. These techniques give rise to risk-sensitive or risk-averse policies.
Scientific research on risk-sensitive RL has been particularly active for the past decade. Various relevant risk criteria have been studied for that purpose. The most popular ones are undoubtedly the mean-variance Castro2012; Prashanth2013; Zhang2021 and the (Conditional) Value at Risk (CVaR) Rockafellar2001; Chow2015; Chow2017. Innovative techniques have been introduced for both policy gradient Tamar2015; Rajeswaran2017; Hiraoka2019 and value iteration Shen2014; Dabney2018; Tang2019; Urpi2021; Yang2022 approaches, with the solutions proposed covering both discrete and continuous action spaces. Additionally, risk-sensitive methodologies have also been studied in some niche sub-fields of RL, such as robust adversarial RL Pinto2017, but also multi-agent RL Qiu2021.
Focusing on the value iteration methodology, the novel distributional RL approach Bellemare2017 has been a key breakthrough, by giving access to the full probability distribution of the random return. For instance, Dabney2018 proposes achieving risk-sensitive decision-making via a distortion risk measure. Applied on top of the IQN distributional RL algorithm, this is in fact equivalent to changing the sampling distribution of the quantiles. In Tang2019, a novel actor-critic framework is presented, based on the distributional RL approach for the critic component. The latter work is extended to the offline setting in Urpi2021, since training RL agents online may be prohibitive because of the risk inevitably induced by exploration. One can finally mention Yang2022, which introduces the Worst-Case Soft Actor Critic (WCSAC) algorithm, based on the approximation of the probability distribution of accumulated safety-costs in order to achieve risk control. More precisely, a certain level of CVaR, estimated from the distribution, is regarded as a safety constraint.
In light of this literature, the novel solution introduced in this research paper presents key advantages. Firstly, the methodology proposed is relatively simple and can be applied on top of any distributional RL algorithm with minimal modification to the core algorithm. Secondly, the proposed approach makes it possible to span the entire potential trade-off between risk minimisation and expected return maximisation. According to the user’s needs, the policy learnt can be risk-averse, risk-neutral or in between the two (risk-sensitive). Lastly, the solution presented helps improve the interpretability of the decision-making policy learnt.
3 Theoretical background
3.1 Markov decision process
Traditionally in RL, the interactions between the agent and its environment are modelled as a Markov decision process (MDP). An MDP is a 6-tuple $(\mathcal{S}, \mathcal{A}, p_R, p_T, p_0, \gamma)$ where $\mathcal{S}$ and $\mathcal{A}$ respectively are the state and action spaces, $p_R(r|s,a)$ is the probability distribution from which the reward $r \in \mathbb{R}$ is drawn given a state-action pair $(s,a)$, $p_T(s'|s,a)$ is the transition probability distribution, $p_0(s_0)$ is the probability distribution over the initial states $s_0 \in \mathcal{S}$, and $\gamma \in [0,1]$ is the discount factor. The RL agent makes decisions according to its policy $\pi: \mathcal{S} \rightarrow \mathcal{A}$, assumed deterministic, mapping the states $s \in \mathcal{S}$ to the actions $a \in \mathcal{A}$.
3.2 Distributional reinforcement learning
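In classical Q-learning RL, the core idea is to model the state-action value function $Q^\pi(s,a)$ of a policy $\pi$. This important quantity represents the expected discounted sum of rewards obtained by executing an action $a$ in a state $s$ and then following a policy $\pi$:
$$Q^\pi(s,a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\ a_0 = a,\ a_t = \pi(s_t)\ \forall t \geq 1\right]$$
Key to the learning process is the Bellman equation Bellman1957, which the Q function satisfies:
$$Q^\pi(s,a) = \mathbb{E}_{s', r}\left[r + \gamma\, Q^\pi(s', \pi(s'))\right]$$
In classical RL, the main objective is to determine an optimal policy $\pi^*$, which can be defined based on the optimal state-action value function $Q^*$ as follows:
$$Q^*(s,a) = \mathbb{E}_{s', r}\left[r + \gamma \max_{a'} Q^*(s', a')\right], \qquad \pi^*(s) \in \arg\max_{a} Q^*(s,a)$$
The optimal policy $\pi^*$ maximises the expected return (discounted sum of rewards). This research work later presents an alternative objective criterion for optimality in a risk-sensitive RL setting.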
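As a minimal sketch of the classical, risk-neutral setting described above, the following Python snippet repeatedly applies the Bellman optimality backup on a tiny tabular MDP and extracts the greedy policy. The two-state MDP, the function names and the nested-list layout (`P[s][a][s2]` for transition probabilities, `R[s][a]` for expected rewards) are illustrative assumptions, not part of the original work.

```python
def q_value_iteration(P, R, gamma=0.9, n_iters=200):
    """Approximate the optimal Q function of a small tabular MDP.

    P[s][a][s2] is the probability of reaching s2 from s with action a,
    R[s][a] is the expected immediate reward (both hypothetical inputs).
    """
    n_states, n_actions = len(R), len(R[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iters):
        # Value of each state under the greedy policy of the current Q.
        V = [max(row) for row in Q]
        # Bellman optimality backup for every state-action pair.
        Q = [[R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
              for a in range(n_actions)]
             for s in range(n_states)]
    return Q


def greedy_policy(Q):
    """Return, for each state, the action with the highest Q-value."""
    return [max(range(len(row)), key=row.__getitem__) for row in Q]


# Hypothetical MDP: state 1 is absorbing with zero reward.
# In state 0, action 0 stays in place (reward 0), action 1 moves to state 1 (reward 1).
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
R = [[0.0, 1.0],
     [0.0, 0.0]]

Q = q_value_iteration(P, R, gamma=0.9)
policy = greedy_policy(Q)
```

In this toy example the greedy policy picks action 1 in state 0, since only the expectation of the return is considered.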
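The distributional RL approach goes a step further by modelling the complete probability distribution over returns instead of only its expectation Bellemare2017, as illustrated in Figure 1. To this end, let the reward $R(s,a)$ be a random variable distributed under $p_R(\cdot|s,a)$. The state-action value distribution $Z^\pi(s,a)$ of a policy $\pi$ is a random variable defined as follows:
$$Z^\pi(s,a) \overset{D}{=} \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t), \quad s_0 = s,\ a_0 = a,\ a_t = \pi(s_t),\ s_{t+1} \sim p_T(\cdot|s_t, a_t)$$
where $A \overset{D}{=} B$ denotes the equality in probability distribution between the random variables $A$ and $B$. Therefore, the state-action value function $Q^\pi$ is the expectation of the random return $Z^\pi$. As in the expected case, there exists a distributional Bellman equation that recursively describes the random return of interest:
$$Z^\pi(s,a) \overset{D}{=} R(s,a) + \gamma P^\pi Z^\pi(s,a), \qquad P^\pi Z^\pi(s,a) \overset{D}{:=} Z^\pi(s', \pi(s')), \quad s' \sim p_T(\cdot|s,a)$$
where $P^\pi$ is the transition operator. To end this section, one can define the distributional Bellman operator $\mathcal{T}^\pi$ together with the distributional Bellman optimality operator $\mathcal{T}^*$ as follows:
$$\mathcal{T}^\pi Z(s,a) \overset{D}{=} R(s,a) + \gamma P^\pi Z(s,a)$$
$$\mathcal{T}^* Z(s,a) \overset{D}{=} R(s,a) + \gamma Z(s', a'), \quad s' \sim p_T(\cdot|s,a),\ a' \in \arg\max_{a''} \mathbb{E}\left[Z(s', a'')\right]$$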
A distributional RL algorithm may be characterised by two core features. Firstly, both the representation and parameterisation of the random return probability distribution have to be properly selected. There exist multiple solutions for representing a unidimensional distribution: probability density function (PDF), cumulative distribution function (CDF), quantile function (QF). In practice, deep neural networks (DNNs) are generally used for the approximation of these particular functions. The second key feature relates to the probability metric adopted for comparing two distributions, such as the Kullback-Leibler (KL) divergence, the Cramer distance or the Wasserstein distance. More precisely, the role of the probability metric in distributional RL is to quantitatively compare two probability distributions of the random return so that a temporal difference (TD) learning method can be applied, in a similar way to the mean squared error between Q-values in classical RL. The choice of probability metric matters all the more as different metrics offer distinct theoretical convergence guarantees for distributional RL.
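To make the role of the probability metric more concrete, the sketch below compares two discrete return distributions defined on a shared, uniformly spaced grid of return values. It is only an illustration under stated assumptions: the function names, the grid spacing `dz` and the probability mass functions are hypothetical, the Cramer distance is taken as the square root of the integrated squared CDF difference, and the Wasserstein distance shown is the 1-Wasserstein variant.

```python
import math


def cdf_from_pmf(pmf):
    """Cumulative sums of a probability mass function defined on a grid."""
    total, cdf = 0.0, []
    for p in pmf:
        total += p
        cdf.append(total)
    return cdf


def cramer_distance(pmf_p, pmf_q, dz):
    """Square root of the integral of (F_P - F_Q)^2 over a uniform grid."""
    F, G = cdf_from_pmf(pmf_p), cdf_from_pmf(pmf_q)
    return math.sqrt(sum((f - g) ** 2 for f, g in zip(F, G)) * dz)


def wasserstein1_distance(pmf_p, pmf_q, dz):
    """Integral of |F_P - F_Q| over a uniform grid (1-Wasserstein distance)."""
    F, G = cdf_from_pmf(pmf_p), cdf_from_pmf(pmf_q)
    return sum(abs(f - g) for f, g in zip(F, G)) * dz


# Hypothetical example on the grid z = [0, 1, 2] (spacing dz = 1):
# all the mass at z = 0 versus all the mass at z = 2.
p = [1.0, 0.0, 0.0]
q = [0.0, 0.0, 1.0]
d_cramer = cramer_distance(p, q, dz=1.0)
d_wasserstein = wasserstein1_distance(p, q, dz=1.0)
```

Either quantity could play the part of the TD-error between a predicted and a target return distribution; which one is appropriate depends on the algorithm and on the convergence guarantees sought.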
4.1 Objective criterion for risk-sensitive RL
As previously explained, the objective in classical RL is to learn a decision-making policy $\pi$ that maximises in expectation the discounted sum of rewards. Formally, this objective criterion can be expressed as follows:
$$\max_{\pi}\ \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right] \tag{10}$$
In order to effectively take the risk into consideration and value its mitigation, this research work presents an update of the former objective. Coming up with a generic definition of the risk is not trivial, since the risk generally depends on the decision-making problem itself. In the present research work, it is assumed that the risk is assessed on the basis of the worst returns achievable by a policy $\pi$. Therefore, a successful decision-making policy should ideally maximise the expected discounted sum of rewards while avoiding low values for the worst-case returns. The latter requirement is approximated with a new constraint attached to the former objective defined in Equation 10: the probability of the policy achieving a return lower than a certain minimum value should not exceed a given threshold. Mathematically, the alternative objective criterion proposed for risk-sensitive RL can be expressed as follows:
$$\max_{\pi}\ \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right] \quad \text{such that} \quad P\left[\sum_{t=0}^{\infty} \gamma^t r_t \leq R_{min}\right] \leq \epsilon \tag{11}$$
where:
$P[\star]$ denotes the probability of the event $\star$,
$R_{min}$ is the minimum acceptable return (from the perspective of risk mitigation),
$\epsilon \in [0,1]$ is the threshold probability to not exceed.
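The constraint described above can be checked empirically from sampled trajectories. The following sketch, given under stated assumptions, estimates by Monte Carlo the probability that the return falls below the minimum acceptable value and compares it with the threshold. The function names, the argument names `min_return` and `threshold`, the strict inequality and the sample values are all hypothetical choices made for illustration.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of a sequence of rewards collected along one trajectory."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


def violation_probability(returns, min_return):
    """Fraction of sampled returns that are lower than the minimum acceptable return."""
    return sum(1 for g in returns if g < min_return) / len(returns)


def satisfies_risk_constraint(returns, min_return, threshold):
    """True if the estimated violation probability does not exceed the threshold."""
    return violation_probability(returns, min_return) <= threshold


# Hypothetical returns obtained by running a policy four times.
sampled_returns = [1.0, 0.8, -0.5, 0.9]
ok_loose = satisfies_risk_constraint(sampled_returns, min_return=0.0, threshold=0.3)
ok_strict = satisfies_risk_constraint(sampled_returns, min_return=0.0, threshold=0.1)
```

With so few samples the estimate is of course very noisy; in practice a much larger number of trajectories would be needed for a meaningful check.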
4.2 Practical modelling of the risk
As previously hinted, this research work assumes that the risk associated with an action is related to the worst achievable returns when executing that particular action and then following a certain decision-making policy $\pi$. In such a context, the distributional RL approach becomes particularly interesting, by providing access to the full probability distribution of the random return $Z^\pi$. Thus, the risk can be efficiently assessed by examining the so-called tail of the learnt probability distribution. Moreover, the new constraint in Equation 11 can be approximated through popular risk measures such as the Value at Risk and Conditional Value at Risk. Illustrated in Figure 2, these two risk measures are formally expressed as follows:
$$\text{VaR}_\rho(Z^\pi) = \inf\left\{z \in \mathbb{R} : F_{Z^\pi}(z) \geq \rho\right\}$$
$$\text{CVaR}_\rho(Z^\pi) = \mathbb{E}\left[z \mid z \leq \text{VaR}_\rho(Z^\pi)\right]$$
where $F_{Z^\pi}$ represents the CDF of the random return $Z^\pi$.
More generally, this research work introduces the state-action risk function $R^\pi(s,a)$ of a decision-making policy $\pi$, which is the equivalent of the Q function for the risk. More precisely, that function quantifies the riskiness of the discounted sum of rewards obtained by executing an action $a$ in a state $s$ and then following a policy $\pi$:
$$R^\pi(s,a) = R_\rho\left[Z^\pi(s,a)\right]$$
where:
$R_\rho$ is a function extracting risk features from the random return probability distribution $Z^\pi$, such as $\text{VaR}_\rho$ or $\text{CVaR}_\rho$,
is a parameter corresponding to the cumulative probability associated with the worst returns, generally between and . In other words, this parameter controls the size of the random return distribution tail from which the risk is estimated.
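The two risk measures mentioned in this section can be estimated from samples of the random return. The sketch below is one possible empirical version, assuming the return distribution is available as a list of sampled returns; the function names and the name `rho` for the tail cumulative probability are illustrative assumptions, and a distributional RL algorithm would typically read these quantities from its learnt CDF or quantile function instead.

```python
import math


def value_at_risk(samples, rho):
    """Empirical VaR: smallest sample z whose empirical CDF value is at least rho."""
    ordered = sorted(samples)
    # Number of samples needed to accumulate a probability mass of rho
    # (the small tolerance guards against floating-point round-off).
    k = max(1, math.ceil(rho * len(ordered) - 1e-12))
    return ordered[k - 1]


def conditional_value_at_risk(samples, rho):
    """Empirical CVaR: mean of the samples lying at or below the VaR."""
    var = value_at_risk(samples, rho)
    tail = [z for z in samples if z <= var]
    return sum(tail) / len(tail)


# Hypothetical sampled returns with a small probability of a bad outcome.
returns = [-1.0, 0.0, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0]
var_25 = value_at_risk(returns, rho=0.25)
cvar_25 = conditional_value_at_risk(returns, rho=0.25)
```

Here the VaR only reports the boundary of the tail, whereas the CVaR also reflects how bad the returns inside the tail are.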
4.3 Risk-based utility function
In order to pursue the objective criterion defined in Equation 11 for risk-sensitive RL, this research work introduces a new concept: the state-action risk-based utility function. Denoted $U^\pi(s,a)$, the utility function assesses the quality of an action $a$ in a certain state $s$, in terms of expected performance and risk, assuming that the policy $\pi$ is followed afterwards. The intent is to extend the popular Q function so that the risk is taken into consideration, by taking advantage of the risk function defined in Section 4.2. More precisely, the utility function $U^\pi$ is built as a linear combination of the $Q^\pi$ and $R^\pi$ functions. Formally, the risk-based utility function of a policy $\pi$ is defined as follows:
$$U^\pi(s,a) = \alpha\, Q^\pi(s,a) + (1 - \alpha)\, R^\pi(s,a)$$
where $\alpha \in [0,1]$ is a parameter controlling the trade-off between expected performance and risk. If $\alpha = 0$, the utility function will be maximised with a fully risk-averse decision-making policy. On the contrary, if $\alpha = 1$, the utility function degenerates into the Q function quantifying the performance on expectation. Figure 3 graphically describes the utility function $U^\pi$, which moves along the x-axis between the quantities $R^\pi$ and $Q^\pi$ when modifying the value of the parameter $\alpha$.
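The linear combination described above can be illustrated with a short sketch. The parameter name `alpha`, its convention (weight placed on the expected return, the remainder on the risk term) and the numerical values of the two candidate actions are assumptions made for this example only.

```python
def utility(q_value, risk_value, alpha):
    """Linear combination of expected return and risk, with alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * q_value + (1.0 - alpha) * risk_value


def select_action(candidates, alpha):
    """Pick the action with the highest utility.

    candidates maps an action name to a (expected return, risk) pair.
    """
    return max(candidates, key=lambda a: utility(*candidates[a], alpha))


# Hypothetical values: the risky action is slightly better in expectation
# but has much worse tail returns than the safe action.
candidates = {"risky": (0.35, -1.0), "safe": (0.33, 0.0)}

choice_neutral = select_action(candidates, alpha=1.0)    # expectation only
choice_balanced = select_action(candidates, alpha=0.5)   # equal weights
choice_averse = select_action(candidates, alpha=0.0)     # risk only
```

Sweeping `alpha` between its two extremes is a simple way to inspect at which point the preferred action switches for a given decision-making problem.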
4.4 Risk-sensitive distributional RL algorithm
In most applications, the motivation for choosing the distributional RL approach over the classical one is related to the improved expected performance that results from the learning of a richer representation of the environment. Despite having access to the full probability distribution of the random return, only the expectation is exploited to derive decision-making policies:
Nevertheless, as previously hinted, the random return $Z^\pi$ also contains valuable information about the risk, which could be exploited to learn risk-sensitive decision-making and exploration policies. The present research work proposes achieving risk-sensitive distributional RL by maximising the utility function $U^\pi$, derived from $Z^\pi$, instead of the expected return $Q^\pi$ when selecting actions. This alternative operation would be performed during both exploration and exploitation. Even though maximising the utility function is not exactly equivalent to the optimisation of the objective criterion defined in Equation 11, it is a relevant step towards risk-sensitive RL. Consequently, a risk-sensitive policy can be derived as follows:
$$\pi(s) \in \arg\max_{a} U^\pi(s,a)$$
Throughout the learning phase, a classical Q-learning algorithm is expected to progressively converge towards the optimal value function that naturally arises from the optimal policy . In a similar way, the proposed risk-sensitive RL algorithm jointly learns the optimal policy and the optimal state-action risk-based utility function . More formally, the latter two are mathematically defined as the following:
The novel methodology proposed by this research work to learn risk-sensitive decision-making policies based on the distributional RL approach is summarised as follows. Firstly, select any distributional RL algorithm that learns the full probability distribution of the random return . Secondly, leave the learning process unchanged except for action selection, which involves the maximisation of the utility function rather than the expected return . This adaptation is the single change to the distributional RL algorithm, occurring at two different locations within the algorithm: i. the generation of new experiences by interacting with the environment, ii. the learning of the random return based on the distributional Bellman equation. However, this adaptation has no consequence on the random return itself learnt by the distributional RL algorithm, only on the actions derived from that probability distribution. Algorithm 1 details the proposed solution in a generic way, with the required modifications highlighted.
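The action-selection change summarised above can be sketched as follows. This is not the pseudo-code of Algorithm 1, only a hedged illustration: each action's learnt return distribution is assumed to be available as a pair of lists (return values sorted in ascending order and their probabilities), the VaR is assumed as the risk measure, and the names `alpha`, `rho`, `epsilon` and `risk_sensitive_action` are hypothetical.

```python
import random


def expectation(atoms, probs):
    """Expected return of a discrete return distribution."""
    return sum(z * p for z, p in zip(atoms, probs))


def value_at_risk(atoms, probs, rho):
    """Smallest return value whose cumulative probability reaches rho.

    atoms must be sorted in ascending order.
    """
    cumulative = 0.0
    for z, p in zip(atoms, probs):
        cumulative += p
        if cumulative >= rho - 1e-12:
            return z
    return atoms[-1]


def risk_sensitive_action(distributions, alpha, rho, epsilon=0.0, rng=random):
    """Epsilon-greedy selection that maximises the utility instead of the expectation."""
    if rng.random() < epsilon:
        # Exploration step: uniformly random action.
        return rng.randrange(len(distributions))
    utilities = []
    for atoms, probs in distributions:
        q = expectation(atoms, probs)
        r = value_at_risk(atoms, probs, rho)
        utilities.append(alpha * q + (1.0 - alpha) * r)
    return max(range(len(utilities)), key=utilities.__getitem__)


# Hypothetical learnt distributions for two actions in a given state.
distributions = [
    ([-1.0, 1.0], [0.25, 0.75]),  # action 0: higher expectation, bad tail
    ([0.3], [1.0]),               # action 1: lower expectation, no risk
]
greedy_neutral = risk_sensitive_action(distributions, alpha=1.0, rho=0.1)
greedy_sensitive = risk_sensitive_action(distributions, alpha=0.5, rho=0.1)
```

The same function would be called both when interacting with the environment and when choosing the next action inside the distributional Bellman target, which are the two locations mentioned in the text.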
5 Performance assessment methodology
5.1 Benchmark environments
This research work introduces some novel benchmark environments in order to assess the soundness of the proposed methodology to design risk-sensitive policies based on the distributional RL approach. These environments consist of three toy problems that are specifically designed to highlight the importance of taking the risk into consideration for a decision-making policy. More precisely, the control problems are built in such a way that the optimal policy differs depending on whether the objective is to solely maximise the expected performance or to also mitigate the risk. This is achieved by including relevant stochasticity in both the state transition function and the reward function. Moreover, the benchmark environments are designed to be relatively simple in order to ease the analysis and understanding of the decision-making policies learnt. This simplicity also ensures the accessibility of the experiments, since distributional RL algorithms generally require a considerable amount of computing power. Figure 4 illustrates these three benchmark environments, and highlights the optimal paths to be learnt depending on the objective pursued. For the sake of completeness, a thorough description of the underlying MDPs is provided in Appendix A.
The first benchmark environment presented is named risky rewards. It consists of a grid world within which an agent has to reach one of two objective areas, that are equidistant from its fixed initial location. The difficulty of this control problem lies in the choice of the objective area to target, because of the stochasticity present in the reward function. Reaching the first objective area yields a reward with a lower value in expectation and a limited deviation from that average. On the contrary, reaching the second objective location yields a reward that is higher in expectation, at the cost of an increased risk.
The second benchmark environment studied is named risky transitions. It consists of a grid world within which an agent has to reach one of two objective areas as quickly as possible, in the presence of a stochastic wind. The agent is initially located in a fixed area that is very close to an objective, but the move required to reach it goes against the wind direction. Following that path results in a reward that is higher in expectation, but there is a risk of being repeatedly pushed back by the stochastic wind. On the contrary, the longer path is safer but yields a lower reward on average.
The last benchmark environment presented is named risky grid world. This control problem can be viewed as a combination of the two environments previously described since it integrates both stochastic rewards and transitions. It consists once again of a grid world within which an agent, initially located in a fixed area, has to reach a fixed objective location as quickly as possible. To achieve that goal, three paths are available. The agent may choose the shortest path to the objective location that is characterised by a stochastic trap, or get around this risky situation by taking a significantly longer route. This bypass can be done from the left or from the right, another critical choice in terms of risk because of the stochastic wind. Once again, the optimal path is therefore dependent on the objective criterion to pursue.
5.2 Risk-sensitive distributional RL algorithm analysed
The distributional RL algorithm selected to assess the soundness of the methodology introduced for learning risk-sensitive decision-making policies is the Unconstrained Monotonic Deep Q-Network with Cramer (UMDQN-C) Theate2021. Basically, this particular distributional RL algorithm models the CDF of the random return in a continuous way by taking advantage of the Cramer distance for deriving the TD-error. Moreover, the probability distributions learnt are ensured to be valid thanks to the specific architecture exploited to model the random return: Unconstrained Monotonic Neural Network (UMNN) Wehenkel2019. The latter has been demonstrated to be a universal approximator of continuous monotonic functions, which is particularly convenient for representing CDFs. In practice, the UMDQN-C algorithm has been shown to achieve great results, both in terms of policy performance and in terms of random return probability distribution quality. This second feature clearly motivates the selection of this specific distributional RL algorithm to conduct the following experiments, since accurate random return probability distributions are required to properly estimate the risk. The reader can refer to the original research paper Theate2021 for more information about the UMDQN-C distributional RL algorithm.
As previously explained in Section 4.2, the approach presented requires the choice of a function for extracting risk features from the random return probability distribution . In the following experiments, the Value at Risk (VaR) is adopted to estimate the risk. This choice is motivated by both the popularity of that risk measure in practice and by the efficiency of computation. Indeed, this quantity can be conveniently derived from the CDF of the random return learnt by the UMDQN-C algorithm.
In the next section presenting the results achieved, the risk-sensitive version of the UMDQN-C algorithm is denoted RS-UMDQN-C. The detailed pseudo-code of that new risk-sensitive distributional RL algorithm can be found in Appendix B.
To conclude this section, ensuring the reproducibility of the results in a transparent way is particularly important to this research work. In order to achieve that, Table 1 provides a brief description of the key hyperparameters used in the experiments. Moreover, the entire experimental code is made publicly available at the following link: https://github.com/ThibautTheate/Risk-Sensitive-Policy-with-Distributional-Reinforcement-Learning.
Table 1: Key hyperparameters used in the experiments.
|Hyperparameter|
|Deep learning optimiser epsilon|
|Replay memory capacity|
|Target update frequency|
|Random return resolution|
|Random return lower bound|
|Random return upper bound|
|Exploration ε-greedy initial value|
|Exploration ε-greedy final value|
|Exploration ε-greedy decay|
6.1 Decision-making policy performance
To begin with, the performance achieved by the decision-making policies learnt has to be evaluated, both in terms of expected outcome and risk. For comparison purposes, the results obtained by the well-established DQN algorithm, a reference without any form of risk sensitivity, are presented alongside those of the newly introduced RS-UMDQN-C algorithm. It shall be mentioned that these two RL algorithms achieve very similar results when risk sensitivity is disabled ( for the RS-UMDQN-C algorithm), as expected.
In the following, two analyses are presented. Firstly, the probability distribution of the cumulative reward of a policy , denoted , is investigated. More precisely, the expectation , the risk function and the utility function of that random variable are derived for each algorithm and compared. Secondly, this research work introduces a novel easy to interpret performance indicator for evaluating the risk-sensitivity of the decision-making policies learnt, by taking advantage of the simplicity of the benchmark environments presented in Section 5.1. In fact, it is made possible by the easy assessment from a human perspective of the relative riskiness of a path in the grid world environments studied. If the optimal path in terms of risk is chosen (green arrows in Figure 4), a score is awarded. On the contrary, the riskier path but optimal in expectation (orange arrows in Figure 4) yields a score . If no objective nor trap areas are reached within the time allowed, a score is delivered. Consequently, the evolution of this performance indicator provides valuable information about the convergence of the RL algorithms towards the different possible paths as well as about the stability of the learning process. Formally, let with and be a trajectory defined over a time horizon (ending with a terminal state, and subject to an upper bound), and let and respectively be the sets of trajectories associated to the green and orange paths in Figure 4. Based on these definitions, the risk-sensitivity of a policy is a random variable that can be assessed via Monte Carlo as the following:
The first results on policy performance are summarised in Table 2, which compares the decision-making policies learnt by the DQN and RS-UMDQN-C algorithms both in terms of expected outcome and risk. The second results on policy performance are compiled in Figure 5 plotting the evolution of the risk-sensitivity performance indicator during the training phase. It can be clearly observed from these two analyses that the proposed approach is effective in learning decision-making policies that are sensitive to the risk for relatively simple environments. As expected, the DQN algorithm yields policies that are optimal in expectation whatever the level of risk incurred. In contrast, the RS-UMDQN-C algorithm is able to leverage both expected outcome and risk in order to learn decision-making policies that produce a slightly lower expected return with a significantly lower risk level. This allows the proposed methodology to significantly outperform the risk-neutral RL algorithm of reference with respect to the performance indicator of interest in Table 2. Finally, it is also encouraging to observe from Figure 5 that the learning process seems to be quite stable for simple environments, despite having to maximise a much more complicated function.
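Table 2: Comparison of the decision-making policies learnt by the DQN and RS-UMDQN-C algorithms, in terms of expected outcome, risk and utility.
|Benchmark environment|DQN expectation|DQN risk|DQN utility|RS-UMDQN-C expectation|RS-UMDQN-C risk|RS-UMDQN-C utility|
|Risky grid world|0.347|-1.03|-0.342|0.333|0.018|0.175|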
6.2 Probability distribution visualisation
A core advantage of the proposed solution is the improved interpretability of the resulting decision-making process. Indeed, understanding and motivating the decisions output by the learnt policy is greatly facilitated by access to the jointly learnt probability distributions of the random return. In addition, the analysis and comparison of the value, risk and utility functions (Q, R and U) associated with different actions provide a valuable summary of the decision-making process, but also of the control problem itself. Such an analysis may be particularly important to correctly tune the risk trade-off parameter according to the user’s risk aversion.
As an illustration, Figure 6 shows some random return probability distributions that are learnt by the RS-UMDQN-C algorithm. More precisely, a single relevant state is selected for analysis for each benchmark environment. The selection is based on the importance of the next decision in following a clear path, either maximising the expected outcome or mitigating the risk. Firstly, it can be observed that the risk-sensitive distributional RL algorithm does manage to accurately learn the probability distributions of the random return, qualitatively from a human perspective. In particular, the multimodality purposely designed to create risky situations appears to be well preserved. Such a result is particularly encouraging since this feature is essential to the success of the proposed methodology. Indeed, it ensures the accurate estimation of the risk as defined in Section 4.2. This observation is in line with the findings of the research paper Theate2021 introducing the UMDQN algorithm, and suggests that the solution introduced to achieve risk-sensitivity does not substantially alter the properties of the original distributional RL algorithm. Secondly, as previously explained, Figure 6 highlights the relevance of each function introduced (Q, R and U) for making and motivating a decision. Their analysis contributes to the understanding of the potential trade-off between expected performance maximisation and risk mitigation for a given decision-making problem, as well as of the extent to which different values of the trade-off parameter lead to divergent policies.
The present research work introduces a straightforward yet efficient solution to learn risk-sensitive decision-making policies based on the distributional RL approach. The proposed methodology presents key advantages. Firstly, it is compatible with any distributional RL algorithm, and requires only minimal modification to the original algorithm. Secondly, the simplicity of the approach contributes to the interpretability and ease of analysis of the resulting risk-sensitive policies, a particularly important feature to avoid black-box machine learning models. Lastly, the solution presented makes it possible to cover the complete potential trade-off between expected outcome maximisation and risk minimisation. The first experiments performed on three relevant toy problems yield promising results, which may be viewed as a proof of concept for the accessible and practical solution introduced.
Some interesting leads can be suggested as future work. Firstly, the research conducted is exclusively empirical and does not study any theoretical guarantees about the resulting risk-sensitive distributional RL algorithms. Among others, the study of the convergence of these algorithms would be a relevant future research direction. Secondly, building on the promising results achieved, the solution presented should definitely be evaluated on more complex environments, for which the risk should ideally be mitigated. Lastly, the approach could be extended to not only mitigate the risk but also to completely discard actions that would induce an excessive level of risk, in order to increase compliance with the objective criterion originally defined in Section 4.1.
Thibaut Théate is a Research Fellow of the F.R.S.-FNRS, of which he acknowledges the financial support.
Appendix A Benchmark environments
Risky rewards environment
The underlying MDP can be described as follows:
, a state being composed of the two coordinates of the agent within the grid,
, with an action being a moving direction,
and if the agent reaches the first objective location (terminal state),
and with a 75% chance, and and with a 25% chance if the agent reaches the second objective location (terminal state),
associates a 100% chance to move once in the chosen direction, while keeping the agent within the grid world (crossing a border is not allowed),
Risky transitions environment
The underlying MDP can be described as the following:
, a state being composed of the two coordinates of the agent within the grid,
, with an action being a moving direction,
and if the agent reaches one of the objective locations (terminal state),
associates a 100% chance to move once in the chosen direction and a 50% chance to get pushed once to the left by the stochastic wind, while keeping the agent within the grid world,
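The stochastic transition described for this environment can be sketched as a single step function. The grid dimensions, the coordinate convention (x grows to the right, y grows upwards), the four action names and the function name `windy_step` are assumptions introduced for illustration; they are not taken from the original MDP specification.

```python
import random

# Hypothetical action set: four moving directions as (dx, dy) offsets.
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def clip(value, low, high):
    """Keep a coordinate inside the closed interval [low, high]."""
    return max(low, min(high, value))


def windy_step(state, action, width, height, wind_prob=0.5, rng=random):
    """Move once in the chosen direction, then possibly get pushed once to the left.

    The agent always stays within the grid (crossing a border is not allowed).
    """
    x, y = state
    dx, dy = MOVES[action]
    x = clip(x + dx, 0, width - 1)
    y = clip(y + dy, 0, height - 1)
    if rng.random() < wind_prob:
        # Stochastic wind: one extra cell to the left.
        x = clip(x - 1, 0, width - 1)
    return (x, y)


# Deterministic checks obtained by switching the wind fully off or fully on.
no_wind = windy_step((1, 1), "right", width=3, height=3, wind_prob=0.0)
full_wind = windy_step((1, 1), "right", width=3, height=3, wind_prob=1.0)
```

With the wind always blowing, a move to the right is exactly cancelled, which illustrates why the short path against the wind may take many steps.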
Risky grid world environment
The underlying MDP is described as follows:
, a state being composed of the two coordinates of the agent within the grid,
, with an action being a moving direction,
and if the agent reaches the objective location (terminal state),
and with a 75% chance, and and with a 25% chance if the agent reaches the stochastic trap location (terminal state),
associates a 100% chance to move once in the chosen direction and a 25% chance to get pushed once to the left by the stochastic wind, while keeping the agent within the grid world,