HI-VAL: Iterative Learning of Hierarchical Value Functions for Policy Generation

Abstract. Task decomposition is effective in various applications where the global complexity of a problem makes planning and decision-making too demanding. This is true, for example, in high-dimensional robotics domains, where (1) unpredictabilities and modeling limitations typically prevent the manual specification of robust behaviors, and (2) learning an action policy is challenging due to the curse of dimensionality. In this work, we borrow the concept of Hierarchical Task Networks (HTNs) to decompose the learning procedure, and we exploit Upper Confidence Tree (UCT) search to introduce HI-VAL, a novel iterative algorithm for hierarchical optimistic planning with learned value functions. To obtain better generalization and generate policies, HI-VAL simultaneously learns and uses action values. These are used to formalize constraints within the search space and to reduce the dimensionality of the problem. We evaluate our algorithm both on a fetching task using a simulated 7-DOF KUKA lightweight arm and on a pick and delivery task with a Pioneer robot.


Introduction
Generating effective action policies is impractical in various applications that are characterized by large state spaces and require strong generalization capabilities. Many techniques tackle the curse-of-dimensionality problem by using expert demonstrations to initialize agents' behaviors and guide the learning process. However, this is not feasible when only a reduced number of examples is available and a direct mapping between the expert's and the agent's action space is difficult to obtain (e.g., highly redundant robots). While generalization is typically achieved by means of function approximation (e.g., using neural networks), solely relying on this and excluding prior knowledge can be inefficient, slow and even dangerous in multiple applications, such as robotics. Hence, Monte-Carlo tree search algorithms, and in particular the Upper Confidence Tree (UCT) algorithm [13], are widely used to exploit prior knowledge [21] in exploring the search space. Nonetheless, they show limitations in generalizing among related states.
We build upon these state-of-the-art techniques to directly learn action values from experience, and accordingly learn a policy by using UCT with focused exploration. To this end, we introduce HI-VAL, a novel iterative algorithm for learning hierarchical value functions that are used to (1) capture multi-layered action semantics, (2) generate policies by scaffolding the acquired knowledge, and (3) guide the exploration of the state space. HI-VAL improves the UCT algorithm and builds upon concepts from previous literature, such as Hierarchical Task Networks (HTNs) [8], semi-MDPs [23] and MAX-Q decompositions [7], to decompose the learning procedure and to generate both action abstractions and search space constraints. The action hierarchy formalized by HI-VAL is learned iteratively by evaluating state-action pairs generated by UCT after each episode. Fig. 1 shows an example of such a hierarchy, where states and actions are associated at different layers of abstraction. HI-VAL assigns states and actions to different clusters s_cl and a_cl by evaluating the similarity of the successor states that the agent can reach by applying the actions in a_cl in the states contained in s_cl. Intuitively, similar successor states have similar reward values and can be evaluated altogether when exploring the search space. Different layers provide different granularities of action semantics (the higher the layer, the coarser the semantics) and help the learning process to evaluate states hierarchically. HI-VAL runs UCT to explore the environment by sampling the joint distribution of rewards and state-action pairs. Each sample is continuously aggregated into a dataset that is used to estimate, by means of Q-learning, the value function Q_λ of each layer λ in the hierarchy. Specifically, at each layer, Monte Carlo search is run for a subset of actions that are evaluated according to their Q-value, thus driving the node-expansion phase during episode simulation.
In this work, we aim at demonstrating that Q-values can be learned hierarchically to influence exploration, and to represent action semantics at different levels of abstraction, thus linking learning techniques to low-level agent controls. Our main contributions consist in (1) a novel integration of Monte-Carlo tree search, hierarchical planning and Q-learning, which enables good performance with selective state exploration and improved generalization capabilities, (2) a two-sided extension of TD-search, which not only executes on multiple hierarchy layers, but also constructs upper confidence bounds on the value functions and selects actions optimistically with respect to those, and (3) a reduction of the curse of dimensionality obtained by means of focused exploration. We evaluated the HI-VAL performance in two different scenarios: an "object-fetching" task with a 7-DOF KUKA lightweight arm, and a "pick and delivery" task in a simple environment with a Pioneer robot, where the agent has to collect an item and deliver it to an operator. The experiments highlight a reduction in the number of states explored, which makes the method more suitable for robotic applications, and the ability of HI-VAL to represent action semantics through its hierarchy in order to boost the learning process.

Related Work
Policy learning is widely adopted to generate practical behaviors in several applications. This is true for complex domains, such as robotics [4], where unstructured environments and uncertain dynamics are difficult to address through handcrafted policies, which typically fail or must be refined [14]. Designing effective policies is impractical in most of these scenarios, and learning techniques are typically demanding and time consuming [11] for problems with large state spaces. The computational demand can be alleviated by initializing a policy with expert demonstrations that restrict the learning process to a promising hypothesis space [12,20]. However, to apply imitation learning in complex domains, a large dataset of good-quality expert demonstrations is generally required, which can be efficiently mapped to the agent's action space. Unfortunately, this is not always possible due to the lack of (1) domain experts, (2) practical ways of providing demonstrations, and (3) action mappings from experts to agents (e.g., hyper-redundant robots).
To overcome these difficulties, we propose an approach that does not require expert demonstrations to initialize agent behaviors, and decomposes the learning procedure to generate action abstractions and search space constraints. In the literature, multiple authors exploit the notions of skills and semi Markov Decision Processes (semi-MDPs), and define hierarchical representations such as options [23] and MAX-Q decompositions [7]. Unfortunately, applications of these methods in complex domains like robotics are limited, and prior knowledge has to be enforced in the learning process by means of expert demonstrations. In fact, although hierarchical learning and value function approximation techniques have been adopted in several applications, state-of-the-art techniques still show a considerable margin for improvement. For example, [19] provide better policy generalization by exploiting the concept of Generalized Value Functions to improve value function approximation. In a different setting, [6] use expert demonstrations to learn high-level tasks as a combination of action primitives. Unfortunately, these approaches only learn specific hierarchical structures that poorly generalize and cannot profit from the expressiveness of value functions. Similarly, [22] apply hierarchical learning to sequences of motion primitives on a pick-and-place task with a hyper-redundant robotic arm. [15] initialize skill trees from human demonstrations, improving them over time. However, their representations rely on expert demonstrations and do not represent actions at higher levels of abstraction. Conversely, [10] apply hierarchical policy learning to solve a 2-DOF stand-up task for a robotic arm. They exploit Q-learning and actor-critic methods to learn both task decompositions and local trajectories that solve specific sub-goals. Alternatively, [9] and [2] formalize action hierarchies to represent actions at different levels of abstraction.
However, these procedures are not easily scalable to higher-dimensional problems.
Motivated by our discussion, we extend our previous work [17,16] to formalize an action hierarchy by introducing HI-VAL, an iterative algorithm that learns hierarchical value functions to drive the policy search, and that achieves good generalization with a focused state exploration without the aid of expert demonstrations. Specifically, we enable HI-VAL to (1) learn a hierarchical value function directly from experience and (2) simultaneously use the learned values during exploration, to generate a competitive policy. As in Hierarchical Task Networks, value functions are used to plan over both compound and primitive action spaces, where compound actions have specific implementations that depend on the state of the environment. Like [21], we exploit TD (temporal difference) methods to learn action values during a Monte-Carlo tree search. In this way, on the one hand, we are able to learn Q-values, and on the other hand, we adopt Q-values to support decision-theoretic planning for generalization and exploration at multiple levels of abstraction. Differently from [21], in fact, not only do we improve our model by preserving the selective search of Monte-Carlo algorithms while bootstrapping is ongoing, but we also generate action and state abstractions. Specifically, HI-VAL extends TD-search by constructing upper confidence bounds on a hierarchy of value functions, and by selecting actions optimistically with respect to those. As the experimental evaluation shows in both scenarios, HI-VAL generates competitive policies, with the additional benefit of a reduction in (1) the number of simulations (or expanded nodes) and (2) the exploration of the search space, which alleviates the curse of dimensionality.

HI-VAL
Formulation. HI-VAL is an iterative algorithm that, at each iteration i, generates a new policy π_i which improves π_{i−1} [17]. To obtain an improved π_i, our algorithm leverages (1) data aggregation [18], and (2) Upper Confidence Bounds for Trees (UCT) [13], a variant of Monte-Carlo Tree Search that adopts an upper confidence bound strategy, UCB1 [3], for balancing between exploration and exploitation on the tree. To describe HI-VAL we adopt the Markov Decision Process (MDP) notation, in which the decision-making problem is represented as a tuple MDP = (S, A, T, R, γ), where S is the set of discrete states of the environment, A represents the set of discrete actions, T : S × A × S → [0, 1] is a stochastic transition function that models the probability of transitioning from state s ∈ S to s′ ∈ S when taking action a ∈ A, R : S × A → ℝ is the reward function, and γ is a discount factor in [0, 1).
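The tuple above can be sketched as a small container; this is a hypothetical minimal encoding (the field names and the toy two-state chain are ours, not the paper's):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class MDP:
    # Discrete state and action sets of the environment.
    states: List[int]
    actions: List[int]
    # T[(s, a, s_next)] -> transition probability.
    transition: Dict[Tuple[int, int, int], float]
    # R(s, a) -> immediate reward.
    reward: Callable[[int, int], float]
    gamma: float = 0.9  # discount factor in [0, 1)

# A toy 2-state, 2-action chain: action 1 moves to the other state,
# action 0 stays put; only taking action 1 in state 0 is rewarded.
toy = MDP(
    states=[0, 1],
    actions=[0, 1],
    transition={(0, 1, 1): 1.0, (1, 1, 0): 1.0,
                (0, 0, 0): 1.0, (1, 0, 1): 1.0},
    reward=lambda s, a: 1.0 if (s, a) == (0, 1) else 0.0,
    gamma=0.8,
)
```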
Function Approximation. HI-VAL addresses the generalization problem by relying on previous literature [5]. We choose to approximate the Q function using probability densities in the form of a mixture of K Gaussians (i.e., Gaussian Mixture Models, GMMs). We integrate the approach in [1] with a data aggregation [18] procedure, where a dataset of samples is iteratively collected and aggregated. Specifically, at each iteration i, the Q-values Q_i are determined according to the Q-learning update rule

Q_i(s, a_j) = Q̂(s, a_j) + α (R(s, a_j) + γ max_{a′} Q̂(s′, a′) − Q̂(s, a_j)),   (1)

where α is the learning rate, Q̂ is the function approximation learned at the previous iteration, and Q_0(s, a_j) = 0. As we discuss later, the function approximation Q̂ is learned over an aggregated dataset D_{0:i} = {∪ D_d | d = 0 . . . i}.
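The update rule and the aggregation procedure can be sketched as follows; this is a tabular stand-in where a dictionary replaces the GMM-based regressor Q̂ (the `q_update` helper and the toy transitions are illustrative, not the paper's implementation):

```python
def q_update(q_prev, s, a, r, s_next, actions, alpha=0.1, gamma=0.8):
    """One Q-learning step (Eq. 1): q_prev stands in for the learned
    approximation Q-hat from the previous iteration (a dict here, a
    GMM regressor in the paper); unseen pairs default to Q_0 = 0."""
    best_next = max(q_prev.get((s_next, a2), 0.0) for a2 in actions)
    old = q_prev.get((s, a), 0.0)
    return old + alpha * (r + gamma * best_next - old)

# Aggregate samples across iterations (D_{0:i} = union of per-iteration data).
dataset = []
q_hat = {}  # stand-in for the GMM-based Q-hat
for (s, a, r, s_next) in [(0, 1, 1.0, 1), (1, 1, 0.0, 0), (0, 1, 1.0, 1)]:
    q_val = q_update(q_hat, s, a, r, s_next, actions=[0, 1])
    dataset.append((s, a, q_val))  # sample x = (s, a, Q_i(s, a))
    q_hat[(s, a)] = q_val          # refit step, collapsed to a dict write
```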

Exploration and Sample Collection
At every step h, for h = 1 . . . H, UCT simulates the execution of all the actions A_L ⊆ A that are "admissible" in s_h, as detailed in the next section. Specifically, each simulation executes an action a ∈ A_L, followed by K roll-outs that run an ε-greedy policy based on π_{i−1} until a terminal state is reached. The best action a*_h is selected according to

a*_h = argmax_{a ∈ A_L} ( Q̂(s_h, a) + C · e ),  with  e = √(log N(s_h) / η(s_h, a)),   (2)

where C is a constant that multiplies and controls the exploration term e, N(s_h) is the number of visits of s_h, and η(s_h, a) is the number of occurrences of a in s_h. Since we assume a discrete state space S, for continuous problems we define a similarity operator that informs the algorithm whether the difference between two states is smaller than a given threshold ξ, thus discretizing the space.
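The selection step can be sketched with a standard UCB1-style rule; `ucb_select`, the visit counter, and the handling of unvisited actions are our assumptions, not the paper's exact implementation:

```python
import math

def ucb_select(s_h, admissible, q_hat, visits, c=0.707):
    """Pick the best action at step h (Eq. 2): exploit Q-hat plus a
    UCB1-style exploration term e scaled by the constant C. `visits`
    counts eta(s_h, a); unvisited actions are tried first."""
    n_total = sum(visits.get((s_h, a), 0) for a in admissible)
    best, best_score = None, -math.inf
    for a in admissible:
        n_a = visits.get((s_h, a), 0)
        if n_a == 0:
            return a  # force at least one visit per admissible action
        e = math.sqrt(math.log(n_total) / n_a)
        score = q_hat.get((s_h, a), 0.0) + c * e
        if score > best_score:
            best, best_score = a, score
    return best

# Action 1 has a lower Q estimate but far fewer visits, so its
# exploration bonus makes it the optimistic choice.
visits = {(0, 0): 5, (0, 1): 1}
q_hat = {(0, 0): 0.2, (0, 1): 0.1}
a_star = ucb_select(0, [0, 1], q_hat, visits)
```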
During each roll-out:
- a dataset D_i of samples x = (s, a, Q_i(s, a)) is collected to improve our estimate Q̂, as detailed in the previous section;
- HI-VAL uses UCT as an expert and collects a dataset D_{π,uct} of H samples x = (s_h, a*_h, s′_h) that are selected by the tree search, similarly to DAgger [18]; D_{π,uct} is aggregated into a dataset D_{π,i} = D_{π,uct} ∪ D_{π,i−1}.
When H UCT steps have been run, D_{π,i} is used to generate the hierarchy clusters, as detailed in the next section, and to learn a new policy π_i. Our algorithm thus uses a discovery process that is supported by both the UCT search and the policy.

Hierarchical Action Selection
The hierarchical model adopted in HI-VAL builds upon the concepts of High-Level Actions (HLAs) and Reachable Sets (RSs) in HTNs [8].
HLAs are defined recursively as a sequence of action primitives and/or other HLAs. When a HLA is composed only of primitives, such a sequence is called an "implementation". Using similar concepts, we allow HI-VAL to evaluate actions at multiple levels of abstraction. Specifically, a hierarchy of actions is obtained using an agglomerative clustering algorithm, which is run on the set of next states {s′} present in D_{π,i}. Our key assumption is that similar s′ encode information about actions with similar effects, and can thus be clustered together. In particular, we refer to H as the set of layers in the action hierarchy generated by the clustering algorithm. The clustering routine organizes {s′} in clusters ŝ′ over a predefined number of layers, which are then transferred into the action space, generating a set of action clusters â organized along the same structure. Such a mapping is realized by evaluating each element contained in the ŝ′ clusters and backpropagating it to the original dataset D_{π,i}, in order to retrieve the set of actions generating the transitions to the {s′} states. In fact, dataset elements are tuples of three components encoding a transition from the current state s to the next state s′ by means of the action a. Fig. 1 shows a simplistic example of such hierarchies.
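The clustering-and-backpropagation step can be sketched as follows; the single-linkage pass over scalar next states and the `threshold` parameter are simplifications of the agglomerative clustering described above:

```python
def cluster_next_states(transitions, threshold=1.0):
    """Group next states s' whose distance is below `threshold`
    (a stand-in for one layer of agglomerative clustering), then
    backpropagate through the dataset tuples (s, a, s') to recover
    the action clusters a-hat producing each state cluster s'-hat."""
    # Single-linkage pass over scalar next states, for illustration only.
    ordered = sorted(set(s_next for (_, _, s_next) in transitions))
    clusters, current = [], [ordered[0]]
    for x in ordered[1:]:
        if x - current[-1] <= threshold:
            current.append(x)
        else:
            clusters.append(current)
            current = [x]
    clusters.append(current)
    # Map each s'-cluster back to the set of actions reaching it.
    action_clusters = []
    for c in clusters:
        acts = {a for (_, a, s_next) in transitions if s_next in c}
        action_clusters.append(acts)
    return clusters, action_clusters

# Transitions (s, a, s'): actions 0 and 1 reach nearby next states,
# action 2 reaches a distant one, so it lands in its own cluster.
data = [(0, 0, 1.0), (0, 1, 1.5), (0, 2, 5.0)]
s_clusters, a_clusters = cluster_next_states(data)
```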
In our algorithm, each action cluster â corresponds to a HLA in the layer λ ∈ H, and each layer has an associated Q_λ function, approximated as Q̂_λ. The result of choosing an action â consists of selecting a cluster of lower-level actions with a similar expectation of reaching a desired cluster ŝ′. Noticeably, such a model intrinsically connects to the concept of reachable sets. Clusters ŝ′ are in fact an approximation of optimistic sets RS+, and they make it possible to evaluate together actions â that lead to more rewarding states.
HI-VAL uses each Q_λ in s to select the set A_L of "admissible" actions for UCT. Intuitively, a primitive action a is admissible in s if, for each layer λ, a belongs to the cluster â_λ selected according to Q_λ. More formally:

A_L = { a ∈ A | ∀λ ∈ H : a ∈ â_λ },  with  â_λ = argmax_â ( Q̂_λ(s, â) + δ^λ_{s,â} ),   (3)

where the exploration bonus δ^λ_{s,â} is derived from σ², the variance of the regression approximation [1]. Through δ^λ_{s,â}, the prediction error for each action abstraction is captured, leading to a more directed exploration of the action hierarchy. Action primitives are finally chosen and executed by UCT according to the lowest-level Q̂, as detailed in the previous section. To obtain a less biased exploration and avoid value function over-fitting, inadmissible actions are nonetheless expanded and selected by UCT with a 30% probability.
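The admissibility test can be sketched as follows; the `(Q̂, δ)` pairs and the per-layer argmax are a minimal reading of Eq. 3, with δ standing in for the confidence bonus derived from the regressor's variance:

```python
def admissible_actions(s, hierarchy, q_hat_layers, primitives):
    """Compute A_L (Eq. 3): a primitive a is admissible in s if, at
    every layer, it belongs to the cluster with the highest optimistic
    value Q-hat + delta. `hierarchy` lists the clusters per layer;
    `q_hat_layers` maps (state, cluster index) -> (Q-hat, delta)."""
    allowed = set(primitives)
    for layer, q_layer in zip(hierarchy, q_hat_layers):
        # Pick the cluster maximising the optimistic value in s.
        best = max(range(len(layer)),
                   key=lambda k: q_layer[(s, k)][0] + q_layer[(s, k)][1])
        allowed &= layer[best]
    return allowed

# One layer with two clusters {0, 1} and {2}; cluster 0 wins
# optimistically (0.5 + 0.1 > 0.2 + 0.2), so action 2 is pruned.
hierarchy = [[{0, 1}, {2}]]
q_layers = [{(0, 0): (0.5, 0.1), (0, 1): (0.2, 0.2)}]
A_L = admissible_actions(0, hierarchy, q_layers, primitives=[0, 1, 2])
```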
[Alg. 1: HI-VAL pseudocode. At each iteration i, the classifier π_i is trained on D_{π,i}; after N iterations, π_N is returned.]

HI-VAL Algorithm
The goal of HI-VAL consists of iteratively updating each layer's value function approximation Q̂_λ, to generate a policy π_i that maximizes the expected reward of the agent. The underlying insight of HI-VAL is that, while exploring the search space, collected state-action pairs are used at each iteration i to (1) update the approximated Q functions for refining the policy π_{i−1} into a policy π_i, and (2) influence UCT exploration through the Q-values in accordance with Eq. 3. The complete HI-VAL algorithm, described in Alg. 1, proceeds for each iteration as follows:
1. Roll-in. The agent follows the previous policy π_{i−1} and generates a set of states s_t for T timesteps.
2. UCT search. For each of the generated states s_t, HI-VAL runs a UCT search with horizon H. At each step h, UCT simulates the execution of every "admissible" action in the set A_L, computed according to Eq. 3. For each action a ∈ A_L, a simulation consists of the execution of a, followed by K ε-greedy roll-outs based on π_{i−1}, which are used to estimate the Q-values of each visited state. Finally, for each step, the best action a*_h is (1) chosen according to Eq. 2 and (2) aggregated into a dataset D_uct together with s_h and s_{h+1}. It is worth remarking that a vanilla implementation of the UCT search evaluates all possible actions and explores a significant amount of states to generate an effective policy. Our approach, instead, leverages the hierarchical structure of H to generate a restricted subset of "admissible" actions with high estimated Q-values. This efficiently reduces the exploration phase by guiding the algorithm to discard actions that are not expected to improve π_{i−1}.
3. Hierarchical data aggregation and Q̂ update. After UCT, new data D_uct = {(s_{t+h}, a*_{t+h}, s_{t+h+1}) | h = 1 . . . H} is available to be aggregated into a larger dataset D_{π,i}. This dataset is used to generate the clusters ŝ, â and ŝ′ in two steps: first, the sets of states {s} and {s′} in D_{π,i} are separately clustered within λ layers; then, the hierarchy of next-state clusters ŝ′ is transferred into the action space to generate the action clusters â, each of them corresponding to a high-level action. In order to correctly update the Q̂_λ estimation for every â, samples of the form x_λ = (ŝ, â, Q_λ(ŝ, â)) are generated for each layer λ, with Q_λ(ŝ, â) determined as in Eq. 1. Such samples are then (1) aggregated into a dataset D_{λ,0:i} and (2) used to improve the estimate Q̂_λ, as described in previous sections. Specifically, Q̂_λ is updated for each (ŝ, â) containing the state-action pairs (s_{t+h}, a*_{t+h}) ∈ D_uct, by using the averaged reward R(s_{t+h}, a*_{t+h}) of the corresponding state-action pairs in the clusters (ŝ, â).
4. Training. Once data aggregation has been performed, a new policy π_i is trained from the dataset D_{π,i} (e.g., using a classifier).
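The four steps above can be condensed into a skeleton loop; the five callables are hypothetical hooks standing in for the paper's components, wired here with toy stand-ins:

```python
def hival(n_iters, roll_in, uct_search, aggregate, update_q, train):
    """Skeleton of one HI-VAL run (Alg. 1): each iteration rolls in
    with pi_{i-1}, runs UCT from the visited states, aggregates the
    searched transitions, refreshes the hierarchical Q estimates, and
    retrains the policy. The hooks are illustrative, not the paper's API."""
    policy, dataset = None, []
    for i in range(n_iters):
        states = roll_in(policy)                # 1. roll-in for T timesteps
        new_data = uct_search(states)           # 2. UCT over admissible actions
        dataset = aggregate(dataset, new_data)  # 3a. data aggregation
        update_q(dataset)                       # 3b. refresh each Q-hat_lambda
        policy = train(dataset)                 # 4. train classifier pi_i
    return policy

# Toy hooks: the "policy" is just the size of the dataset seen so far.
trace = []
policy = hival(
    n_iters=3,
    roll_in=lambda pi: [0],
    uct_search=lambda states: [(s, 0, s + 1) for s in states],
    aggregate=lambda d, new: d + new,
    update_q=lambda d: trace.append(len(d)),
    train=lambda d: len(d),
)
```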

Experimental Evaluation
We evaluate our approach in generating a policy for executing a "fetching" task, and a "pick and delivery" task in a simple environment [7], with a reduced number of state-action pairs explored by UCT. We compare our results with random-UCT and vanilla-UCT implementations, the TD-search [21] algorithm, and different configurations of the HI-VAL action hierarchy. We will refer to VAL as a basic implementation of the action hierarchy, composed solely of a single layer of primitive actions. The random-UCT algorithm selects a random action at each step h of the Monte Carlo search, while in vanilla-UCT all actions are considered "admissible" at each step. In all algorithms, we implement a shaped reward function that computes reward values at each visited state. We deploy these algorithms within a simulated environment on a 7-DOF KUKA lightweight arm and on a Pioneer robot. Experiments have been conducted within the V-REP simulation environment, using a single Intel i7-5700HQ core, with CPU@2.70GHz and 16GB of RAM. For both scenarios, the UCT search has been configured as follows: (1) search horizon H = 4; (2) exploration constant C = 0.707; (3) K = 3 roll-outs. The number of components of the GMMs is evaluated according to the BIC criterion, which has been tested using up to 6 Gaussians. Q-values are updated with a learning rate α = 0.1 and a discount factor γ = 0.8. The algorithm is run for T = 15 timesteps at each iteration. Moreover, stochastic actions are induced by randomizing the outcome of an action with a 5% probability.

Fetching task
In this scenario the state of the problem is represented as a 7-feature vector, where 3 components represent the distance of the robot end-effector to the target, 3 components encode the distance to an obstacle introduced in the scene, and the last component is the angle difference between the end-effector and the world Z axis. We include this component to bias the agent in planning to fetch an object with a preferable orientation. The reward function is a weighted sum of such components, and it is designed to promote states that are far from the obstacle and close to the target position. Additionally, it penalizes states in which the end-effector does not point upwards, to simulate objects that have to be held with a preferred orientation (e.g., a glass full of water). The robot explores an action space composed of 13 actions: 6 translation actions to move the arm back and forth along the Cartesian axes, and 6 rotation actions to move the arm counter-/clockwise on the Roll, Pitch and Yaw angles. A no-op action is introduced to let the robot remain in its current state. Fig. 2 and Fig. 3 illustrate the obtained results by reporting the average cumulative reward and the number of explored states obtained during 10 iterations. In detail, reward values are averaged over 10 simulated fetching trials for each of the iterations and for each algorithm; the continuous lines represent average cumulative rewards, while the line width represents their standard deviation. In the explored-state plots, the gray top part of each bar highlights the amount of states expanded during i with respect to the total number of states explored until i − 1. While the baseline algorithms perform worse in terms of obtained rewards (random-UCT, TD-search), only vanilla-UCT shows results that are comparable to HI-VAL. However, the number of explored states of vanilla-UCT is significantly higher. Specifically, the naive implementation of UCT evaluates more than twice the number of states explored by HI-VAL (∼55%).
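The reward described above can be sketched as a weighted sum; the weights, the obstacle saturation, and the feature handling are illustrative assumptions, not the paper's exact values:

```python
import math

def fetch_reward(ee_to_target, ee_to_obstacle, axis_angle,
                 w_target=1.0, w_obstacle=0.5, w_orient=0.3):
    """Shaped reward for the fetching task, as a weighted sum of the
    state features: closeness to the target is rewarded, closeness to
    the obstacle and deviation from the upward orientation are
    penalised. Weights are illustrative, not the paper's values."""
    d_target = math.sqrt(sum(x * x for x in ee_to_target))
    d_obstacle = math.sqrt(sum(x * x for x in ee_to_obstacle))
    return (-w_target * d_target
            + w_obstacle * min(d_obstacle, 1.0)  # saturate far obstacles
            - w_orient * abs(axis_angle))

# Near the target, far from the obstacle, pointing upwards...
r_good = fetch_reward((0.0, 0.0, 0.1), (1.0, 0.0, 0.0), 0.0)
# ...versus far from the target, grazing the obstacle, tilted.
r_bad = fetch_reward((1.0, 1.0, 1.0), (0.05, 0.0, 0.0), 1.5)
```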
In fact, HI-VAL approximates the optimistic set of an HTN by evaluating only "admissible" actions that are expected to lead the search towards states with high reward. This is achieved by exploiting the action hierarchy updated at each iteration. Moreover, the results compare two different configurations of HI-VAL. The first, VAL, is organized as a single-layer structure, where the number of clusters within the layer is equal to the number of primitive actions, while the second, HI-VAL, is organized over 2 layers, where the first layer also contains the set of primitive actions and the second layer groups actions into 5 clusters. Again, HI-VAL further reduces the number of explored states and confirms that a hierarchical evaluation of the search space improves the learning process. In this scenario, we do not notice a significant improvement between HI-VAL and VAL since, differently from the "pick and delivery" task, the structure of "fetching" is not hierarchical. However, we aim at showing that increasing the number of layers in the representation, even when that is not needed, does not damage the obtained performance and still slightly decreases the number of visited states.

Pick and delivery task
Here the environment is represented as a 5x5 grid-world where the Pioneer has to collect an object at a random location and carry it to an operator. The scenario resembles the one addressed by the "taxi-agent" in [7]; however, a comparison with MAX-Q would not be proper, since our reward is implemented to be shaped rather than sparse, and we implement our approach in a robotic context where the reduced number of samples and iterations are limiting constraints. Here, the state is a 9-feature vector where the first two components represent the position of the robot, the following two encode the current target of the robot (either the object station or the delivery one), the fifth component indicates whether the object is picked, and the last four components indicate whether there is an obstacle in one of the four possible directions (e.g., a wall). The action space of the agent is composed of 6 actions: four to move through cells, and two to pick and drop the object. Moreover, we assume that the robot is helped to collect and drop the objects by an external operator. The reward function is a weighted sum of two components encoding the distance of the robot to the target object and the distance of the object to its delivery station. This task represents a more complex scenario due to the temporal constraints imposed by the status of the object, but as in the previous task, a similar analysis of the results can be observed. Fig. 4 and Fig. 5 illustrate the average cumulative reward and the number of explored states obtained during 49 iterations. Also in this case, vanilla-UCT achieves comparable reward values, but the number of explored states is still significantly higher than in each configuration of HI-VAL. The action hierarchy of HI-VAL, in fact, improves the overall performance, showing the best results with a 3-layer structure.
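The shaped reward for this task can be sketched with a Manhattan-distance version; the metric and the weights are illustrative choices, not the paper's exact implementation:

```python
def delivery_reward(robot, obj, station, carrying,
                    w_pick=1.0, w_drop=1.0):
    """Shaped reward for the pick-and-delivery grid-world: a weighted
    sum of the Manhattan distance robot -> object (until picked) and
    object -> delivery station, negated so that shorter is better."""
    def manhattan(p, q):
        return abs(p[0] - q[0]) + abs(p[1] - q[1])
    d_pick = 0 if carrying else manhattan(robot, obj)
    obj_pos = robot if carrying else obj  # carried object moves with the robot
    d_drop = manhattan(obj_pos, station)
    return -(w_pick * d_pick + w_drop * d_drop)

# Carrying the object next to the station beats standing far from it.
r_near = delivery_reward(robot=(4, 4), obj=(4, 4),
                         station=(4, 3), carrying=True)
r_far = delivery_reward(robot=(0, 0), obj=(4, 4),
                        station=(4, 3), carrying=False)
```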
Particularly in these complex scenarios, where ordering constraints exist and the task can be decomposed, HI-VAL performs better and confirms that a multi-layered representation of action semantics improves the exploration of the search space.

Conclusion
In this paper we introduced HI-VAL, an iterative algorithm that learns hierarchical value functions for policy generation. We discussed its key features and described how it improves search space exploration in order to generate efficient policies. The results of our experimental evaluation show the efficacy of HI-VAL in enabling the agent to learn a good policy by evaluating a significantly lower number of states. Moreover, HI-VAL is general and can be used to solve different tasks in multiple domains. Finally, we are investigating different directions to further improve HI-VAL, such as a proper formulation for continuous problems. Moreover, we want to explore the possibility of transferring hierarchical value functions among learning agents, in order to take advantage of abstract actions and their semantics in different tasks.