RL_Paper_01

Aerobatics Control of Flying Creatures via Self-Regulated Learning

Paper Source
Journal: ACM Transactions on Graphics
Year: 2018
Institute: Seoul National University, South Korea
Authors: JUNGDAM WON, JUNGNAM PARK, JEHEE LEE*
#Physics-Based Controller #Deep Reinforcement Learning(DRL) #Self-Regulated Learning

Abstract

This paper proposes Self-Regulated Learning (SRL), which is combined with DRL to address the aerobatics control problem. The key idea of SRL is to allow the agent to take control over its own learning using an additional self-regulation policy. This policy allows the agent to regulate its goals according to the capability of the current control policy. The control and self-regulation policies are learned jointly as learning progresses. Self-regulated learning can be viewed as the agent building its own curriculum and seeking a compromise on its goals.

1.INTRODUCTION
Defining a reward for taking an action is the primary means by which the user can influence the control policy. The reward is a succinct description of the task. The choice of the reward also affects the performance of the controller and the progress of its learning.

We consider a class of problems in which the main goal can be achieved by generating a sequence of subgoals and addressing each individual subgoal sequentially.

2.RELATED WORK
Optimality principles played an important role in popularizing optimal control theory and nonlinear/non-convex optimization methods in character animation. We can classify control policies (a.k.a. controllers) into immediate control and predictive control, depending on how far they look ahead into the future.

Optimality principles: The principle of optimality is the basic principle of dynamic programming, developed by Richard Bellman: an optimal path has the property that, whatever the initial conditions and control variables (choices) over some initial period, the controls (or decisions) chosen over the remaining period must be optimal for the remaining subproblem, with the state resulting from the early decisions taken as the initial condition.
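In RL notation, this is the standard Bellman optimality equation (generic form, not specific to this paper): $V^*(s) = \max_a \big[ r(s, a) + \gamma \, \mathbb{E}_{s'}[V^*(s')] \big]$, where $s'$ is the next state reached by taking action $a$ in state $s$ and $\gamma$ is the discount factor.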

Immediate control policy: A direct mapping from states to actions.
Model predictive control: The key concept is to predict the future evolution of the dynamic system over a short time horizon and optimize its control signals. Model predictive control repeats this prediction-and-optimization step while receding the time horizon.
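A minimal sketch of this receding-horizon loop, assuming a toy linear `simulate` model and a quadratic `cost` (both made-up placeholders, not from the paper):

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-ins: a linear dynamics model and a quadratic cost toward the origin.
def simulate(state, control):
    return state + 0.1 * control                  # advance the system one step

def cost(state, control):
    return float(state @ state + 0.01 * control @ control)

def mpc_step(state, horizon=10, dim=2):
    """Optimize controls over a short horizon; return only the first control."""
    def total_cost(flat):
        s, c = state.copy(), 0.0
        for u in flat.reshape(horizon, dim):
            c += cost(s, u)
            s = simulate(s, u)
        return c
    res = minimize(total_cost, np.zeros(horizon * dim), method="Powell")
    return res.x.reshape(horizon, dim)[0]         # re-plan again at the next step

state = np.array([1.0, -0.5])
for _ in range(5):                                # repeat predict-and-optimize
    state = simulate(state, mpc_step(state))
```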

Recently, Deep Reinforcement Learning (DRL) has shown its potential in the simulation and control of virtual characters. DRL for continuous control (especially the actor-critic framework [survey]) has the advantages of both immediate control and predictive control.

SRL assumes that the reward function provided initially is not ideal for achieving the main goal, and thus allows the agent to transform the reward function adaptively. The underlying motivation of SRL is closely related to automatic curriculum generation for RL agents.

3.ENVIRONMENT
This section introduces some basic definitions and symbols for physics-based control. Understanding these concepts is enough to follow the algorithms below without trouble.

Trajectory $C(\sigma) = (R(\sigma), p(\sigma), h(\sigma))$

Progress Parameter $\sigma \in [0, 1]$
Orientation $R(\sigma) \in SO(3)$, the group of 3D rotation matrices
Position $p(\sigma) \in R^3$
Clearance Threshold $h(\sigma)$

Distance $d(R, p, \sigma^*)$: if $d(R, p, \sigma^*) < h(\sigma^*)$, then $C(\sigma^*)$ is cleared.
Unit tangent vector $t(\sigma) = \frac{\dot{p}(\sigma)}{\|\dot{p}(\sigma)\|}$
Up-Vector $u = [0, 1, 0]$
Initial Orientation $R(0) = [r^T_x, r^T_y, r^T_z] \in SO(3)$ is defined by an orthogonal frame such that

$r_z = t(0)$,
$r_x = \frac{u×r_z}{||u×r_z||}$
$r_y = r_z × r_x$.

Trajectory Rotation $R(\sigma) = R(\sigma − \epsilon)U(t(\sigma − \epsilon), t(\sigma))$
Minimal Rotation between two vectors a and b
$U(a, b) = I + [a × b]_× + [a × b]^2_× \frac{1 − a·b}{(a × b)^T(a × b)}$
$[v]_×$ is the skew-symmetric cross-product matrix of $v$.
Relaxed Clearance Threshold $h(\sigma) = \bar{h}(1 + w_h ||\ddot{p}(\sigma)||)$
Reward definition
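A small numerical sketch of the geometric quantities above (my own illustration; the example trajectory p(σ) and the finite-difference derivatives are assumptions, not from the paper):

```python
import numpy as np

def p(sigma):
    # Hypothetical example trajectory: a rising circular loop.
    return np.array([np.cos(2 * np.pi * sigma), sigma, np.sin(2 * np.pi * sigma)])

def tangent(sigma, eps=1e-4):
    # Unit tangent t(sigma) = p'(sigma) / ||p'(sigma)||, via finite differences.
    v = (p(sigma + eps) - p(sigma - eps)) / (2 * eps)
    return v / np.linalg.norm(v)

def skew(v):
    # [v]_x, the skew-symmetric cross-product matrix of v.
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def minimal_rotation(a, b):
    # U(a, b): smallest rotation aligning unit vector a with unit vector b
    # (assumes a and b are not parallel or anti-parallel).
    c = np.cross(a, b)
    K = skew(c)
    return np.eye(3) + K + K @ K * (1 - a @ b) / (c @ c)

def initial_orientation(u=np.array([0.0, 1.0, 0.0])):
    # R(0): orthogonal frame built from the up-vector u and the tangent t(0).
    r_z = tangent(0.0)
    r_x = np.cross(u, r_z) / np.linalg.norm(np.cross(u, r_z))
    r_y = np.cross(r_z, r_x)
    return np.column_stack([r_x, r_y, r_z])

def clearance(sigma, h_bar=0.3, w_h=0.1, eps=1e-3):
    # h(sigma) = h_bar * (1 + w_h * ||p''(sigma)||): looser where curvature is high.
    acc = (p(sigma + eps) - 2 * p(sigma) + p(sigma - eps)) / eps**2
    return h_bar * (1 + w_h * np.linalg.norm(acc))
```

A subgoal $C(\sigma^*)$ would then count as cleared whenever the measured distance falls below `clearance(sigma_star)`.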

4.ALGORITHM
DRL Algorithm
The algorithm is similar to the actor-critic framework: it also has two deep neural networks. One, called $Q$, represents the state-action value function; it takes a state-action pair $(s, a)$ as input and returns the expected cumulative reward. The other is the deterministic policy $\pi$, which takes a state $s$ as input and generates an action $a$.
It consists of two parts:

  • Generating a sequence of $e_i = (s_{i−1}, a_i, r_i, s_i)$ tuples to simulate the given trajectory.
  • Using a single sample from the former sequence to update the state-action value function and the policy function repeatedly (a sketch of both parts follows below).
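A minimal sketch of this two-part loop in the style of deterministic actor-critic methods (DDPG-like), not the paper's exact implementation; `env`, the network sizes, and the hyperparameters are placeholders:

```python
import random
import numpy as np
import torch
import torch.nn as nn

state_dim, action_dim, gamma = 16, 4, 0.99                 # placeholder sizes
pi = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                   nn.Linear(64, action_dim), nn.Tanh())   # deterministic policy pi(s) -> a
Q = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                  nn.Linear(64, 1))                        # state-action value Q(s, a)
opt_pi = torch.optim.Adam(pi.parameters(), lr=1e-4)
opt_q = torch.optim.Adam(Q.parameters(), lr=1e-3)
buffer = []                                                # stores tuples e_i

def collect_episode(env, noise=0.1):
    # Part 1: simulate the trajectory, producing e_i = (s_{i-1}, a_i, r_i, s_i).
    s, done = env.reset(), False
    while not done:
        a = pi(torch.as_tensor(s, dtype=torch.float32)).detach().numpy()
        a = a + noise * np.random.randn(action_dim)        # exploration noise
        s_next, r, done = env.step(a)
        buffer.append((s, a, r, s_next))
        s = s_next

def update(batch_size=64):
    # Part 2: update Q and pi from sampled experience tuples.
    batch = random.sample(buffer, batch_size)
    s, a, r, s_next = (torch.as_tensor(np.array(x), dtype=torch.float32)
                       for x in zip(*batch))
    with torch.no_grad():                                  # one-step TD target
        target = r.unsqueeze(1) + gamma * Q(torch.cat([s_next, pi(s_next)], dim=1))
    q_loss = ((Q(torch.cat([s, a], dim=1)) - target) ** 2).mean()
    opt_q.zero_grad(); q_loss.backward(); opt_q.step()
    pi_loss = -Q(torch.cat([s, pi(s)], dim=1)).mean()      # ascend Q along the policy
    opt_pi.zero_grad(); pi_loss.backward(); opt_pi.step()
```

A full implementation would also use target networks, terminal-state masking, and a bounded replay buffer; they are omitted here for brevity.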

Self-Regulated DRL
Intuitively speaking, the agent senses its body state $s_d$ and a part of the trajectory $(\sigma, s_s)$, and decides how to act and how to regulate the current subgoal $C(\sigma)$ simultaneously while avoiding excessive deviation from the input trajectory.
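A schematic sketch of such a two-headed policy (my own simplification: the regulation head here just emits a small bounded offset for the subgoal, which is not the paper's exact parameterization, and all dimensions are placeholders):

```python
import torch
import torch.nn as nn

class SelfRegulatedPolicy(nn.Module):
    """Two-headed policy: one head decides how to act, the other how to
    regulate the current subgoal C(sigma). The offset parameterization is a
    simplification for illustration only."""
    def __init__(self, body_dim, traj_dim, action_dim, subgoal_dim=3, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(body_dim + 1 + traj_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.action_head = nn.Linear(hidden, action_dim)    # how to act
        self.regulate_head = nn.Linear(hidden, subgoal_dim) # how to adjust C(sigma)

    def forward(self, s_d, sigma, s_s):
        # Inputs: body state s_d, progress parameter sigma, trajectory feature s_s.
        h = self.trunk(torch.cat([s_d, sigma, s_s], dim=-1))
        action = torch.tanh(self.action_head(h))
        # Small bounded offset keeps the regulated subgoal near the input trajectory.
        subgoal_offset = 0.1 * torch.tanh(self.regulate_head(h))
        return action, subgoal_offset

# Example: a batch of one body state, progress parameter, and trajectory feature.
policy = SelfRegulatedPolicy(body_dim=16, traj_dim=9, action_dim=4)
a, dg = policy(torch.zeros(1, 16), torch.zeros(1, 1), torch.zeros(1, 9))
```

Bounding the regulation output is what prevents excessive deviation from the input trajectory while still letting the agent soften subgoals it cannot yet reach.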

Experts in related fields
Jason Peng
Libin Liu