RL_Paper_02:DQN

Human-level control through deep reinforcement learning

Paper Source
Journal: Nature
Year: 2015
Institute: DeepMind
Authors: Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al.
#Deep Reinforcement Learning (DRL)

Abstract

To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms.

Reinforcement learning is known to be unstable or even to diverge when a nonlinear function approximator such as a neural network is used to represent the action-value (also known as $Q$) function. This instability has several causes: the correlations present in the sequence of observations, the fact that small updates to $Q$ may significantly change the policy and therefore change the data distribution, and the correlations between the action-values ($Q$) and the target values $r + \gamma \max_{a'} Q(s', a')$.
The Deep Q-Network (DQN) uses two techniques to reduce these correlations: experience replay and a periodically updated target network.

1.INTRODUCTION
They parameterize an approximate action-value function $Q(s, a; \theta_i)$ with a deep convolutional neural network, where $\theta_i$ are the network weights at iteration $i$.
Experience replay
Store the agent's experiences $e_t = (s_t, a_t, r_t, s_{t+1})$ in a replay buffer $D_t = \{e_1, \dots, e_t\}$. During learning, Q-learning updates are applied on samples (or minibatches) of experience $(s, a, r, s') \sim U(D)$ drawn uniformly at random from the stored pool.
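A minimal sketch of such a buffer in Python (the paper keeps the one million most recent experiences and samples minibatches of size 32; the class and field names below are illustrative, not from the paper):

```python
import random
from collections import deque, namedtuple

# One stored experience e_t = (s_t, a_t, r_t, s_{t+1}); `done` marks terminal states.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-size memory D holding the most recent transitions."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are discarded automatically

    def store(self, state, action, reward, next_state, done):
        self.buffer.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # (s, a, r, s') ~ U(D): a minibatch drawn uniformly at random
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```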
Loss Function:

$L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_i^{-}) - Q(s, a; \theta_i)\right)^2\right]$

where $\theta_i^{-}$ are the target network parameters described below.
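A sketch of this loss for one sampled minibatch, assuming PyTorch and two hypothetical networks `q_net` ($\theta_i$) and `target_net` ($\theta_i^{-}$); the paper additionally clips the error term to $[-1, 1]$, which is omitted here:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Mean squared TD error: (r + gamma * max_a' Q(s', a'; theta^-) - Q(s, a; theta))^2."""
    states, actions, rewards, next_states, dones = batch  # tensors stacked from sampled transitions

    # Q(s, a; theta_i): value of the action actually taken in each sampled state
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # r + gamma * max_a' Q(s', a'; theta_i^-): target computed with the frozen target network
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)  # no bootstrap at terminal states

    return F.mse_loss(q_values, targets)
```

Here `gamma=0.99` is the discount factor used in the paper.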
Target network
To mimic supervised learning, which requires a fixed labeled target, a separate target network is used to generate the Q-learning targets.
The target network parameters $\theta_i^{-}$ are only updated with the Q-network parameters $\theta_i$ every C steps and are held fixed between individual updates.
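In code, this amounts to a hard copy of the online network's parameters every C gradient steps; a small sketch under the same PyTorch assumption (C = 10,000 matches the paper's target-network update frequency):

```python
import copy

def make_target_network(q_net):
    """Initialize theta^- as a frozen copy of the online parameters theta."""
    target_net = copy.deepcopy(q_net)
    for p in target_net.parameters():
        p.requires_grad_(False)  # the target network itself is never trained
    return target_net

def maybe_sync_target(q_net, target_net, step, C=10_000):
    """Hard update theta^- <- theta every C steps; held fixed in between."""
    if step % C == 0:
        target_net.load_state_dict(q_net.state_dict())
```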
2.METHODS
Preprocessing
Rescale the m = 4 most recent frames (RGB -> grayscale; 210x160 -> 84x84) and stack them to form the network input.
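A simplified sketch of this preprocessing with OpenCV and NumPy (the paper also takes the pixel-wise maximum over pairs of consecutive frames to remove flicker, which is omitted here; the helper names are illustrative):

```python
import collections
import cv2
import numpy as np

def preprocess_frame(frame_rgb):
    """210x160 RGB Atari frame -> 84x84 grayscale image."""
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA).astype(np.uint8)

class FrameStack:
    """Keep the m = 4 most recent preprocessed frames as the 84x84x4 network input."""

    def __init__(self, m=4):
        self.frames = collections.deque(maxlen=m)

    def push(self, frame_rgb):
        self.frames.append(preprocess_frame(frame_rgb))
        while len(self.frames) < self.frames.maxlen:
            # Pad with copies of the latest frame at episode start (an implementation choice, not from the paper)
            self.frames.append(self.frames[-1])
        return np.stack(self.frames, axis=-1)  # shape (84, 84, 4)
```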
Model architecture
Input Size: 84x84x4
First Hidden Layer: Conv layer: 32 filters, size 8x8, stride 4, ReLU
Second Hidden Layer: Conv layer: 64 filters, size 4x4, stride 2, ReLU
Third Hidden Layer: Conv layer: 64 filters, size 3x3, stride 1, ReLU
Final Hidden Layer: Fully-connected layer with 512 rectifier (ReLU) units
Output Layer: Fully-connected linear layer with a single output for each valid action (output size varies between 4 and 18 depending on the game)
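The same architecture written as a PyTorch sketch (the original implementation used Torch; `n_actions` is game-dependent):

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Convolutional Q-network mapping an 84x84x4 state to one Q-value per valid action."""

    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # first hidden layer: 32 filters of 8x8, stride 4
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),  # second hidden layer: 64 filters of 4x4, stride 2
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),  # third hidden layer: 64 filters of 3x3, stride 1
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),                  # final hidden layer: 512 rectifier units
            nn.ReLU(),
            nn.Linear(512, n_actions),                   # linear output: one Q-value per action (4 to 18)
        )

    def forward(self, x):
        # x: batch of stacked frames, shape (N, 4, 84, 84), channels-first as PyTorch expects
        return self.net(x)

# Example: a game with 18 valid actions
q_net = DQN(n_actions=18)
print(q_net(torch.zeros(1, 4, 84, 84)).shape)  # torch.Size([1, 18])
```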