Autonomous Ball-Collecting Robot using Reinforcement Learning
In this project we used reinforcement learning to train a robot to collect balls in a simple environment. The robot is a mobile platform with four Mecanum wheels, equipped with a camera and a 2D LiDAR sensor. The algorithms ran on an Intel NUC and an Nvidia Jetson TX2 board.
This is my attempt at re-writing the write-up: recruiters often ask me how the project was done and I sometimes forget the details (it was done in 2018).
Optimal Policy and Optimal Value Functions
Bellman optimality equation
For the optimal state-value function $v_{*}$:
$$
\begin{aligned}
v_{*}(s) &= \max_{a} E \left[ R_{t+1} + \gamma v_{*}(S_{t+1}) \vert S_{t}=s, A_{t}=a \right] \\
&= \max_{a} \sum_{s',r} p(s',r \vert s,a) \left[ r + \gamma v_{*}(s') \right]
\end{aligned}
$$
For the optimal action-value function $q_{*}$:
$$
\begin{aligned}
q_{*}(s,a) &= E \left[ R_{t+1} + \gamma \max_{a'} q_{*}(S_{t+1}, a') \vert S_{t} = s, A_{t} = a \right] \\
&= \sum_{s',r} p(s',r \vert s,a) \left[ r + \gamma \max_{a'} q_{*}(s',a') \right]
\end{aligned}
$$
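To make the recursion concrete, here is a small value-iteration sketch that repeatedly applies the Bellman optimality backup until it reaches a fixed point. The two-state MDP below is made up purely for illustration, and the expected-reward matrix `R[s, a]` stands in for the sum over $s', r$ of $p(s',r \vert s,a)\, r$ in the equation above:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP:
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

v = np.zeros(2)
for _ in range(1000):
    # Bellman optimality backup: v(s) = max_a [ R(s,a) + gamma * sum_s' P(s,a,s') v(s') ]
    v = np.max(R + gamma * P @ v, axis=1)

print(v)  # the fixed point approximates v_*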
Monte Carlo Methods
Monte Carlo methods only require samples of states, actions, and rewards from interaction between the agent and the environment. They are model-free: the transition probabilities $p(s' \vert s,a)$ need not be known.
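As an illustration of learning from samples alone, here is a sketch of an every-visit Monte Carlo update that estimates $Q(s,a)$ as the average return observed after taking $a$ in $s$. The episode format (a list of $(s, a, r)$ tuples) is an assumption made for this sketch:

```python
from collections import defaultdict

def mc_update(Q, counts, episode, gamma=0.9):
    """Every-visit Monte Carlo: move Q(s,a) toward the observed return G."""
    G = 0.0
    for s, a, r in reversed(episode):                 # episode = [(s, a, r), ...]
        G = r + gamma * G                             # return following (s, a)
        counts[(s, a)] += 1
        Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]  # running average of returns

Q, counts = defaultdict(float), defaultdict(int)
```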
Q-Learning
$$ Q(s,a) \leftarrow (1-\alpha)Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') \right] $$
```
Initialize Q(s, a) arbitrarily
Start with state s
Loop:
    Select action a with ε-greedy w.r.t. Q(s, ·)
    Execute action a, receive immediate reward r and observe the next state s'
    Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') - Q(s, a) ]
    s ← s'
```
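For concreteness, here is a minimal, self-contained Python version of that loop on a toy 1-D "walk to the ball" environment. The environment, rewards, and hyperparameters are made up for illustration only and are not the ones used on the robot:

```python
import numpy as np

# Toy environment (hypothetical): agent on a line of 6 cells, ball at the last cell.
# Actions: 0 = move left, 1 = move right. Reward +1 for reaching the ball, 0 otherwise.
N_STATES, N_ACTIONS = 6, 2
alpha, gamma, eps = 0.1, 0.9, 0.1

def step(s, a):
    s_next = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done

Q = np.zeros((N_STATES, N_ACTIONS))                # initialize Q(s, a)
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0                                          # start with state s
    done = False
    while not done:
        # select action a with epsilon-greedy
        a = rng.integers(N_ACTIONS) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)               # execute a, receive r, go to s'
        # tabular Q-learning update
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.argmax(Q, axis=1))                        # greedy policy: should prefer "move right"
```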
But what if the number of $(s,a)$ pairs is very large? Then it is not practical to keep a table entry $Q(s,a)$ for every pair. Instead, we can represent $Q(s,a)$ as a parameterized function $Q(s,a \vert \theta)$. This is where the neural network comes into play.
Q-Network
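A minimal sketch of what such a parameterized Q-function could look like. PyTorch and the fully connected architecture below are assumptions made purely for illustration; the framework, architecture, and input features actually used on the robot are not described in this write-up:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)                     # shape: (batch, n_actions)
```

Action selection is still an argmax over Q-values, exactly as with the table; only the table lookup is replaced by a forward pass.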
Training
We define the loss as:
$$ L(\theta) = \left( \left( r + \gamma \max_{a'} Q(s', a' \vert \theta) \right) - Q(s,a \vert \theta) \right)^2 $$
Our objective is to find weight $\theta$ of the network to minimize the loss:
$$ \min_{\theta} L(\theta) = \min_{\theta} \left[ \left( r + \gamma \max_{a'} Q(s', a' \vert \theta) - Q(s,a \vert \theta) \right)^2 \right] $$
where $r$ is the immediate reward that we observe and the term $r + \gamma \max_{a'} Q(s', a' \vert \theta)$ is the approximated target.
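A sketch of one gradient step on this loss, assuming the hypothetical `QNetwork` above and a minibatch of transitions already collected as tensors. The target term is treated as a constant (no gradient flows through it), which matches the idea of an "approximated target"; the `(1 - done)` factor is a standard practical detail, not shown in the formula, that zeroes the bootstrap term at episode ends:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, batch, gamma=0.99):
    """One-step TD loss; batch holds tensors s, a (int64), r, s_next, done (float)."""
    # Q(s, a | theta) for the actions that were actually taken
    q_sa = q_net(batch["s"]).gather(1, batch["a"].unsqueeze(1)).squeeze(1)
    # approximated target r + gamma * max_a' Q(s', a' | theta), treated as a constant
    with torch.no_grad():
        target = batch["r"] + gamma * q_net(batch["s_next"]).max(dim=1).values * (1 - batch["done"])
    return F.mse_loss(q_sa, target)

# usage (hypothetical):
# optimizer.zero_grad(); loss = dqn_loss(q_net, batch); loss.backward(); optimizer.step()
```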
Convergence
However, using a neural network to represent the action-value function tends to be unstable due to:
- Correlations between samples
- Non-stationary targets
How does DeepMind (Mnih et al., 2015) solve this issue?
- Experience replay
- Separated networks
- Go deeper (added by myself)
Experience Replay
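Experience replay stores transitions $(s, a, r, s')$ in a buffer and trains on random minibatches drawn from it, which breaks the correlation between consecutive samples. A minimal sketch; the capacity and batch size below are placeholder values:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions and samples uncorrelated random minibatches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)       # oldest transitions are discarded first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```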
Separated Networks
The target is computed with a separate target network with parameters $\theta^{-}$, which are held fixed and only periodically copied from the online network $\theta$, so the target no longer moves with every gradient step:
$$ L(\theta) = \left( \left( r + \gamma \max_{a'} Q(s', a' \vert \theta^{-}) \right) - Q(s,a \vert \theta) \right)^2 $$
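A sketch of how the two networks could be kept in sync, reusing the hypothetical `QNetwork` above; the dimensions and the sync interval are placeholders, not values from the project:

```python
import copy

q_net = QNetwork(state_dim=8, n_actions=4)         # online network, theta (placeholder dims)
target_net = copy.deepcopy(q_net)                  # target network, theta^-

SYNC_EVERY = 1000                                  # placeholder interval
for step_count in range(100_000):
    # ... sample a minibatch from the replay buffer and take a gradient step on L(theta),
    # computing the max term with target_net instead of q_net ...
    if step_count % SYNC_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())  # theta^- <- theta
```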
Gallery
References
- Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).