Autonomous Ball-Collecting Robot using Reinforcement Learning

This is a project where we used reinforcement learning to train a robot to collect balls in a simple environment. The robot is a four-wheeled Mecanum-drive robot equipped with a camera and a 2D LiDAR sensor. The algorithms were implemented on an Intel NUC and an Nvidia Jetson TX2 board.

This is my attempt at rewriting the write-up, since recruiters often ask me how the project was done and I sometimes forget the details (it was done in 2018).


Optimal Policy and Optimal Value Functions

Bellman optimality equation

For optimal value function $v_{*}$:

$$ \begin{aligned} v_{*}(s) &= \max_{a} E \left[ R_{t+1} + \gamma v_{*}(S_{t+1}) \vert S_{t}=s, A_{t}=a \right] \\
&= \max_{a} \sum_{s',r} p(s',r \vert s,a) \left[ r + \gamma v_{*}(s') \right] \end{aligned} $$

For optimal action-value function $q_{*}$:

$$ \begin{aligned} q_{*}(s,a) &= E \left[ R_{t+1} + \gamma \max_{a'} q_{*}(S_{t+1}, a') \vert S_{t} = s, A_{t} = a \right] \\
&= \sum_{s',r} p(s',r \vert s,a) \left[ r + \gamma \max_{a'} q_{*}(s',a') \right] \end{aligned} $$
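
To make the backup concrete, here is a minimal value-iteration sketch in Python that repeatedly applies the Bellman optimality equation for $v_{*}$. The table layout `P[s][a] = [(prob, s_next, reward), ...]` and the tiny two-state MDP are made up purely for illustration:

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, tol=1e-6):
    """Repeatedly apply the Bellman optimality backup:
    v(s) <- max_a sum_{s', r} p(s', r | s, a) * [r + gamma * v(s')]."""
    v = np.zeros(n_states)
    while True:
        v_new = np.array([
            max(
                sum(prob * (reward + gamma * v[s_next])
                    for prob, s_next, reward in P[s][a])
                for a in range(n_actions)
            )
            for s in range(n_states)
        ])
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

# Tiny two-state example: in state 0, action 1 moves to state 1 and pays reward 1.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 1, 0.0)]},
}
print(value_iteration(P, n_states=2, n_actions=2))
```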

Monte Carlo Methods

Monte Carlo methods only require a sample of states, actions, and rewards from the agent's interaction with the environment. They are model-free: the probability distributions such as the state-transition function $p(s' \vert s,a)$ need not be known.
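
As a small illustration (not code from the project), an every-visit Monte Carlo estimate of $Q(s,a)$ can be computed by averaging the discounted returns observed after each $(s,a)$ pair in a set of sampled episodes:

```python
from collections import defaultdict

def mc_q_estimate(episodes, gamma=0.9):
    """Every-visit Monte Carlo: estimate Q(s, a) as the average discounted
    return observed after taking action a in state s.
    `episodes` is a list of episodes, each a list of (s, a, r) tuples."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        G = 0.0
        # Walk the episode backwards, accumulating the discounted return.
        for s, a, r in reversed(episode):
            G = r + gamma * G
            returns_sum[(s, a)] += G
            returns_count[(s, a)] += 1
    return {sa: returns_sum[sa] / returns_count[sa] for sa in returns_sum}

# Example with two short made-up episodes.
episodes = [[(0, 1, 0.0), (1, 1, 1.0)], [(0, 0, 0.0), (0, 1, 0.0), (1, 1, 1.0)]]
print(mc_q_estimate(episodes))
```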

Q-Learning

$$ Q(s,a) \leftarrow (1-\alpha)Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') \right] $$

Initialize Q(s, a) arbitrarily
Start with state "s"
Loop:
    Select action "a" with epsilon-greedy exploration
    Execute action "a", receive immediate reward "r", and observe next state "s'"
    Q(s, a) = Q(s, a) + alpha * [r + gamma * max{Q(s', a')} - Q(s, a)]
    s = s'
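
Below is a self-contained tabular Q-learning sketch in Python. The five-state chain environment and the hyperparameters are made up purely to illustrate the update rule; the robot task itself is of course far richer:

```python
import numpy as np

# Toy chain environment: states 0..4, action 0 = left, action 1 = right.
# Reaching the right end (state 4) gives reward 1 and ends the episode.
N_STATES, N_ACTIONS = 5, 2

def step(s, a):
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    done = s_next == N_STATES - 1
    return s_next, reward, done

Q = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma, epsilon = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

for episode in range(500):
    s, done = 0, False
    while not done:
        # Epsilon-greedy action selection.
        a = rng.integers(N_ACTIONS) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a').
        Q[s, a] += alpha * (r + gamma * (not done) * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.round(Q, 2))  # Q-values increase toward the rewarding end of the chain
```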

But what if the number of $(s,a)$ pairs is very large? Then it is not practical to keep a table entry $Q(s,a)$ for every pair. Instead, we can represent $Q(s,a)$ as a parameterized function $Q(s,a \vert \theta)$. This is where the neural network comes into play.

Q-Network

Training

We define the loss as:

$$ L(\theta) = \left( r + \gamma \max_{a'} Q(s', a' \vert \theta) - Q(s,a \vert \theta) \right)^2 $$

Our objective is to find the network weights $\theta$ that minimize this loss:

$$ \min_{\theta} L(\theta) = \min_{\theta} \left[ \left( r + \gamma \max_{a'} Q(s', a' \vert \theta) - Q(s,a \vert \theta) \right)^2 \right] $$

where $r$ is the immediate reward that we observe and the term $r + \gamma \max_{a'} Q(s', a' \vert \theta)$ serves as the approximate target.
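
A minimal sketch of this loss in PyTorch, assuming a small fully connected Q-network; the architecture, layer sizes, and batch layout here are illustrative assumptions, not the network that ran on the robot:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Small fully connected network mapping a state vector to one Q-value per action."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

def td_loss(q_net, batch, gamma=0.99):
    """Squared TD error: (r + gamma * max_a' Q(s',a'|theta) - Q(s,a|theta))^2."""
    s, a, r, s_next, done = batch                         # float states, int64 actions, float r/done
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a | theta)
    with torch.no_grad():
        # The target term is treated as a constant (semi-gradient update).
        target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    return ((target - q_sa) ** 2).mean()
```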

Convergence

However, using a neural network to represent the action-value function tends to be unstable due to:

  • Correlations between samples
  • Non-stationary targets

How did DeepMind address these issues?

  • Experience replay
  • Separated networks
  • Go deeper (added by myself)

Experience Replay
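
Experience replay stores transitions $(s, a, r, s', \text{done})$ in a large buffer and trains on minibatches sampled uniformly at random from it. This breaks the correlation between consecutive samples and lets each transition be reused many times. A minimal sketch in Python (the capacity and interface are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s_next, done) transitions."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive transitions.
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = zip(*batch)
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)
```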

Separated Networks

$$ L(\theta) = \left( r + \gamma \max_{a'} Q(s', a' \vert \theta^{-}) - Q(s,a \vert \theta) \right)^2 $$

where $\theta^{-}$ are the parameters of a separate target network. They are held fixed and only periodically copied from $\theta$, so the regression target stays approximately stationary between updates.
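
A sketch of the corresponding training step, reusing the illustrative PyTorch setup from above; the optimizer and sync schedule are assumptions, not the project's actual settings:

```python
import copy
import torch

def make_target_network(q_net):
    """theta^- starts as a copy of theta; its weights are frozen between syncs."""
    target_net = copy.deepcopy(q_net)
    for p in target_net.parameters():
        p.requires_grad_(False)
    return target_net

def train_step(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on (r + gamma * max_a' Q(s',a'|theta^-) - Q(s,a|theta))^2."""
    s, a, r, s_next, done = batch                         # float states, int64 actions, float r/done
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a | theta)
    with torch.no_grad():
        # The bootstrap target is computed with the target network (theta^-).
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    loss = ((target - q_sa) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every N environment steps, theta^- is re-synced to theta:
# target_net.load_state_dict(q_net.state_dict())
```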

