Autonomous Ball-Collecting Robot using Reinforcement Learning

This is a project where we used reinforcement learning to train a robot to collect balls in a simple environment. The robot is a four-wheeled Mecanum-drive robot equipped with a camera and a 2D LiDAR sensor. The algorithms were implemented on an Intel NUC and an Nvidia Jetson TX2 board.

This is my attempt at rewriting the write-up, since recruiters often ask me how the project was done and I sometimes forget the details (it was done in 2018).


Optimal Policy and Optimal Value Functions

Bellman optimality equation

For optimal value function $v_{*}$:

$$ \begin{aligned} v_{*}(s) &= \max_{a} E \left[ R_{t+1} + \gamma v_{*}(S_{t+1}) \vert S_{t}=s, A_{t}=a \right] \\
&= \max_{a} \sum_{s',r} p(s',r \vert s,a) \left[ r + \gamma v_{*}(s') \right] \end{aligned} $$

For optimal action-value function $q_{*}$:

$$ \begin{aligned} q_{*}(s,a) &= E \left[ R_{t+1} + \gamma \max_{a'} q_{*}(S_{t+1}, a') \vert S_{t} = s, A_{t} = a \right] \\
&= \sum_{s',r} p(s',r \vert s,a) \left[ r + \gamma \max_{a'} q_{*}(s',a') \right] \end{aligned} $$
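
To make the backup concrete, here is a minimal value-iteration sketch in Python that repeatedly applies the Bellman optimality equation for $v_{*}$. The table layout `P[s][a] = [(prob, s_next, reward), ...]` and the tiny two-state MDP are made up purely for illustration:

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, tol=1e-6):
    """Repeatedly apply the Bellman optimality backup:
    v(s) <- max_a sum_{s', r} p(s', r | s, a) * [r + gamma * v(s')]."""
    v = np.zeros(n_states)
    while True:
        v_new = np.array([
            max(
                sum(prob * (reward + gamma * v[s_next])
                    for prob, s_next, reward in P[s][a])
                for a in range(n_actions)
            )
            for s in range(n_states)
        ])
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

# Tiny two-state example: in state 0, action 1 moves to state 1 and pays reward 1.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 1, 0.0)]},
}
print(value_iteration(P, n_states=2, n_actions=2))
```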

Monte Carlo Methods

Monte Carlo methods only require a sample of states, actions, and rewards from the agent's interaction with the environment. They are model-free: the probability distributions such as the state-transition function $p(s' \vert s,a)$ need not be known.
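
As a small illustration (not code from the project), an every-visit Monte Carlo estimate of $Q(s,a)$ can be computed by averaging the discounted returns observed after each $(s,a)$ pair in a set of sampled episodes:

```python
from collections import defaultdict

def mc_q_estimate(episodes, gamma=0.9):
    """Every-visit Monte Carlo: estimate Q(s, a) as the average discounted
    return observed after taking action a in state s.
    `episodes` is a list of episodes, each a list of (s, a, r) tuples."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        G = 0.0
        # Walk the episode backwards, accumulating the discounted return.
        for s, a, r in reversed(episode):
            G = r + gamma * G
            returns_sum[(s, a)] += G
            returns_count[(s, a)] += 1
    return {sa: returns_sum[sa] / returns_count[sa] for sa in returns_sum}

# Example with two short made-up episodes.
episodes = [[(0, 1, 0.0), (1, 1, 1.0)], [(0, 0, 0.0), (0, 1, 0.0), (1, 1, 1.0)]]
print(mc_q_estimate(episodes))
```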

Q-Learning

$$ Q(s,a) \leftarrow (1-\alpha)Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') \right] $$

Initialize Q(s, a) arbitrarily
Start with state "s"
Loop:
    Select action "a" with epsilon-greedy exploration
    Execute action "a", receive immediate reward "r", and observe next state "s'"
    Q(s, a) = Q(s, a) + alpha * [r + gamma * max{Q(s', a')} - Q(s, a)]
    s = s'
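
Below is a self-contained tabular Q-learning sketch in Python. The five-state chain environment and the hyperparameters are made up purely to illustrate the update rule; the robot task itself is of course far richer:

```python
import numpy as np

# Toy chain environment: states 0..4, action 0 = left, action 1 = right.
# Reaching the right end (state 4) gives reward 1 and ends the episode.
N_STATES, N_ACTIONS = 5, 2

def step(s, a):
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    done = s_next == N_STATES - 1
    return s_next, reward, done

Q = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma, epsilon = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

for episode in range(500):
    s, done = 0, False
    while not done:
        # Epsilon-greedy action selection.
        a = rng.integers(N_ACTIONS) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a').
        Q[s, a] += alpha * (r + gamma * (not done) * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.round(Q, 2))  # Q-values increase toward the rewarding end of the chain
```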

But what if the number of $(s,a)$ pairs is very large? Then it is not practical to keep a table entry $Q(s,a)$ for every pair. Instead, we can represent $Q(s,a)$ as a parameterized function $Q(s,a \vert \theta)$. This is where the neural network comes into play.

Q-Network

Training

We define the loss as:

$$ L(\theta) = \left( r + \gamma \max_{a'} Q(s', a' \vert \theta) - Q(s,a \vert \theta) \right)^2 $$

Our objective is to find the network weights $\theta$ that minimize this loss:

$$ \min_{\theta} L(\theta) = \min_{\theta} \left[ \left( r + \gamma \max_{a'} Q(s', a' \vert \theta) - Q(s,a \vert \theta) \right)^2 \right] $$

where $r$ is the immediate reward that we observe and the term $r + \gamma \max_{a'} Q(s', a' \vert \theta)$ serves as the approximate target.
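
A minimal sketch of this loss in PyTorch, assuming a small fully connected Q-network; the architecture, layer sizes, and batch layout here are illustrative assumptions, not the network that ran on the robot:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Small fully connected network mapping a state vector to one Q-value per action."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

def td_loss(q_net, batch, gamma=0.99):
    """Squared TD error: (r + gamma * max_a' Q(s',a'|theta) - Q(s,a|theta))^2."""
    s, a, r, s_next, done = batch                         # float states, int64 actions, float r/done
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a | theta)
    with torch.no_grad():
        # The target term is treated as a constant (semi-gradient update).
        target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    return ((target - q_sa) ** 2).mean()
```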

Convergence

However, using a neural network to represent the action-value function tends to be unstable due to:

  • Correlations between samples
  • Non-stationary targets

How did DeepMind address these issues?

  • Experience replay
  • Separated networks
  • Go deeper (added by myself)

Experience Replay
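
Experience replay stores transitions $(s, a, r, s', \text{done})$ in a large buffer and trains on minibatches sampled uniformly at random from it. This breaks the correlation between consecutive samples and lets each transition be reused many times. A minimal sketch in Python (the capacity and interface are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s_next, done) transitions."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive transitions.
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = zip(*batch)
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)
```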

Separated Networks

$$ L(\theta) = \left( r + \gamma \max_{a'} Q(s', a' \vert \theta^{-}) - Q(s,a \vert \theta) \right)^2 $$

where $\theta^{-}$ are the parameters of a separate target network. They are held fixed and only periodically copied from $\theta$, so the regression target stays approximately stationary between updates.
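
A sketch of the corresponding training step, reusing the illustrative PyTorch setup from above; the optimizer and sync schedule are assumptions, not the project's actual settings:

```python
import copy
import torch

def make_target_network(q_net):
    """theta^- starts as a copy of theta; its weights are frozen between syncs."""
    target_net = copy.deepcopy(q_net)
    for p in target_net.parameters():
        p.requires_grad_(False)
    return target_net

def train_step(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on (r + gamma * max_a' Q(s',a'|theta^-) - Q(s,a|theta))^2."""
    s, a, r, s_next, done = batch                         # float states, int64 actions, float r/done
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a | theta)
    with torch.no_grad():
        # The bootstrap target is computed with the target network (theta^-).
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    loss = ((target - q_sa) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every N environment steps, theta^- is re-synced to theta:
# target_net.load_state_dict(q_net.state_dict())
```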

