Value Function Approximation

Intro

  • Tabular methods store one value estimate per state (or per state-action pair)
  • This becomes infeasible for large or continuous state spaces
  • Function approximation replaces the table with a parametrized function V(s;θ), turning value estimation into a standard machine-learning (regression) problem

Supervised Learning Formulation

To apply function approximation in RL we need a parameter vector θ such that the parametrized value function V(s;θ) approximates the true value function v_π(s). This can be framed as supervised learning: minimize the mean-squared error of the parametrized value function,

  J(θ) = E_π[ (v_π(s) - V(s;θ))² ]

The gradient can be computed analytically from the partial derivatives ∂V(s;θ)/∂θ_i, giving the (stochastic) gradient-descent update:

  Δθ = α (v_π(s) - V(s;θ)) ∇_θ V(s;θ)
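
A minimal sketch of one such stochastic-gradient step, assuming a linear approximator V(s;θ) = θ·φ(s); the feature map and step size below are illustrative placeholders, not part of these notes:

import numpy as np

def features(s, dim=8):
    # Hypothetical feature map phi(s); assumes integer state indices.
    # Any fixed feature vector of length `dim` per state works here.
    rng = np.random.default_rng(s)
    return rng.standard_normal(dim)

def sgd_value_update(theta, s, target, alpha=0.01):
    # One stochastic gradient step on the squared error (target - V(s; theta))^2.
    # For a linear V(s; theta) = theta . phi(s), the gradient w.r.t. theta is phi(s).
    phi = features(s)
    error = target - theta @ phi
    return theta + alpha * error * phi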

Estimating Target Value

The target might not be the true value v_π(s) but a noisy or bootstrapped estimate of it, e.g. the Monte Carlo return G_t or the TD(0) target r + γ V(s';θ).
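
For concreteness, a sketch of the two most common substitute targets, assuming the same linear V(s;θ) = θ·φ(s) and feature map as above:

import numpy as np

def mc_target(rewards, gamma=0.99):
    # Monte Carlo target: the full discounted return G_t (unbiased, but high variance).
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def td_target(r, s_next, theta, features, gamma=0.99):
    # TD(0) target: r + gamma * V(s'; theta) (biased by bootstrapping, lower variance).
    return r + gamma * (theta @ features(s_next))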

TD(λ)

The TD(λ) gradient-descent update uses the λ-return G_t^λ as the target:

  Δθ = α (G_t^λ - V(s_t;θ)) ∇_θ V(s_t;θ)

Here V is implicitly a function of θ, which is why its gradient appears in the update. The equivalent backward view keeps an eligibility trace e_t = γλ e_{t-1} + ∇_θ V(s_t;θ) and applies Δθ = α δ_t e_t for each TD error δ_t = r_{t+1} + γ V(s_{t+1};θ) - V(s_t;θ).
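
A sketch of the backward-view update over one episode, assuming a linear V(s;θ) = θ·φ(s) and a Gym-style reset()/step() environment interface (both are assumptions, not part of these notes):

import numpy as np

def td_lambda_episode(env, theta, features, policy,
                      alpha=0.01, gamma=0.99, lam=0.9):
    # Semi-gradient TD(lambda) with accumulating eligibility traces.
    # For linear V(s; theta) = theta . phi(s), grad_theta V = phi(s).
    s, _ = env.reset()
    e = np.zeros_like(theta)                  # eligibility trace
    done = False
    while not done:
        s_next, r, terminated, truncated, _ = env.step(policy(s))
        done = terminated or truncated
        v = theta @ features(s)
        v_next = 0.0 if terminated else theta @ features(s_next)
        delta = r + gamma * v_next - v        # TD error
        e = gamma * lam * e + features(s)     # decay trace, add current gradient
        theta = theta + alpha * delta * e
        s = s_next
    return theta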

LSPI

  • Least-Squares Policy Iteration
  • source: https://www.cs.duke.edu/research/AI/LSPI/jmlr03.pdf
  • RL for control
  • model-free
  • off-policy
  • uses LSTDQ (Least-Squares Temporal-Difference learning applied to the state-action value function Q; sketched in code below)
  • separates sample collection from the solution of the evaluation step, so the same sample set D can be reused in every iteration
function LSPI-TD(D, π0)
  π' <- π0
  repeat
    π <- π'
    Q <- LSTDQ(π, D)            # policy evaluation on the fixed sample set D
    for all s in S do           # simple greedy policy improvement based on Q
      π'(s) <- argmax_a Q(s, a)
  until π ≈ π'
  return π
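
A compact sketch of LSTDQ and the LSPI loop above, assuming a linear Q(s,a) = w·φ(s,a), a sample set D of (s, a, r, s', done) tuples, and a discrete action set; the feature map phi, the ridge regularizer, and n_actions are assumptions for illustration:

import numpy as np

def lstdq(D, policy, phi, gamma=0.99, reg=1e-6):
    # LSTDQ: closed-form least-squares fit of Q^pi for linear Q(s, a) = w . phi(s, a).
    # Solves A w = b with A = sum phi(s,a)(phi(s,a) - gamma*phi(s',pi(s')))^T, b = sum r*phi(s,a).
    k = phi(D[0][0], D[0][1]).shape[0]
    A = reg * np.eye(k)                       # small ridge term keeps A invertible
    b = np.zeros(k)
    for s, a, r, s_next, done in D:
        x = phi(s, a)
        x_next = np.zeros(k) if done else phi(s_next, policy(s_next))
        A += np.outer(x, x - gamma * x_next)
        b += r * x
    return np.linalg.solve(A, b)

def lspi(D, phi, n_actions, gamma=0.99, n_iters=20, tol=1e-4):
    # LSPI: alternate LSTDQ policy evaluation with greedy policy improvement
    # until the weight vector (and hence the greedy policy) stops changing.
    w = np.zeros(phi(D[0][0], D[0][1]).shape[0])
    greedy = lambda s: int(np.argmax([phi(s, a) @ w for a in range(n_actions)]))
    for _ in range(n_iters):
        w_new = lstdq(D, greedy, phi, gamma)  # evaluate the current greedy policy
        if np.linalg.norm(w_new - w) < tol:
            w = w_new
            break
        w = w_new
    return w, greedy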