From the course: Reinforcement Learning Foundations


SARSAMAX (Q-learning)


- [Instructor] SARSAMAX, also popularly known as Q-learning, is another form of temporal difference method. It is just another slight change to the Bellman equation, that is, to how the Q-table is updated. Quick recap: in SARSA, we use the same policy throughout. The agent is in a state, selects an action, gets the reward for that action, lands in the next state, and then chooses the next action with the same policy. After this cycle, it updates the action value of the first state, thereby updating the policy. This update cycle is different for SARSAMAX. The similarity between SARSA and SARSAMAX is that the same policy is used only up to the point where the agent lands in the second state. After that point, SARSAMAX updates the action value of the first state before the next action is chosen. The action value used in this update comes from the greedy policy, as opposed to the epsilon-greedy policy used in SARSA and in the Monte Carlo methods we saw earlier. Remember that greedy policies are policies that select actions that have the highest action value.
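The contrast described above can be sketched as two update rules on a Q-table. This is a minimal illustration, not code from the course; the names (`q_learning_update`, `sarsa_update`, `alpha`, `gamma`) and the tiny 2-state, 2-action table are assumptions chosen for the example:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """SARSAMAX (Q-learning): bootstrap from the greedy (max) action
    value of the next state, before any next action is actually chosen."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """SARSA: bootstrap from the action actually chosen in the next
    state by the same (epsilon-greedy) behavior policy."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

# Illustrative 2-state, 2-action Q-table (all zeros to start).
Q = np.zeros((2, 2))
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
# Q[0, 1] moves toward the TD target: 0.1 * (1.0 + 0.9 * 0.0 - 0.0) = 0.1
```

Note that `q_learning_update` never needs to know which action the agent will take next, which is why Q-learning is called off-policy, while `sarsa_update` requires `a_next` and is on-policy.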
