Reinforcement Learning - A Gentle Introduction

Reinforcement learning is what I thought machine learning would be when I first heard about it.

I remember reading Kasparov's book about his chess matches against Deep Blue, imagining all the clever algorithms they must have employed to create it. When I chose machine learning as a career many years later, I was surprised that the algorithms bore little resemblance to what I'd imagined. The algorithms predicted or classified – they didn't play.

But then I found reinforcement learning.

Reinforcement learning is a field where computer programs learn to play games. I've been studying it for some time now and wanted to share what I've learned.

Firstly, a warning. Conceptually, reinforcement learning is much tougher to understand and code than regular machine learning. If you're starting out, you shouldn't start here. However, I'll keep the descriptions reasonably high-level, because I aim to explain qualitatively how the models work rather than give a detailed account of the implementation.

Some definitions

Like all fields of study, reinforcement learning has jargon that acts as a barrier to entry. I will smash through the jargon as quickly as possible before getting onto the exciting parts. Here we go:

1. In reinforcement learning, there's an environment.

You can think of the environment as the totality of a game. It could be a chess game, the stock market, machines in a warehouse, or a video game. Anything with a stable set of rules.

2. There is an agent in the environment.

The agent is an entity that can act. If it's chess, the agent would be the player. If it's the stock market, it's the person buying or selling the shares.

3. The agent can act.

To act (or take an action) is a technical term. The rules of the game limit the actions. For example, if the game is chess, an action would be moving a piece. The set of possible actions depends on the state of the board and the game's rules.

4. The environment is always in a state.

The state of the environment is like the pieces on a chess board. Note that the set of possible states an environment can be in may be essentially infinite.

5. When an agent takes action, that action might change the state of the game.

If the game is chess, and the agent moves a piece, the game is now in a new state.

6. When an agent takes action, it receives a reward.

This is where things start to get interesting. A reward is not necessarily something that comes out of the game itself, but might be something we impose on it. For example, if the game is chess, we might be rewarded for taking an opponent's piece. That's exactly how I remember learning the game. A pawn is worth 1 pt, a bishop 3 pts, a rook 5 pts and so on. It allows me to evaluate the quality of a move quickly. If I exchange my rook for your knight, that's usually a mistake. But this idea of pieces being worth 'points' is entirely a fabrication. The winner of a chess game is the one who checkmates the opponent's king, not the one who scores the most 'points'. But tallying points can help me learn the early and mid-game by giving me feedback on how I'm doing. That way, I don't need to wait until the end of the game to figure out whether a move I made at the start was good. It's the same for a computer – quick feedback is preferred.
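
To make the reward idea concrete, a reward scheme like the chess points above might look like this in code. It's a minimal sketch: the piece values are the classic ones, but the function and its name are my own illustration.

    # Classic piece values, used as an imposed reward for captures
    PIECE_VALUES = {"pawn": 1, "knight": 3, "bishop": 3, "rook": 5, "queen": 9}

    def capture_reward(captured_piece):
        """Reward we impose for taking an opponent's piece; 0 if nothing was captured."""
        if captured_piece is None:
            return 0
        return PIECE_VALUES[captured_piece]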

7. This type of problem is called a control problem.

In a nutshell, reinforcement learning aims to have agents learn the best action to take. That is, the agent takes the state as input and returns the best action.
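
In code, the solution to a control problem is essentially a function from state to action. Here's a sketch, assuming the agent has already learned some table of action values (the names are my own):

    def best_action(state, legal_actions, action_values):
        """State in, best action out, according to the agent's current estimates."""
        return max(legal_actions, key=lambda action: action_values[(state, action)])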

8. The environment must exhibit the Markov property.

A game exhibits the Markov property when knowledge of the present state is enough for the agent to learn which actions to take. Games where agents must keep track of past states are not suitable for the reinforcement learning techniques presented here.

One point to note: some environments that do not naturally exhibit the Markov property can be made to do so by stuffing more information into what we call a 'state'. For example, if the environment we want to learn in is the video game Pong, then we might want a single state to include several frames of the game, so the agent has a chance to understand which way the ball is moving. If we only gave it a single frame, it would have no idea whether the ball was moving towards or away from it!
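
As a sketch of what 'stuffing more information into the state' can look like, here's a small frame-stacking helper. It's my own illustration, not taken from any particular library.

    from collections import deque

    import numpy as np

    class FrameStack:
        """Bundle the last few frames together so the 'state' captures motion."""

        def __init__(self, num_frames=4):
            self.frames = deque(maxlen=num_frames)

        def observe(self, frame):
            if not self.frames:                       # first frame of an episode
                self.frames.extend([frame] * self.frames.maxlen)
            self.frames.append(frame)
            return np.stack(self.frames)              # this stack of frames is the state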

9. In reinforcement learning, agents have a policy.

A policy determines the action an agent takes when presented with a state. In the game of pontoon, a simple policy could be 'twist whenever I have a score of less than 15'.

A policy might be a fixed rule (like in the pontoon example), or it might be a complex decision that can't easily be expressed in words.
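
In code, a fixed policy like that is just a function from state to action. Here's a sketch, assuming the state is simply the current score:

    def pontoon_policy(score):
        """A fixed rule: state in, action out."""
        return "twist" if score < 15 else "stick"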

10. Under a given policy, we can define the value of a state as the weighted sum of all the rewards that follow.

Imagine the environment is a computer game where the agent has to navigate out of a maze. A typical way of rewarding moves could be:

  • -1 reward for each move that isn't exiting the maze
  • +10 reward for exiting the maze

With this setup, the agent is rewarded for finding the exit to the maze. But it is also punished (with a negative reward) for taking extra moves. With this set of rewards, the agent will hopefully learn to find the exit to the maze efficiently.

When we say 'weighted sum', we mean that we multiply future rewards by a number (generally between 0 and 1) that lowers the weight of rewards arriving far in the future.

Why do we do that?

Intuitively, we could say that a move that happens far in the future may have little to do with the state the environment is in at the moment, and we don't want to give rewards for unconnected events.
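
To make the weighted sum concrete, here's a small sketch that computes the discounted value of a sequence of rewards, using the maze rewards above. The function name and the discount factor of 0.9 are my own choices.

    def discounted_value(rewards, discount_factor=0.9):
        """Weight each reward by discount_factor**t, so later rewards count for less."""
        return sum(discount_factor ** t * reward for t, reward in enumerate(rewards))

    # Five moves inside the maze, then the exit:
    print(discounted_value([-1, -1, -1, -1, -1, 10]))   # roughly 1.81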

11. The value of an action is simply the expected value of the state that follows, given that I've just taken the action.

Hopefully this is self-explanatory. I evaluate an action by saying its value is equal to the expected value of the game-state that follows from taking that action.

Q-learning

What have we learnt so far?

We have an agent in an environment that can take actions. An agent's action, which is determined by its policy, will change the state of the environment, and the agent will receive a reward. The value of a state is defined as the discounted sum of the rewards that follow.

Not too bad.

Now, here's the clever bit – Q-Learning.

Imagine that your agent has a set of beliefs about the value of taking an action while in a particular state. Q-learning is a method for your agent to update its beliefs, given some actual feedback from the world. Here's the update rule:

Q(state, action) -> Q(state, action) + learning_rate*[reward + discount_factor*max_a(Q(next_state, a)) - Q(state, action)]

If I drop the inputs to the functions to make it easier to read, we get:

Q -> Q + α[R + γ*max(Q_next_state) - Q]

In words, we update our belief about the value of taking an action while in a particular state. In bullets (just to hammer home the message):

  1. We take the action.
  2. We observe the reward.
  3. We use our current belief of the value of states to evaluate the value of the next state, times a discount factor.
  4. We subtract our current belief from that.
  5. We multiply that by a learning_rate α.
  6. We add that to our current belief.

If our original belief about the value of the state was accurate, then R + γ*max(Q_next_state) = Q. That is, we wouldn't perform any update.
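
Here's what that update can look like as code. This is a minimal tabular sketch with my own naming (the Q table is a dictionary keyed by (state, action) pairs), not a full training loop.

    from collections import defaultdict

    Q = defaultdict(float)   # our current beliefs about the value of (state, action) pairs

    def q_update(state, action, reward, next_state, next_actions,
                 learning_rate=0.1, discount_factor=0.9):
        """Nudge Q(state, action) towards reward + discounted value of the best next action."""
        best_next = max(Q[(next_state, a)] for a in next_actions) if next_actions else 0.0
        target = reward + discount_factor * best_next
        Q[(state, action)] += learning_rate * (target - Q[(state, action)])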

Just to give you some context, it took me about three days to get my head around this formula, so there's no need to rush (though, I accept, you might be quicker than I am). The really clever bit is that I'm using my current belief in the update formula. This is called bootstrapping (as in, pulling myself up by my bootstraps). It means I don't need to work out the actual value of a move in a given state. I simply observe that a given action had a bigger positive or negative reward than I was expecting, and update my beliefs from there. Eventually, my beliefs will converge on the correct values.

If you think about yourself making a move in chess, this approach is entirely sensible. You have a belief about the quality of a move. Maybe your policy is 'always play the move with the highest expected value'. So you play the move. But the opponent takes your queen! Not what you expected. So, you make a note to yourself: 'From now on, check whether he can take my queen.' From now on, you evaluate the value of a move through this new lens.

A note about policy: exploration vs exploitation

In the previous equation, we left out any mention of policy. That is, I didn't say how the agent decided to take the action. Just that, however it decided to take the action, it had a belief about the action's value before it took the action, and then it updated its belief by observing what actually happened.

There are lots of ways of deciding on a policy. We already saw one earlier: take the action you believe has the highest payoff.

Wouldn't you always just follow this policy? The answer is: it depends on what you're trying to achieve.

You might be trying to actually win the game. In which case, you want your agent to take the best move it can. However, you might want the agent to explore lots of different states, in which case, you might have it take random actions.

One possibility is to start the agent off acting randomly, and then, later, have it take the moves with the highest value. This will allow it to see what rewards it gets in a whole range of situations (exploration) before settling on the best moves (exploitation).

This is one way in which reinforcement learning differs from other types of machine learning – the agent interacting with the game creates the dataset, which it then uses to update its beliefs. Sometimes, you want your agent to play randomly, so it can see different game states. It doesn't matter if it loses while training.
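
A common way to code this up is an 'epsilon-greedy' policy: act randomly with probability epsilon, act greedily otherwise, and shrink epsilon as training goes on. Here's a sketch with my own names, reusing a Q table like the one in the earlier snippet.

    import random

    def epsilon_greedy(state, legal_actions, Q, epsilon):
        """Explore with probability epsilon, otherwise exploit current beliefs."""
        if random.random() < epsilon:
            return random.choice(legal_actions)                    # exploration
        return max(legal_actions, key=lambda a: Q[(state, a)])     # exploitation

    # Start almost fully random, then lean more and more on the learned values.
    epsilon = 1.0
    for episode in range(1000):
        epsilon = max(0.05, epsilon * 0.995)
        # ... play one episode, choosing each action with epsilon_greedy(...)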

Conclusion

This is nearly 2,000 words, and I've barely scratched the surface of reinforcement learning. That said, if you understand the algorithm described here, much of the literature will make sense to you.

If you want to learn more about the topic, and look at some actual code, I strongly recommend Manning's book. In fact, I've yet to come across a bad book in their library – they're both cheap and of the highest quality.