Q Learning — From the basics

Robert MacWha · Published in Nerd For Tech · Jul 13, 2021 · 9 min read

Reinforcement learning (RL) is a subset of AI research that deals with training agents to maximize a reward function. RL systems are generally used in tasks where it is difficult to judge single actions but easy to score an overall performance. These systems have two core components: agents and environments.

Agents

An agent is a function that takes in a given state and returns an action. These systems can be built on anything — neural networks and lookup tables being two common examples. Neural networks are often used in applications with complex environments whereas modified lookup tables (Q-Tables) are used in simpler environments.

Environments

The environment is the hard-coded system that the agent learns to navigate. For now, we can treat the environment as a black box that takes in an action and returns a reward and the next state.

Agents act in an environment which then returns a state and a reward

As an example, imagine using a game of pong as an environment. The agent takes an action, say moving the paddle down, which is processed by the environment. The environment then returns the next state, say the position of both paddles and the ball, plus a reward that measures how well the agent is playing.
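In code, that interaction loop looks roughly like the sketch below. It assumes env is a Gym-style environment object (Gym is introduced in the next section) and uses a placeholder choose_action function standing in for whatever agent we end up building:

def choose_action(state):
    # placeholder agent: for now it just picks a random valid action
    return env.action_space.sample()

state = env.reset()  # get the initial state
done = False

while not done:
    action = choose_action(state)                 # the agent maps a state to an action
    state, reward, done, info = env.step(action)  # the environment returns the next state and a reward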

Building a Reinforcement Learning model

All code in this article is written in Python. OpenAI’s Gym library provides the environments and can be installed through pip (pip install gym) or Anaconda. An alternative option is to use Google Colab; however, you won’t be able to watch the agent operate in the environment.

Creating a Random Hill Climber

Let’s start by building the simplest kind of agent: a random walker. A random walker is an agent that takes random actions regardless of the state it is in. This makes it extremely simple to build, but it will also never improve, making it an ineffective RL model.

To create this random walker we need to do two things: build an environment and build an agent. All the environments in this article will be constructed using the Gym library. After installing the library, gym.make() can be used to initialize the environment.

import gym

env = gym.make('MountainCar-v0')  # create the hill climbing environment
env.reset()                       # remember to reset the environment before taking any actions

Creating the agent is even easier: since it’s just a random walker, we can use Gym’s env.action_space.sample() function to generate a valid random action. Each action is an integer, where 0 means move to the left, 2 means move to the right, and 1 means do nothing.

action = env.action_space.sample()
env.step(action)

Now we can put the agent’s action code inside of a for loop, so it takes more than one action, and there you go: a random walker in a newly created environment.

import gym

env = gym.make('MountainCar-v0')
env.reset()

for i in range(1000):
    env.render()

    action = env.action_space.sample()
    env.step(action)

env.close()
Sample random walker — script available here

Building Intelligent Models

Having a random walker is fine if you just want to see an agent interacting with an environment, but something more complex is needed if we want the agent to learn. The easiest way to do this is by making a Q-Table.

A lookup table for the game of pong

A Q-Table can be thought of as a lookup table. In one column you have every state that the agent might encounter. In the other column, you have the action that should be taken in that state.

To create this Q-Table we need to know two things: how many actions the agent can take, and how many states it can be in.

Getting the number of actions is relatively easy — it can just be read from the env.action_space.n variable. The state-space can also easily be found by taking the length of a random observation.

ACTION_SPACE = env.action_space.n                         # 3
OBSERVATION_SPACE = len(env.observation_space.sample())   # 2 (continuous values)

Creating Q-Tables

Once the action and state spaces are determined we can start to build the Q-Table. The first thing to do is to take the continuous observation space returned by the environment and chop it up into a set of discrete cells. This is done because a continuous observation space contains an infinite number of distinct states.

These two states are technically different but they are so similar that it doesn’t make sense to treat them as such.

Why do we care that there are infinitely many distinct states? Because without grouping them we’d never be able to build the Q-Table. Imagine writing an instruction book where you need a separate page for every possible configuration of date, time, weather, and the user’s middle name. It couldn’t be done. So instead we’ll combine similar states into discrete blocks.
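As a small illustration (the numbers are just for demonstration, using the position bounds that MountainCar-v0 reports, roughly -1.2 to 0.6), two nearly identical car positions land in the same discrete cell once the position axis is split into 20 cells:

POSITION_MIN, POSITION_MAX = -1.2, 0.6  # position bounds reported by MountainCar-v0

def position_to_cell(position, cells=20):
    normalized = (position - POSITION_MIN) / (POSITION_MAX - POSITION_MIN)  # scale to 0-1
    return int(round(normalized * (cells - 1)))                             # map to a cell index

print(position_to_cell(-0.500))  # 7
print(position_to_cell(-0.503))  # 7 - treated as the same state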

So, let’s create the Q-Table. For this environment, it will be of shape [20, 20, 3] — the first two dimensions corresponding to 20 discrete cells for each of the parameters in the observation space and the final dimension corresponding to the three actions.

import numpy as np

Q_INCREMENTS = 20                                      # the number of discrete cells per observation parameter
DISCRETE_OS_SIZE = [Q_INCREMENTS] * OBSERVATION_SPACE  # a list of shape [20, 20]

q_table = np.random.uniform(
    low=-1,                                    # low value is the minimum reward
    high=0,                                    # high value is the maximum reward
    size=(DISCRETE_OS_SIZE + [ACTION_SPACE]))  # an array of shape [20, 20, 3]

Now that we have the Q-Table we also need to build a function that can convert the continuous state into a discrete cell so we can index the Q-Table.

def obs_To_Index(env, observation, increments):

    # get the range of the observation space so it can be normalized
    obs_min = env.observation_space.low
    obs_max = env.observation_space.high

    # normalize the observation so it goes from 0-1
    obs = (observation - obs_min) / (obs_max - obs_min)

    # convert the normalized observation into integer indices between 0 and increments - 1
    indice = tuple(np.round(obs * (increments - 1)).astype(int))

    return indice
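As a quick sanity check, we can run a fresh observation through the function. The exact indices will vary because the car starts at a random position near the valley floor, but the result should look something like this:

observation = env.reset()
print(obs_To_Index(env, observation, Q_INCREMENTS))  # e.g. (7, 10) - position near the valley, velocity of zero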

We can now modify the random walker to take actions based on a pre-generated Q-Table with these two code blocks. In any given state the walker will take the action with the highest corresponding value in the Q-Table. This is done because later down the line when we start training the agent, the highest value will correspond with the action that generates the highest future reward.

# store the initial state of the environment
done = False
observation = env.reset()

while not done:
    env.render()

    # get the action corresponding to the current observation
    indice = obs_To_Index(env, observation, Q_INCREMENTS)
    action = q_table[indice].argmax()

    # take the action
    new_observation, reward, done, info = env.step(action)
    observation = new_observation

env.close()

Training

Now that we’ve got a Q-Table everything should just work, right? Not quite. We have the instruction booklet but currently all of the pages are just filled with gibberish. The next step is to update the values of the Q-Table to better reflect what actions the agent should take in a given state.

To update the Q-Table we can use the Bellman equation, which estimates the future reward obtainable from the current state and the reward just received. The update works by slowly nudging the value in the Q-Table toward this predicted future reward, a numeric representation of how ‘good’ taking an action in a given state is.

How do we find this predicted future reward? Well, that’s the value that we’re storing in the Q-Table. Ergo, to calculate the new value for the Q-Table, we can use this equation:
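Written with the learning rate (lr) and discount (Y) used below, the update is the standard Q-learning rule:

new_q = (1 - lr) * current_q + lr * (reward + Y * max(q_table[new_state]))

Here current_q is the value currently stored for the state-action pair, and max(q_table[new_state]) is the best value reachable from the next state.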

This equation can then be translated into code and implemented in our algorithm so the agent can start to learn. First of all the learning rate (lr) and discount (Y) hyperparameters need to be defined.

LEARNING_RATE = 0.1 # lr - how quickly values in the q table change
DISCOUNT = 0.95 # Y - how much the agent cares about future rewards

Then we can specify the number of epochs to train the model for. Think of an epoch as a single training cycle. The agent won’t magically learn everything in a single lesson so we need to reinforce the lessons by training it many times.

EPOCHS        = 5000

Finally, we can convert the above Bellman equation into Python code so it can be integrated into our script.

# calculate the predicted future reward
new_indice = obs_To_Index(env, new_observation, Q_INCREMENTS)
future_reward = reward + DISCOUNT * q_table[new_indice].max()

# update the value in the q table
current_q = q_table[indice + (action,)]
new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * future_reward
q_table[indice + (action,)] = new_q

Once that’s complete, we’ll modify the training section of our previous script to integrate the training code.

for e in range(EPOCHS):

    # store the initial state of the environment
    done = False
    observation = env.reset()

    while not done:
        env.render()

        # get the action corresponding to the current observation
        indice = obs_To_Index(env, observation, Q_INCREMENTS)
        action = q_table[indice].argmax()

        # take the action
        new_observation, reward, done, info = env.step(action)

        # train the Q-Table: calculate the predicted future reward
        new_indice = obs_To_Index(env, new_observation, Q_INCREMENTS)
        future_reward = reward + DISCOUNT * q_table[new_indice].max()

        # update the value in the q table
        current_q = q_table[indice + (action,)]
        new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * future_reward
        q_table[indice + (action,)] = new_q

        # update the observation
        observation = new_observation

env.close()

Now that you’ve finished writing the script, give it a run and wait for a few hundred epochs while the agent trains. This might take a while depending on your hardware, but once it’s done you’ll be left with one very capable car.
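Most of that time is usually spent drawing the window. One common tweak (not part of the script above, and SHOW_EVERY is an assumed value) is to call env.render() only on an occasional epoch, something like this:

SHOW_EVERY = 500  # visualize one epoch out of every 500

for e in range(EPOCHS):
    render_this_epoch = (e % SHOW_EVERY == 0)

    done = False
    observation = env.reset()

    while not done:
        if render_this_epoch:  # only draw the window on the selected epochs
            env.render()

        indice = obs_To_Index(env, observation, Q_INCREMENTS)
        action = q_table[indice].argmax()
        new_observation, reward, done, info = env.step(action)

        # (the Q-Table update code from the script above goes here)

        observation = new_observation

env.close()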

Sample fully trained Q-Table — script available here

Epsilon

There’s one last tweak that should be made to the agent so it can learn to navigate more complex environments. This agent can currently navigate simple environments, however, once it’s found a solution it will stop improving even if the solution is sub-optimal. This is because improving its performance would necessitate performing actions that don’t lead to the highest predicted reward, something that the agent can’t currently do.

This won’t be an issue in simpler environments, but in more complex simulations where there are different methods of getting rewards it can be detrimental to an agent’s success.

So, how do we stop the agent from taking the same actions over and over again? By forcing it to take random actions some percentage of the time. This is what epsilon does: it makes the agent explore the environment and search for better solutions.

Epsilon’s implementation is quite simple. All you’ll need are two new hyperparameters — Epsilon and Epsilon decay.

EPSILON = 0.5          # how often the agent takes random actions
EPSILON_DECAY = 0.9998 # rate at which epsilon gets reduced

Epsilon controls the rate at which the agent takes random actions. At the start of the training process this number will be quite high, but as the agent improves we’ll reduce it. This reduction process is handled by the epsilon decay variable.

Implementing this system necessitates changing the code where an action is selected. Rather than always taking the action specified by the Q-Table, the model will now sometimes take a random exploratory action.

import random

# select the action to take
if random.uniform(0, 1) < EPSILON:
    action = env.action_space.sample()  # random action (exploration)
else:
    action = q_table[indice].argmax()   # action from the q table (exploitation)

Lastly, we need to reduce epsilon at the end of every epoch.

EPSILON *= EPSILON_DECAY
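To get a feel for how quickly the exploration fades, you can simulate the schedule on its own. With the values above, epsilon only drops from 0.5 to roughly 0.18 over 5000 epochs:

epsilon = 0.5
for e in range(5000):
    epsilon *= 0.9998

print(epsilon)  # roughly 0.18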

That’s it! For simple environments like this one, the agent might take slightly longer to train with epsilon implemented, but this is a fair price to pay for finding an optimal solution.

Sample fully trained Q-Table with epsilon — script available here

Thanks for reading my article! Feel free to check out my portfolio, message me on LinkedIn if you have anything to say, or follow me on Medium to get notified when I post another article!
