Reinforcement Learning with OpenAI Gym

Python project, TensorFlow.

In previous articles, we have explored different areas of computer vision using neural networks. Here, we will discuss a new area called Reinforcement Learning that uses neural networks to create agents capable of performing a task. This article describes several Reinforcement Learning techniques and shows how to build an agent capable of training in different OpenAI Gym environments.

GitHub link: https://github.com/Apiquet/Reinforcement_learning

Table of contents

  1. Overview
  2. Q-Learning in Toy Text environments
    1. Environment explanations
    2. Agent evolution
  3. Reinforce in Box2D environments
    1. Environment description
    2. Training loop
    3. Agent implementation
      • Observe and decide methods
      • Discount reward
      • Model and train method
    4. Results
  4. Reinforce in Classic Control environments
  5. Conclusion

1) Overview

In this article, we will explore a new domain of application for Deep Learning. In this field, we want to teach agents to behave in a specific way within an environment. For instance, we can teach a lunar lander to land at a specific position:

Source: gym.openai

We can also teach a humanoid to walk as fast as possible in a 3D environment:

Source: gym.openai

There is no limit to this domain, as many applications involve agents: autonomous systems (cars, drones, industrial robots, etc.), games (to build opponents for human players), bots (trading), etc. The key here is for an agent to learn from its experience by running repeated episodes in the environment until its behavior matches the expected one.

Even if each environment and agent is unique, we can build a generic way to encode them so that agents can learn. At a specific time t, the agent and its environment are in a specific state and the agent can take specific actions. As we want to focus on the agents' implementation, we will use environments from gym.openai. This library provides many kinds of environments with built-in methods to get the environment's state and to take actions. We will show how to use it in the next sections.

To understand the concept of state and action, we will use a simple environment: FrozenLake-8×8:

In this environment called Frozen Lake, we have:

  • S: starting point
  • F: frozen surface where the agent can walk
  • H: hole, the agent falls into the water
  • G: goal to reach
  • Highlight: current agent position

The agent can move Up, Down, Left and Right, so there are 4 possible actions. As the environment is static (the locations of the holes do not change), the state is the agent's position, so there are 8×8=64 possible states. Depending on the result of a particular action on the environment, the agent gets rewards or penalties that indicate how good or bad its move was.

Even if an environment is more complex, the final encoding will always be a set of N-dimensional states and a particular set of actions.

The next section explains how to use this encoding to make an agent reach the goal instead of falling into a hole as in the previous animation.

2) Q-Learning in Toy Text environments

The Q-Learning technique is a way to figure out the best action to take in each possible state. It combines the states and the actions into a single matrix named the Q-table, of shape S×A, with S the number of states and A the number of possible actions. In our Frozen Lake environment, it will be a matrix of shape (64, 4). This table is updated during training with the rewards or penalties that the agent gets from a particular action:

Q(s_t, a_t) ← (1 − α) · Q(s_t, a_t) + α · (r_t + γ · max_a Q(s_{t+1}, a))

current time: t, action: a, reward: r, state: s, learning rate: α, discount factor: γ (source: Wikipedia)

As the current state is not only the result of the previous action but of all the previous ones, we need to update the values as a function of the expected future rewards. This is the role of the max Q term, which takes the maximum expected reward over the actions available from the next state.

A last thing to do is to add an epsilon term that prevents the agent from getting stuck in a suboptimal behavior. Indeed, with the simple strategy explained above, the agent would always take the action that leads to the maximum expected reward. However, if we never try other actions randomly, the agent will keep choosing the same action simply because it already got a reward for it. To allow the agent to find the optimal action, we add this epsilon term: with probability epsilon, the agent takes a random action instead of the one that led to the best reward so far (based on its previous experience). This is called epsilon-greedy action selection; it balances exploration and exploitation with a probability.

2-1) Environment explanations

Note: all the code is available here.

To build the Frozen Lake environment, we only need to install gym and then run the following code:
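As a reference, here is a minimal sketch of that code, assuming the FrozenLake8x8-v0 environment id and the classic gym API (reset() returning the state, render() printing the grid) used in the rest of this article:

import gym

# create the 8x8 Frozen Lake environment
env = gym.make("FrozenLake8x8-v0")

# reset the environment to its initial state and display it
state = env.reset()
env.render()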

The .reset() method resets the environment and .render() displays it. We can see that our agent is on the starting point. Then, we can confirm the number of actions and states we previously calculated:
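For instance, with something like the following (a sketch under the same gym API assumption):

# 4 possible actions (Left, Down, Right, Up) and 8x8 = 64 possible states
print(env.action_space.n)       # 4
print(env.observation_space.n)  # 64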

We also need a way to get the current state and reward after doing an action. The .step(action) method gives us all needed information:
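A possible call, still assuming the classic 4-value return of env.step():

# try to move right (action 2) from the starting point
state, reward, done, info = env.step(2)
print(state, reward, done, info)
# possible output (the move is stochastic): 1 0.0 False {'prob': 0.3333333333333333}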

We can see that .step(action) takes as input an action (in this environment, a number from 0 to 3 for [“Left”, “Down”, “Right”, “Up”]) and returns [state, reward, done, info].

  • The state is a number from 0 to 63: the starting point is state 0, we are currently in state 1, the F below the starting point is state 8, etc.
  • The reward is the reward obtained after a move: in this environment we always get 0, except when we reach the goal, where we get 1,
  • Done indicates whether the agent fell into a hole or reached the goal (end of the episode in both cases),
  • info can be anything, but in this environment it indicates the probability that the agent fails when it tries to make a move: as the ice is slippery, the agent won’t always move in the intended direction. This probability makes learning more difficult because the agent does not know if it fell into a hole because it slipped or because it made a wrong move.

Finally, we have the reward table that tells us, for a specific state and action, what the next state and reward will be. For instance, if we are in state 1:
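The table can be inspected through the environment's P attribute (accessing it through unwrapped is an assumption about how the Toy Text environments expose it):

# transitions for state 1: for each action, a list of
# (probability, next_state, reward, done) tuples
print(env.unwrapped.P[1])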

For each state, we have 4 different actions. However, we can see that for action 0 (“go left”), we only have a 33% chance of ending up in the intended state 0 (the agent actually moves to the left). We also have a 33% chance of staying in the current position and a 33% chance of going down. This makes learning much harder.

If we display the reward table for a hole, we have:
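Using the same assumed P attribute, this time for the hole at state 19:

print(env.unwrapped.P[19])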

In a hole, the “done” Boolean is True (meaning the episode is finished) and the agent cannot get out of the hole anymore (all the actions lead to the same state: 19). In this case, we need to reset the environment and try again. After multiple tries, the agent should learn not to fall into this hole.

2-2) Agent evolution

As a first try, we can use the built-in method .action_space.sample() to perform random actions and see what happens. A simple loop that takes random actions until the “done” Boolean is set to True is enough:

env.reset()
imgs = []
done = False

while not done:
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)
    imgs.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward,
        'done': done
    })

As we can guess, the agent will fall into a hole quickly:

We can now implement the Q-learning technique we saw previously. First, we initialize our Q-table of shape (number of states, number of actions) = (64, 4):

import numpy as np
q_table = np.zeros([env.observation_space.n,
                    env.action_space.n])

To update our loop, we first need to choose which action to take: a random one or the one with the maximum expected reward:

import random

state = env.reset()
done = False

while not done:
    # with probability epsilon, take a random action (exploration)
    if random.uniform(0, 1) < epsilon:
        action = env.action_space.sample()
    # otherwise, take the action that led to the best reward so far (exploitation)
    else:
        action = np.argmax(q_table[state])

    next_state, reward, done, info = env.step(action)

In this environment, the reward is always 0, except when we reach the goal, where we get 1. I chose to help the agent discriminate between good and bad moves by reshaping the rewards:

# if in a hole
if done and reward != 1:
    reward = -10
# if on a frozen position
elif reward == 0:
    reward = -0.5
# if goal reached
else:
    reward = 20

Finally, we can update the q_table with the new value:

# get the current expected reward for the given action in the current state
current_expected_reward = q_table[state, action]

# get the max expected reward from the next state
next_state_max_reward = np.max(q_table[next_state])

# update the q-table with the Q-learning formula
# (alpha: learning rate, gamma: discount factor)
updated_reward = (1 - alpha) * current_expected_reward + alpha * (reward + gamma * next_state_max_reward)
q_table[state, action] = updated_reward

# move to the next state before the next iteration
state = next_state

After 200,000 episodes, the agent has a much better behavior:

Even if the agent only has a 33% probability of moving in its intended direction (because it can slip on the frozen surface), it reached the goal in only 30 moves.

The code is available here; it can be reused in the other Toy Text environments that can be found here.

Q-Learning is a simple algorithm to introduce the field of Reinforcement Learning. The next sections will go deeper with more complex environments and techniques.

3) Reinforce in Box2D environments

The code can be found on my GitHub.

3-1) Environment description

In this section, we will work with the Lunar Lander environment, in which the agent must learn to land at a specific position:

According to the documentation, the agent receives the following rewards during an episode:

  • reward for moving from the top of the screen to landing pad and zero speed is about 100…140 points,
  • if lander moves away from landing pad it loses reward back,
  • episode finishes if the lander crashes or comes to rest, receiving additional -100 or +100 points,
  • each leg ground contact is +10,
  • firing main engine is -0.3 points each frame,
  • solved is 200 points.

The fuel is infinite and the possible actions are:

  • do nothing,
  • fire left orientation engine,
  • fire main engine,
  • fire right orientation engine.

A state is composed of 8 attributes:

  • x position
  • y position
  • x velocity
  • y velocity
  • angle
  • angular velocity
  • leg 1 ground contact (1 if in contact, 0 otherwise)
  • leg 2 ground contact (1 if in contact, 0 otherwise)

Our agent will take as input a state composed of 8 attributes and will output an action number between 0 and 3.
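As a quick check, the environment can be created and inspected the same way as before (a sketch, assuming the LunarLander-v2 id and the same gym API):

import gym

env = gym.make("LunarLander-v2")
print(env.observation_space.shape)  # (8,): the 8 attributes listed above
print(env.action_space.n)           # 4: the 4 possible actions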

3-2) Training loop

To train our agent, we need to loop over many episodes. For each episode, the following steps must be done in sequence:

  • reset environment and set total episode rewards to 0
  • do until the episode is done
    • ask the agent for an action given the current state
    • do the action
    • get the new state, the received reward and the done boolean
    • add the reward to the total episode rewards
    • save the current state, the action taken and the received reward into the agent's memory
  • train the agent with the data collected from every frame of the whole episode

We can also add a step that prints the evolution of the received rewards to check that the agent is learning over the episodes. We can also save the frames of some episodes (for instance, every 100 episodes) to visualize the evolution of the agent's behavior between episodes 0, 100, 200, etc. These two steps are implemented in the training loop below:

from IPython.display import clear_output
import matplotlib.pyplot as plt


def training(agent, env, number_of_episodes=3000,
             saveEpisodeInterval=300, printRewardInterval=20):

    rewards_per_episode = []
    frames_per_episode = []

    for episode_idx in range(number_of_episodes):

        # print previous episode results at wanted interval
        if (episode_idx+1) % printRewardInterval == 0:
            clear_output()
            plt.xlim(0, number_of_episodes)
            plt.plot(rewards_per_episode)
            plt.pause(0.0001)
            print(f"Episode n°{episode_idx}: {sum_rewards} rewards")
    
        # reset episode var
        frames = None
        state = env.reset()
        sum_rewards = 0

        # save frames at wanted interval
        if (episode_idx+1) % saveEpisodeInterval == 0:
            frames = []

        while True:

            # save frames at wanted interval
            if frames is not None:
                frames.append(env.render(mode='rgb_array'))

            # Take action
            action = agent.decide(state)
            next_state, reward, done, info = env.step(action)
            sum_rewards += reward

            # Store information for training
            agent.observe(state, action, reward)

            # Train agent with observed episode
            if done:
                rewards_per_episode.append(sum_rewards)

                # save frames at wanted interval
                if frames is not None:
                    frames_per_episode.append(frames)

                # train agent
                agent.train()
                break

            state = next_state
            
    return frames_per_episode, agent

3-3) Agent implementation

For the agent implementation, we will use a neural network. For those who are not comfortable with dense layers, activation functions, optimizers and backpropagation, all the neural network basics are covered in my series “Neural Network from scratch”.

3-3-a) Observe and decide methods

As mentioned in the previous section, we must be able to do the following two actions with our agent:

  • ask the agent for an action given the current state
  • save the current state, the action taken and the received reward into the agent's memory

The first one is handled by the decide() method. This method takes the current state as input and returns an action. However, as mentioned in the first section with the Frozen Lake environment, we cannot always take the action with the maximum probability, because we still want our agent to explore possibly better actions that have not been found yet. For that, we sample one of the 4 actions according to the predicted probabilities. For instance, if the agent outputs [0.5, 0.1, 0.3, 0.1], we choose action 1 with probability 0.5, action 2 with probability 0.1, etc. This allows our agent to explore other actions. Once the agent is sufficiently trained, the probabilities of the actions that lead to bad rewards should be close to 0, and that of the best action close to 1:

def decide(self, state):
    """ 
    Get prediction from model with the observed state
    Choose an action from the output distribution

    Args:
        - (object) environment-specific object representing the observation
        
    Return:
        - (int) action to take
    """
    
    output = self.model.predict(state.reshape(-1, self.n_obs))
    choice = np.random.choice(self.n_act, p=output.reshape(-1))
    self.probability_action.append(output.reshape(-1)[choice])
    return choice

Another method to implement is observe(). This method saves the current state, the chosen action and the received reward, allowing the agent to know that, for a particular state and action, it received a certain reward:

def observe(self, state, action, reward):
    """
    Save state, action and reward to agent memory

    Args:
        - (object) environment-specific object representing the observation
        - (float) action that the agent did
        - (float) reward that the agent got
    """
    self.episode_observations.append(state)
    self.episode_actions.append(action)
    self.episode_rewards.append(reward)

These two methods will allow:

  • to save experiences to the agent’s memory
  • to take an action given a specific state. A non-zero probability is kept of exploring actions that currently lead to lower rewards based on previous experience. Thanks to this probability, the agent is able to explore all possible actions for any state, and this probability decreases over training as the agent becomes more confident about which action to take for a given state.

3-3-b) Discount reward

As explained in the first section with the Frozen Lake environment, the current state and reward are not the result of the very last action alone. Indeed, they are the consequence of the whole sequence of previous actions. This introduces the need to take all the previously received rewards into account when evaluating the reward currently received by the agent. To do so, we will use the following equation:

dreward_t = r_t + γ·r_{t−1} + γ²·r_{t−2} + … + γ^t·r_0

with γ (gamma) in [0, 1]. For instance, this equation means that for a particular reward n°2, we will use for our agent the discounted reward:

dreward_2 = r_2 + γ·r_1 + γ²·r_0

As gamma is between 0 and 1, the oldest rewards count less than the most recent ones.

def get_discounted_reward(self):
    """ Return discounted episode returns at each step in the episode """
    rewards = np.array(self.episode_rewards)
    discounted_rewards = np.zeros_like(rewards)
    tmp_sum = 0

    # accumulate the rewards from the start of the episode,
    # discounting older rewards by gamma at each step
    for t in range(len(rewards)):
        tmp_sum = tmp_sum * self.gamma + rewards[t]
        discounted_rewards[t] = tmp_sum

    return discounted_rewards

3-3-c) Model and train method

As described in the first section, our agent's model must take as input the 8 observations of a state and can take 4 different actions. Its output should be 4 numbers representing the probability of choosing each action. These probabilities can be generated with a softmax activation (described in the following article):

# build network
self.model = tf.keras.models.Sequential([
    # take as input the observations
    tf.keras.layers.Dense(16, input_dim=self.n_obs, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    # output a policy distribution over the possible actions
    tf.keras.layers.Dense(self.n_act, activation='softmax',
                          activity_regularizer=self.entropy)
])
opt = tf.keras.optimizers.Adam(learning_rate=self.policy_learning_rate)
self.model.compile(optimizer=opt, loss='categorical_crossentropy')

Then, we can train our model with the get_discounted_reward() method implemented in the previous section and the built-in train_on_batch() method from Keras:

def train(self):
    """ Train model with discounted reward and reset observations lists """
    
    discounted_reward = self.get_discounted_reward()
    actions_one_hot_encoding = np.array(
        tf.keras.utils.to_categorical(self.episode_actions, self.n_act))
    state = np.array(self.episode_observations)

    self.model.train_on_batch(state.reshape(-1, self.n_obs),
                              actions_one_hot_encoding,
                              sample_weight=discounted_reward)
    self.reset_observation_lists()

3-4) Results

The provided training loop plots the received rewards over the episodes. After 3k episodes, we get the following graph:

The agent started at around -250 reward per episode and finished at around 100. We can also see that the variance decreases over the episodes; this is due to the exploration factor, which also decreases over the episodes (section 3-3-a).

The final agent (available on my GitHub profile) is able to land on the moon:

The corresponding notebook is available here.

4) Reinforce in Classic Control environments

The agent and the training loop developed in the previous section can be used in many different environments, even if they are not Box2D ones. For instance, we can use them in the CartPole environment, which is a Classic Control one.
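As a sketch of that reuse, assuming the agent class from the repository is named Agent and takes the observation and action sizes as constructor arguments (hypothetical names), only the environment creation changes:

import gym

# CartPole: 4 observations (cart position/velocity, pole angle/velocity), 2 actions
env = gym.make("CartPole-v1")

# hypothetical class name and constructor signature; see the repository for the real ones
agent = Agent(n_obs=env.observation_space.shape[0],
              n_act=env.action_space.n)

frames_per_episode, agent = training(agent, env, number_of_episodes=750)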

The corresponding notebook is here, and the final agent, after 750 episodes of training, behaves as follows:

5) Conclusion

In previous articles, we explored different areas of computer vision using neural networks. Here, we have addressed a new area called Reinforcement Learning that uses neural networks to create agents capable of performing a task.

In particular, we learned how to apply Q-learning to a given problem. We also learned how to create an agent based on a neural network and train it to perform a task such as landing at a specific position or keeping a pole from falling. I hope this will help others who want to learn more about neural networks.


Here you can find my project:

https://github.com/Apiquet/Reinforcement_learning