Reinforcement Learning in Python, Part 1: Welcome to RL

Reinforcement Learning is a framework for tackling sequential decision problems: what to do next in order to maximize a reward (which might be delayed), on a changing universe (which might react to our actions).

Examples of this include:

  • Game playing: which actions are critical to win a game?
  • Learning in a “small world”: what actions maximize pleasure / minimize pain?
  • Treatment of chronical diseases: how to evolve the treatment for a disease that creates resistance?

The unifying theme on the problems above can be abstracted as follows:

  • An agent receives a signal from the environment, selected by Nature.
  • The agent executes an action.
  • Given the agents’ action, Nature assigns a reward and draws a new state, which is announced to the agent.
  • The situation repeats until a terminal criterion is reached.

An example: OpenAI Gym

We’ll use the OpenAI Gym environment for this.

It consists of a number of Atari environments that we can use for experimenting during this course. If you haven’t please install the library OpenAI gym (pip install gym).

from tqdm import tqdm
import gym

#create a single game instance
env = gym.make("FrozenLake-v0")

Here, S is the initial state, and your aim is to reach G, without falling into the holes, H. The squares marked with F are frozen, which means you can step on them.

Note: The environment is non-deterministic, you can slip in the ice and end up in a different state.

How to use the environment?

  • reset() returns the initial state / first observation.
  • render() returns the current environment state.
  • step(a) returns what happens after action a:
    • new observation: the new state.
    • reward: the reward corresponding to that action in that state.
    • is done: binary flag, True if the game is over.
    • info: Some auxiliary stuff, which we can ignore now.
 The code below is self-explanatory. If you run the three blocks within print statements in a Jupyter notebook it will make a lot of sense 🙂
print("The initial state: ", env.reset())
print(" and it looks like: ")

The initial state:  0 and it looks like: 


print("Now let's take an action: ")
new_state, reward, done, _ = env.step(1)

idx_to_action = {
    0:"<", #left
    1:"v", #down
    2:">", #right
    3:"^" #up

Now let's take an action: 


Defining a policy
  • The environment has a 4×4 grid of states.
  • For each state there are 4 possible actions.

A policy is a function from states to actions. It tells us what we should do on each state. In this case, any array of size 16×4 is a (deterministic) policy.

We can implement policies as dictionaries.

import numpy as np
n_states = env.observation_space.n
n_actions = env.action_space.n

# Initialize random_policy:
def init_random_policy():
    random_policy  = {}
    for state in range(n_states):
        random_policy[state] = np.random.choice(n_actions)
    return random_policy

We need now to define a function to evaluate our policy.

def evaluate(env, policy, max_episodes=100): 
    tot_reward = 0
    for ep in range(max_episodes):
        state = env.reset()
        done = False
        ep_reward = 0
        # Reward per episode
        while not done:
            action = policy[state]
            new_state, reward, done, _ = env.step(action)
            ep_reward += reward
            state = new_state
            if done:
                tot_reward += ep_reward
    return tot_reward/(max_episodes)

Looking for the best policy: Random search

As a very first example, let’s try to find our policy by pure random search: we will play for some time and keep track of the best actions we can do on each state.

FrozenLake-v0 defines “solving” as getting average reward of 0.78 over 100 consecutive trials.

best_policy = None
best_score = -float('inf')

# Random search
for i in range(1,10000): #tip: you can use tqdm(range(1,10000)) for a progress bar
    policy = init_random_policy()
    score = evaluate(env,policy,100)
    if score > best_score:
        best_score = score
        best_policy = policy
    if i%5000 == 0:
        print("Best score:", best_score)
print("Best policy:")

running the code above we get:

Best score: 0.45
Best policy:
{0: 2, 1: 3, 2: 0, 3: 1, 4: 0, 5: 3, 6: 0, 7: 3, 8: 3, 9: 1, 10: 0, 11: 1, 12: 3, 13: 2, 14: 2, 15: 1}
# Let's see the policy in action
def play(env,policy, render=False):
    s = env.reset()
    d = False
    while not d:
        a = policy[s]
        print("State: ",s)
        print("Action: ",idx_to_action[a])
        s, r, d, _ = env.step(a)
        if render:
        if d:

Let’s create a small function to print a nicer policy:

def print_policy(policy):
    arrows = [idx_to_action[policy[i]] 
               if lake[i] in 'SF' else '*' for i in range(n_states)]
    for i in range(0,16,4):


We can call then use the functions above to render the optimal policy. Note that the policy might not give the optimal action: recall that there is some noise involved (you can slip on the ice) which is responsible of a non-optimal action looking like optimal.



Using a different policy

Let’s try some different implementation of a random policy, which will be more useful later on.

# theta = 0.25*np.ones((n_states,n_actions))
def random_parameter_policy(theta):
    theta = theta/np.sum(theta, axis=1, keepdims=True) # ensure that the array is normalized
    policy  = {}
    probs = {}
    for state in range(n_states):
        probs[state] = np.array(theta[state,:])
        policy[state] = np.random.choice(n_actions, p = probs[state])
    return policy


best_policy = None
best_score = -float('inf')
alpha = 1e-2
# Random search
for i in range(1,10000): 
    theta = 0.25*np.ones((n_states,n_actions))
    policy = random_parameter_policy(theta)
    score = evaluate(env,policy,100)
    if score > best_score:
        best_score = score
        best_policy = policy
    theta = theta + alpha*(score-best_score)*np.ones((n_states,n_actions))
    if i%5000 == 0:
        print("Best score:", best_score)
print("Best policy:")
print("Best score:", best_score)

the outcome is:

Best score: 0.39
Best policy:
{0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1, 6: 0, 7: 2, 8: 3, 9: 1, 10: 0, 11: 1, 12: 1, 13: 2, 14: 3, 15: 1}
Best score: 0.53


What’s the advantage of this? Perhaps not much right now, but this is the first step to use more sophisticated techniques over random search. Note that we do a “gradient update” of sorts when we change the parameter theta in the direction of increase of the best score. This hints that we could use other update rules, perhaps taking the output as a more sophisticated input of the game history.

Another thing to notice is that we made effectively our policy stochastic: at every stage the agent has the possibility of choosing randomly his action. This has the effect of smoothing out the problem: we are now solving an optimization problem on a continuous, instead of a discrete space.

Your turn:

  • Beat the hill climbing / random search benchmark! Implement a different search method for the parameters.
  • Try the CartPole environment. In CartPole, the state is continuous (4 different parameters), so you need to do something on the lines of the parameterized random search example. Look at this blog post  for inspiration.


Do you want to learn reinforcement learning on a live tutorial? Join our August 23-24 session!

Multi-armed bandits, part 1

The $$k$$ armed bandit problem can be phrased as follows:

  • There are $$k$$ possible actions, $$\{a_1, a_2, \ldots , a_k\}.$$ Actions are sometimes called arms (hence $$k-$$ armed).
  • At time $$t = 1, 2, \ldots ,$$ an action is chosen and a reward $$R_t$$
    is allocated.
  • Each action gives an unknown reward, sampled from an unknown
    probability distribution.
  • The reward depends only on the action.
  • Goal: Maximize total reward.

These are a simpler class of reinforcement learning problems on which there is no influence from the state. Our actions have a noisy effect in the reward and our goal is to learn from these noisy signals. However, the effect of our actions on the reward is not influenced by any other externalities.

Bandit algorithms are useful in a number of applications:

  • Personalized content/news.
  • Online advertising (A/B testing and similar). You have probably heard of A/B testing. A/B testing is a bandit algorithm with a pure exploration phase, followed by pure exploitation. Don’t worry, this will be clear soon.
    Clinical trials (experimental treatments for patients while
    minimizing losses).
  • Portfolio Optimization.

Solving bandit algorithms

Let’s figure out how to deal with bandits before moving on to harder reinforcement learning problems.

As we said for every action $$a_i$$, the reward might be still random, but it only
depends on the action.

$$ q_*(a_i) := \mathbb E \left[ R_t \ | \ A_t = a_i \right], \ \ i = 1, 2, \ldots ,k.$$

To maximize rewards, we create some estimates of the $$q_*$$ function
above, $$Q_t(a_i) := q_*(a), \ i = 1, 2, \ldots k$$ and use those estimates to choose the best actions. The $$q_*$$ function is the best possible return we could get from each action, whereas $$Q_t$$ represents our current estimate of the $$q_*$$ function based on what we just learned from the interaction.

One way to learn about $$q_*$$ is to keep track of the average rewards.

$$ Q_t(a_i) := \frac{\text{total reward of } a_i \text{ before } t}{\text{times } a_i \text{ was played time } t} = \frac{ \sum_{j=1}^{t-1}R_j\cdot 1_{A_j=a_i}}{\sum_{j=1}^{t-1} 1_{A_j=a_i}}.$$

By the law of large numbers, if the action is taken infinitely many times,

$$ \lim_{N_t(a_i) \to +\infty} Q_t(a_i) = q_*(a_i).$$

Exploration vs Exploitation

A dilemma that arises in many decision problems is the trade-off between exploration and exploitation. This simply means that if we spend a lot of time trying things randomly, we are risking our reward in the long term, out of low commitment to well-performing actions. Conversely, sticking to something too soon might fire back: imagine we try 2 out of 10 different arms of our bandit, and we just stick to the best of those two. We might be ignoring a much better arm if we don’t get adventurous and try it! When we just commit to such action (i.e. the best action seen so far) we say we use a greedy policy.

  • The greedy policy at time $$t$$ is choosing actions $$A_t^*$$ such
    $$A_t^* \ \in \ \mathrm{argmax}_{i = 1, 2, \ldots ,k} Q_t(a_i).$$
  • If at time $$t$$ we choose action $$A_t$$ such that $$A_t = A_t^*$$ is exploiting the action, whereas
    $$A_t \neq A_t^*$$ means exploring.

One way to deal with the exploration-exploitation trade-off is to be $$\epsilon$$-greedy

  • Create an action-value table (a Python dictionary) storing the
    history of the payoffs obtained by each action.
  • Calculate the average payoff of each action.
  • Choose with probability $$\epsilon$$ a non-optimal action.
  • Choose with probability $$1-\epsilon$$ one of the greedy actions
    (those with highest average payoff).
  • Break ties randomly.

The choice of $$\epsilon$$ is important, and we can observe different types of behaviour.

The figures below illustrate the rewards an agent gets for choosing different arms with different reward probabilities (shown in the third image from the right). The reward is 1 when you get it. So we see that the very conservative player (with small $$\epsilon$$ get stuck with a single action from the beginning.

On the opposite behaviour, with a value of $$\epsilon=1$$, we see a player that does a bit better and explores a wider range of actions (in the second picture). However, the mean reward is barely around the average. This is because he is just playing randomly.

Finally, we have a more balanced agent, which gets a much higher reward in average because is able to identify the correct actions to use with time.

Interested? Check out Part 2! Or the book in Leanpub.

Multi-armed bandits, part 2


There are two important parts for the implementation: on one hand, we have to implement an environment that simulates the reward of the arms. The skeleton of this class is given below:

class Arm(object):
    def __init__(self, params):
        ## passes the required parameters
        ## this could be the reward probability, or other parameter (mean, sigma) of a lottery
        ## that gives the rewards.

    def step(self):
        ## gives a reward

From the point of view of the agent, we need to implement the decision making mechanism. The skeleton of this class would be:

class Bandit(object):
    def __init__(self, *args, **kwargs):

    def reset(self):

    def choose_action(self):
        ## Chooses a pure action, i.e. an arm

    def update(self, action, reward):
        #### Updates the values of N and Q

As for the $$\epsilon$$ greedy part:

import random
import numpy as np

def argmax(array):
    return array.index(max(array))

class EpsilonGreedy:
    def __init__(self, epsilon, n_actions):
        self.Q = np.zeros(n_actions)
        self.N = np.zeros(n_actions)
        self.epsilon = epsilon     
        self.n_actions = n_actions

    def choose_action(self):
        if random.random()>self.epsilon:
            a = argmax(self.Q)
            a = random.randint(0,self.n_actions-1)
        return a
    def update(self, action, reward):
        n = self.N[action]
        self.Q[action] = self.Q[action]+1/(self.n_actions)*(reward-self.Q[action])

In reality, we can also change the value of $$\epsilon$$ at every iteration, that is $$\epsilon = \epsilon_t$$, to reduce the exploration rate. As the value of $$\epsilon$$ goes to zero, we would approach the greedy policy. Under a good $$\epsilon$$ reduction schedule, we would stop the exploration after identifying the good actions. As examples of cooling schedules we have $$\epsilon_t = 1/t$$ or $$\epsilon_t = \epsilon^t$$.

Other action selection methods

Besides $$\epsilon$$-greedy, other action selection methods can be used. We will briefly describe some of them.

Softmax action selection

It turns out that $$\epsilon$$-greedy has the (sometimes unpleasant) property of choosing a really bad action with the same probability as choosing some not-too-bad action. To avoid this, we can choose our action via a mixed strategy $$\pi_t$$ (that is, randomizing), using the softmax function:

$$\pi_t(a_i):= \frac{e^{Q_t(a_i)/\tau}}{\sum^k_{m=1}e^{Q_t(a_m)/\tau}}.$$

Here $$\tau$$ is a parameter, called the temperature. In the limit, when $$\tau \to 0$$, softmax action selection behaves like $$\epsilon$$-greedy.

For both softmax and $$\epsilon$$-greedy action selection methods, we can decrease the parameters to zero as the number of steps of the episode decreases. This process is called annealing and it causes the bandit
algorithm to explore less over time.

Upper Confidence Bound action selection

In this method, we want to be sure of the performance of an action before choosing it. Formally, regret-based estimates are used to give us upper bounds on the true action values. The action selection mechanism is:

$$A_t := \mathrm{argmax}_{i = 1, \ldots ,k} \left[ Q_t(a_i) + 2 \cdot C\sqrt{\frac{\log t}{N_t(a_i)}} \right].$$

where $$C$$ is an upper bound for the reward. This gives us a way to control randomness, but it comes at a price. UCB suffers from back-pedaling, which roughly means “I know this is bad, but I’m not completely sure about how bad it is”. So you end up choosing bad actions because you are not completely sure of their true value. It also happens that UCB algorithms may not become strictly greedy.

Reinforcement Comparison

Last but not least, a method closely related to actor-critic algorithms in reinforcement learning. The motivation behind is simple: How much reward is a good reward? Once we decide on this, we can choose our actions accordingly. The algorithm works as follows:

  • Set $$p_t(a_i)$$ a learned preference for taking action $$a_i$$.
  • The policy
    $$\pi_t(a_i):= \frac{e^{p_t(a_i)}}{\sum^k_{m=1}e^{p_t(a_m)}}$$ is a
    mixed strategy.
  • The preference is updated as:
    $$p_t(a_i) = p_{t-1}(a) + \alpha(R_{t-1}-\bar{R}_{t-1})(1_{A_{t-1}=a_i}-\pi_{t-1}(a_i))$$


$$\bar{R}_{t-1} := \frac{1}{t-1}\sum_{j=1}^{t-1}R_j.$$ Note that this reduces to softmax selection for a particular choice of the $$p$$ function.

Diagnosing a bandit algorithm

There are different metrics to see how a bandit algorithm is performing.

  • Method 1: Measure how often is the best arm chosen
    • That sounds ok, but what if the arms have similar rewards? This metric does not really make sense.
  • Method 2: Average reward at each point in time (is it really
    getting better?)

    • However, this would yield bad average reward for algorithms that explore a lot. So we should be aware of this when we tweak the hyperparameters of our algorith ($$\epsilon$$ in this case).
  • Method 3: Cumulative rewards.
    • This is a “Big-picture” metric: we can see if the cost of exploration was worth.


Let’s illustrate a couple of application domains.

Clinical trials

In clinical trials, the goal is to evaluate $$k$$ possible treatments for a disease. Patients are allocated into $$k$$ groups and the reward is 0 or 1, depending on success of the treatment. This is a typical example where one would rather use softmax than $$\epsilon$$-greedy, because it’s not the same to choose a random treatment than a second most likely to work. Also, annealing should come in, because in later stages of the trial, a greater fraction of the subjects should be assigned to better treatments.

Internet advertising

Each time you visit a website the publisher must choose to display one of $$k$$ ads, and the reward for the publisher is 0/1 (no click/click). However, not all users are alike, and they are not alike to themselves in time. Ads should be in context of users’ activity. For this, a different class of algorithms, called contextual bandits is used. A contextual bandit receives an external signal to know in which context you are, and find, for each context, the best actions for that context. This is almost a reinforcement learning problem, except that the context has no influence in the dynamics to the next state.

A major issue in real life deployments is that reward functions are messy and ill-specified. An amusing example of this can be seen in, where an agent learns to maximize its reward function as the score on a videogame. However, the agent learns the periodicity in which some bonus-giving tokens are regenerated on screen, and turns over and over in circles instead of finishing the episode.


Interested? Check out Part 1! Or the book in Leanpub.