Reinforcement Learning in Python, Part 1: Welcome to RL

Reinforcement Learning is a framework for tackling sequential decision problems: deciding what to do next in order to maximize a reward (which might be delayed), in a changing environment (which might react to our actions).

Examples of this include:

  • Game playing: which actions are critical to win a game?
  • Learning in a “small world”: which actions maximize pleasure / minimize pain?
  • Treatment of chronic diseases: how should the treatment evolve for a disease that develops resistance?

The unifying theme of the problems above can be abstracted as follows:

  • An agent receives a signal from the environment, selected by Nature.
  • The agent executes an action.
  • Given the agent’s action, Nature assigns a reward and draws a new state, which is announced to the agent.
  • The situation repeats until a termination criterion is reached.

An example: OpenAI Gym

We’ll use the OpenAI Gym environment for this.

It provides a collection of environments (Atari games, classic control tasks, and toy-text problems like the one we will use) for experimenting during this course. If you haven’t already, install the library (pip install gym).

from tqdm import tqdm
import gym

#create a single game instance
env = gym.make("FrozenLake-v0")

Here, S is the initial state, and your aim is to reach G, without falling into the holes, H. The squares marked with F are frozen, which means you can step on them.

Note: the environment is non-deterministic; you can slip on the ice and end up in a different state than intended.

How to use the environment?

  • reset() returns the initial state / first observation.
  • render() returns the current environment state.
  • step(a) returns what happens after action a:
    • new observation: the new state.
    • reward: the reward corresponding to that action in that state.
    • is done: binary flag, True if the game is over.
    • info: Some auxiliary stuff, which we can ignore now.
The code below is self-explanatory. If you run the three blocks with print statements in a Jupyter notebook, it will make a lot of sense.
print("The initial state: ", env.reset())
print(" and it looks like: ")
env.render()

The initial state:  0 and it looks like: 

SFFF
FHFH
FFFH
HFFG

print("Now let's take an action: ")
new_state, reward, done, _ = env.step(1)
env.render()


idx_to_action = {
    0:"<", #left
    1:"v", #down
    2:">", #right
    3:"^" #up
}

Now let's take an action: 
  (Down)
SFFF
FHFH
FFFH
HFFG

 

Defining a policy
  • The environment has a 4×4 grid of states.
  • For each state there are 4 possible actions.

A policy is a function from states to actions: it tells us what to do in each state. In this case, any assignment of one of the 4 actions to each of the 16 states is a (deterministic) policy.

We can implement policies as dictionaries.

import numpy as np
n_states = env.observation_space.n
n_actions = env.action_space.n

# Initialize random_policy:
def init_random_policy():
    random_policy  = {}
    for state in range(n_states):
        random_policy[state] = np.random.choice(n_actions)
    return random_policy

We now need to define a function to evaluate our policy.

def evaluate(env, policy, max_episodes=100): 
    tot_reward = 0
    for ep in range(max_episodes):
        state = env.reset()
        done = False
        ep_reward = 0
        
        # Reward per episode
        while not done:
            action = policy[state]
            new_state, reward, done, _ = env.step(action)
            ep_reward += reward
            state = new_state
            if done:
                tot_reward += ep_reward
    return tot_reward/(max_episodes)

Looking for the best policy: Random search

As a very first example, let’s try to find our policy by pure random search: we will generate random policies for a while and keep the best one we find.

FrozenLake-v0 defines “solving” as getting average reward of 0.78 over 100 consecutive trials.

best_policy = None
best_score = -float('inf')

# Random search
for i in range(1,10000): #tip: you can use tqdm(range(1,10000)) for a progress bar
    policy = init_random_policy()
    score = evaluate(env,policy,100)
    if score > best_score:
        best_score = score
        best_policy = policy
    if i%5000 == 0:
        print("Best score:", best_score)
print("Best policy:")
print(best_policy)

Running the code above, we get:

Best score: 0.45
Best policy:
{0: 2, 1: 3, 2: 0, 3: 1, 4: 0, 5: 3, 6: 0, 7: 3, 8: 3, 9: 1, 10: 0, 11: 1, 12: 3, 13: 2, 14: 2, 15: 1}
# Let's see the policy in action
def play(env,policy, render=False):
    s = env.reset()
    d = False
    while not d:
        a = policy[s]
        print("*"*10)
        print("State: ",s)
        print("Action: ",idx_to_action[a])
        s, r, d, _ = env.step(a)
        if render:
            env.render()
        if d:
            print(r)

Let’s create a small function to print a nicer policy:

def print_policy(policy):
    lake = "SFFFFHFHFFFHHFFG"
    arrows = [idx_to_action[policy[i]] 
               if lake[i] in 'SF' else '*' for i in range(n_states)]
    for i in range(0,16,4):
        print(''.join(arrows[i:i+4]))

 

We can then use the functions above to display the best policy found. Note that the policy might not give the optimal action in every state: recall that there is some noise involved (you can slip on the ice), which can make a non-optimal action look optimal.

print_policy(best_policy)
play(env,best_policy)

 

Using a different policy

Let’s try a different implementation of a random policy, which will be more useful later on.

# theta = 0.25*np.ones((n_states,n_actions))
def random_parameter_policy(theta):
    theta = theta/np.sum(theta, axis=1, keepdims=True) # ensure that the array is normalized
    policy  = {}
    probs = {}
    for state in range(n_states):
        probs[state] = np.array(theta[state,:])
        policy[state] = np.random.choice(n_actions, p = probs[state])
    return policy

 

best_policy = None
best_score = -float('inf')
alpha = 1e-2
# Random search
for i in range(1,10000): 
    theta = 0.25*np.ones((n_states,n_actions))
    policy = random_parameter_policy(theta)
    score = evaluate(env,policy,100)
    if score > best_score:
        best_score = score
        best_policy = policy
    theta = theta + alpha*(score-best_score)*np.ones((n_states,n_actions))
    if i%5000 == 0:
        print("Best score:", best_score)
print("Best policy:")
print(best_policy)
print("Best score:", best_score)

The outcome is:

Best score: 0.39
Best policy:
{0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1, 6: 0, 7: 2, 8: 3, 9: 1, 10: 0, 11: 1, 12: 1, 13: 2, 14: 3, 15: 1}
Best score: 0.53

 


What’s the advantage of this? Perhaps not much right now, but it is the first step towards more sophisticated techniques than random search. Note that we do a “gradient update” of sorts when we change the parameter theta in the direction that increases the best score. This hints that we could use other update rules, perhaps computed as a more sophisticated function of the game history.

Another thing to notice is that we effectively made our policy stochastic: at every stage the agent can choose its action randomly. This has the effect of smoothing out the problem: we are now solving an optimization problem over a continuous space instead of a discrete one.

Your turn:

  • Beat the hill climbing / random search benchmark! Implement a different search method for the parameters.
  • Try the CartPole environment. In CartPole, the state is continuous (4 different parameters), so you need to do something along the lines of the parameterized random search example (see the sketch below). Look at this blog post for inspiration.
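
For the CartPole exercise, here is a minimal sketch (an assumption on my part, not covered in this post) of random search over a simple linear parameterization: push right if the weighted sum of the state is positive, left otherwise.

import numpy as np
import gym

env = gym.make("CartPole-v0")

def run_episode(env, w):
    # Play one episode with a linear policy: action 1 (right) if w . state > 0, else 0 (left)
    state = env.reset()
    total_reward, done = 0, False
    while not done:
        action = 1 if np.dot(w, state) > 0 else 0
        state, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward

best_w, best_reward = None, -float("inf")
for _ in range(1000):
    w = np.random.uniform(-1, 1, size=4)  # CartPole has 4 continuous state variables
    episode_reward = run_episode(env, w)
    if episode_reward > best_reward:
        best_reward, best_w = episode_reward, w
print("Best episode reward:", best_reward)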

 

Do you want to learn reinforcement learning on a live tutorial? Join our August 23-24 session!

How to Scrape a job listing from StackOverflow 

In this guest post, Michael Heydt, the author of the Python Web Scraping Cookbook, shows how to scrape a job listing from StackOverflow using Python.

It’s Easy to Scrape Stack Overflow

StackOverflow actually makes it quite easy to scrape data from its pages. In this article, you’ll be shown how to use content from a posting at https://stackoverflow.com/jobs/122517/spacex-enterprise-software-engineer-full-stack-spacex?so=p&sec=True&pg=1&offset=22&cl=Amazon%3b+. You’ll note that StackOverflow job listing pages are very structured, probably because they’re created by programmers and for programmers. The page looks like the following:

All of this information is codified within the HTML of the page. You can see this for yourself by analyzing the page content. But what’s great about StackOverflow is that it puts much of its page data in an embedded JSON object. This is placed within a <script type="application/ld+json"> HTML tag, so it’s really easy to find. The following shows a truncated section of this tag (the description is truncated, but all the tags are shown):

This makes it very easy to get hold of the content: you can simply retrieve the page, find this tag, and then convert the JSON into a Python object with the json library. In addition to the actual job description, much of the “metadata” of the job posting is also included, such as skills, industries, benefits, and location information. You don’t need to search the HTML for the information; just find this tag and load the JSON. Note that if you want to find specific items (such as Job Responsibilities), you’ll still need to parse the description. The description contains the full HTML, so you’ll have to deal with HTML tags while parsing it.

Start scraping a job listing from StackOverflow

It’s time to get the job description from this page; you simply have to retrieve the contents. Here’s a step by step walkthrough of the code:

  1. First, import the libraries you need and read the file:
import json
from bs4 import BeautifulSoup

with open("spacex-job-listing.txt", "r") as file:
    content = file.read()
  2. Then, load the content into a BeautifulSoup object, and retrieve the <script type="application/ld+json"> tag:
bs = BeautifulSoup(content, "lxml")
script_tag = bs.find("script", {"type": "application/ld+json"})
  3. Now that you have the tag, you can simply load its contents into a Python dictionary using the json library:
job_listing_contents = json.loads(script_tag.contents[0])
print(job_listing_contents)

The output of this looks like the following (this is truncated for brevity):

{'@context': 'http://schema.org', '@type': 'JobPosting', 'title': 'SpaceX Enterprise Software Engineer, Full 
Stack', 'skills': ['c#', 'sql', 'javascript', 'asp.net', 'angularjs'], 'description': '<h2>About this 
job</h2>\r\n<p><span>Location options: <strong>Paid relocation</strong></span><br/>
<span>Job type: <strong>Permanent</strong></span><br/><span>Experience level: <strong>Mid-
Level, Senior</strong></span><br/><span>Role: <strong>Full Stack Developer</strong></span>
<br/><span>Industry: <strong>Aerospace, Information Technology, Web Development</strong>
  4. Now you can do a few simple things with this data without any HTML parsing. For example, you can retrieve the skills required for the job with just a few lines of code:
# print the skills
for skill in job_listing_contents["skills"]:
    print(skill)

The above produces the following output:

c#
sql
javascript
asp.net
angularjs

Reading and cleaning the description in the job listing

If you take a close look, you’ll see that the description of the job listing is still in HTML.  In order to extract valuable content out of this data, you’ll need to parse this HTML and perform several other steps:

  1. Start by creating a BeautifulSoup object from the description key of the job listing you loaded. Print it to see what it looks like:
desc_bs = BeautifulSoup(job_listing_contents["description"], "lxml")
print(desc_bs)

 

<p><span>Location options: <strong>Paid relocation</strong></span><br/><span>Job type: <strong>Permanent</strong></span><br/><span>Experience level: <strong>Mid-Level, Senior</strong></span><br/><span>Role: <strong>Full Stack Developer</strong></span><br/><span>Industry: <strong>Aerospace, Information Technology, Web Development</strong></span><br/><span>Company size: <strong>1k-5k people</strong></span><br/><span>Company type: <strong>Private</strong></span><br/></p><br/><br/><h2>Technologies</h2> <p>c#, sql, javascript, asp.net, angularjs</p> <br/><br/><h2>Job description</h2> <p><strong>Full Stack Enterprise Software Engineer</strong></p>
<p>The EIS (Enterprise Information Systems) team writes the software that builds rockets and powers SpaceX. We are responsible for all of the software on the factory floor, the warehouses, the financial systems, the restaurant, and even the public home page. Elon has called us the "nervous system" of SpaceX because we connect all of the other teams at SpaceX to ensure that the entire rocket building process runs smoothly.</p>
<p><strong>Responsibilities:</strong></p>
<ul>
<li>We are seeking developers with demonstrable experience in: ASP.NET, C#, SQL Server, and AngularJS. We are a fast-paced, highly iterative team that has to adapt quickly as our factory grows. We need people who are comfortable tackling new problems, innovating solutions, and interacting with every facet of the company on a daily basis. Creative, motivated, able to take responsibility and support the applications you create. Help us get rockets out the door faster!</li>
</ul>
<p><strong>Basic Qualifications:</strong></p>
<ul>
<li>Bachelor's degree in computer science, engineering, physics, mathematics, or similar technical discipline.</li>
<li>3+ years of experience developing across a full-stack:  Web server, relational database, and client-side (HTML/Javascript/CSS).</li>
</ul>
<p><strong>Preferred Skills and Experience:</strong></p>
<ul>
<li>Database - Understanding of SQL. Ability to write performant SQL. Ability to diagnose queries, and work with DBAs.</li>
<li>Server - Knowledge of how web servers operate on a low-level. Web protocols. Designing APIs. How to scale web sites. Increase performance and diagnose problems.</li>
<li>UI - Demonstrated ability creating rich web interfaces using a modern client side framework. Good judgment in UX/UI design.  Understands the finer points of HTML, CSS, and Javascript - know which tools to use when and why.</li>
<li>System architecture - Knowledge of how to structure a database, web site, and rich client side application from scratch.</li>
<li>Quality - Demonstrated usage of different testing patterns, continuous integration processes, build deployment systems. Continuous monitoring.</li>
<li>Current - Up to date with current trends, patterns, goings on in the world of web development as it changes rapidly. Strong knowledge of computer science fundamentals and applying them in the real-world.</li>
</ul> <br/><br/></body></html>
  2. You need to go through this and remove all the HTML tags so that you are left with only the text of the description. This is what you’ll then tokenize. Fortunately, throwing out the HTML tags is easy with BeautifulSoup:
just_text = desc_bs.find_all(text=True)
print(just_text)

 

['About this job', '\n', 'Location options: ', 'Paid relocation', 'Job type: ', 'Permanent', 'Experience level: ', 'Mid-Level, Senior', 'Role: ', 'Full Stack Developer', 'Industry: ', 'Aerospace, Information Technology, Web Development', 'Company size: ', '1k-5k people', 'Company type: ', 'Private', 'Technologies', ' ', 'c#, sql, javascript, asp.net, angularjs', ' ', 'Job description', ' ', 'Full Stack Enterprise\xa0Software Engineer', '\n', 'The EIS (Enterprise Information Systems) team writes the software that builds rockets and powers SpaceX. We are responsible for all of the software on the factory floor, the warehouses, the financial systems, the restaurant, and even the public home page. Elon has called us the "nervous system" of SpaceX because we connect all of the other teams at SpaceX to ensure that the entire rocket building process runs smoothly.', '\n', 'Responsibilities:', '\n', '\n', 'We are seeking developers with demonstrable experience in: ASP.NET, C#, SQL Server, and AngularJS. We are a fast-paced, highly iterative team that has to adapt quickly as our factory grows. We need people who are comfortable tackling new problems, innovating solutions, and interacting with every facet of the company on a daily basis. Creative, motivated, able to take responsibility and support the applications you create. Help us get rockets out the door faster!', '\n', '\n', 'Basic Qualifications:', '\n', '\n', "Bachelor's degree in computer science, engineering, physics, mathematics, or similar technical discipline.", '\n', '3+ years of experience developing across a full-stack:\xa0 Web server, relational database, and client-side (HTML/Javascript/CSS).', '\n', '\n', 'Preferred Skills and Experience:', '\n', '\n', 'Database - Understanding of SQL. Ability to write performant SQL. Ability to diagnose queries, and work with DBAs.', '\n', 'Server - Knowledge of how web servers operate on a low-level. Web protocols. Designing APIs. How to scale web sites. Increase performance and diagnose problems.', '\n', 'UI - Demonstrated ability creating rich web interfaces using a modern client side framework. Good judgment in UX/UI design.\xa0 Understands the finer points of HTML, CSS, and Javascript - know which tools to use when and why.', '\n', 'System architecture - Knowledge of how to structure a database, web site, and rich client side application from scratch.', '\n', 'Quality - Demonstrated usage of different testing patterns, continuous integration processes, build deployment systems. Continuous monitoring.', '\n', 'Current - Up to date with current trends, patterns, goings on in the world of web development as it changes rapidly. Strong knowledge of computer science fundamentals and applying them in the real-world.', '\n', ' ']

Superb; this is already broken down into what can be considered sentences!

  3. Join all these together, tokenize them, get rid of stop words, and then apply common tech-job 2-grams:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

joined = ' '.join(just_text)
tokens = word_tokenize(joined)

stop_list = stopwords.words('english')
with_no_stops = [word for word in tokens if word not in stop_list]
# two_grammed comes from applying the book's tech-job 2-gram helper to with_no_stops (see note below)
cleaned = remove_punctuation(two_grammed)
print(cleaned)

This yields the following output:

 ['job', 'Location', 'options', 'Paid relocation', 'Job', 'type', 'Permanent', 'Experience', 'level', 'Mid-Level', 'Senior', 'Role', 'Full-Stack', 'Developer', 'Industry', 'Aerospace', 'Information Technology', 'Web Development', 'Company', 'size', '1k-5k', 'people', 'Company', 'type', 'Private', 'Technologies', 'c#', 'sql', 'javascript', 'asp.net', 'angularjs', 'Job', 'description', 'Full-Stack', 'Enterprise Software', 'Engineer', 'EIS', 'Enterprise', 'Information', 'Systems', 'team', 'writes', 'software', 'builds', 'rockets', 'powers', 'SpaceX', 'responsible', 'software', 'factory', 'floor', 'warehouses', 'financial', 'systems', 'restaurant', 'even', 'public', 'home', 'page', 'Elon', 'called', 'us', 'nervous', 'system', 'SpaceX', 'connect', 'teams', 'SpaceX', 'ensure', 'entire', 'rocket', 'building', 'process', 'runs', 'smoothly', 'Responsibilities', 'seeking', 'developers', 'demonstrable experience', 'ASP.NET', 'C#', 'SQL Server', 'AngularJS', 'fast-paced', 'highly iterative', 'team', 'adapt quickly', 'factory', 'grows', 'need', 'people', 'comfortable', 'tackling', 'new', 'problems', 'innovating', 'solutions', 'interacting', 'every', 'facet', 'company', 'daily', 'basis', 'Creative', 'motivated', 'able', 'take', 'responsibility', 'support', 'applications', 'create', 'Help', 'us', 'get', 'rockets', 'door', 'faster', 'Basic', 'Qualifications', 'Bachelor', "'s", 'degree', 'computer science', 'engineering', 'physics', 'mathematics', 'similar', 'technical', 'discipline', '3+', 'years', 'experience', 'developing', 'across', 'full-stack', 'Web server', 'relational database', 'client-side', 'HTML/Javascript/CSS', 'Preferred', 'Skills', 'Experience', 'Database', 'Understanding', 'SQL', 'Ability', 'write', 'performant', 'SQL', 'Ability', 'diagnose', 'queries', 'work', 'DBAs', 'Server', 'Knowledge', 'web', 'servers', 'operate', 'low-level', 'Web', 'protocols', 'Designing', 'APIs', 'scale', 'web', 'sites', 'Increase', 'performance', 'diagnose', 'problems', 'UI', 'Demonstrated', 'ability', 'creating', 'rich', 'web', 'interfaces', 'using', 'modern', 'client-side', 'framework', 'Good', 'judgment', 'UX/UI', 'design', 'Understands', 'finer', 'points', 'HTML', 'CSS', 'Javascript', 'know', 'tools', 'use', 'System', 'architecture', 'Knowledge', 'structure', 'database', 'web', 'site', 'rich', 'client-side', 'application', 'scratch', 'Quality', 'Demonstrated', 'usage', 'different', 'testing', 'patterns', 'continuous integration', 'processes', 'build', 'deployment', 'systems', 'Continuous monitoring', 'Current', 'date', 'current trends', 'patterns', 'goings', 'world', 'web development', 'changes', 'rapidly', 'Strong', 'knowledge', 'computer science', 'fundamentals', 'applying', 'real-world']

You now have a very nice and refined set of keywords pulled out of the job listing.
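
The remove_punctuation helper and the tech-job 2-gram step that produces two_grammed come from earlier recipes in the book and are not shown in this excerpt. As a rough, hypothetical sketch of what such helpers could look like (not the book’s actual implementations):

import string

def tech_2grams(tokens, known_2grams=(("sql", "server"), ("continuous", "integration"))):
    # Merge adjacent token pairs that form a known tech 2-gram into a single token
    merged, i = [], 0
    while i < len(tokens):
        pair = (tokens[i].lower(), tokens[i + 1].lower()) if i + 1 < len(tokens) else None
        if pair in known_2grams:
            merged.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def remove_punctuation(tokens):
    # Drop tokens made up only of punctuation characters
    return [t for t in tokens if not all(ch in string.punctuation for ch in t)]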

If you enjoyed this article and want to learn more about several other web scraping concepts, you can always refer to this book, Python Web Scraping Cookbook. With over 90 proven recipes to get you scraping with Python, microservices, Docker, and AWS, the book is a must-have for Python programmers, web administrators, security professionals, and anyone interested in performing web analytics.

3 tricks for efficient data science

It’s amazing how little things can turn a data science / mathematical modelling project into a full-fledged mess. There are traps that are easily avoidable, but unfortunately also easily forgotten: we love to pretend we “couldn’t do something that silly”.

Following the principles below has saved me (and others) a lot of time and spared me from unnecessary pain:

Always check your data

No one has ever been arrested for excess logging. If you don’t check the input of your models and the outputs of calculations at the intermediate steps, you are in for disappointing results.

There is a promise of a greener future when deep learning releases us from all the mess of pre-processing, so we can ingest raw data in any format and the intermediate layers will sort it out. Sorry, but here in real life we are still far from that. Garbage in, garbage out.
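
As a minimal sketch (with hypothetical column names), a few assertions and log lines at the boundaries of a preprocessing step already catch a lot:

import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Check the inputs...
    assert not df.empty, "Input dataframe is empty"
    assert (df["price"].dropna() >= 0).all(), "Negative prices in input"  # 'price' is a hypothetical column
    logging.info("Input: %d rows, %d null prices", len(df), df["price"].isna().sum())

    out = df.dropna(subset=["price"])

    # ...and the intermediate outputs.
    logging.info("After dropping nulls: %d rows", len(out))
    assert len(out) > 0, "All rows were dropped during preprocessing"
    return out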

Test your model on limit cases

This is somewhat related to the previous point, but not always. What happens to your model in the limit? Many scientists ask this kind of question as a basic sanity check. You should do the same: does your model predict what the domain expert / common sense expects when you make one variable very large or very small? If not, there is something wrong. Do you have a positive regression coefficient for monthly totals, but a negative one for average totals? Time to revisit your model.

A common source of errors here is the scale: having two variables on wildly different scales often produces odd, counter-intuitive models.

Fail fast

A common error, especially in highly technical teams, is to over-engineer prematurely. The fear behind it is justified: if the data science experiments become too messy, it will be too complicated for the engineers to implement them when it’s time to scale. But this very rarely justifies discarding a working Python prototype to rewrite everything from scratch in C.

Another common error is to try the “tool of the day” and build the solution on top of a poorly supported technology. There are thousands of poorly maintained and documented open source projects.

Don’t reinvent the wheel. Bet on tested technologies to iterate quickly and, ultimately, bring value to your final user, whether a consulting client or someone else within your organization. Once the project is on its way and there is trust on the final user’s side, there will be plenty of time to explore alternatives.

 

Cross-entropy method in R

The cross-entropy method is a general approach to optimization that relies on two simple ideas. In the context of finding the maximum of a scalar-valued function, this means:

  1. Generate some random parameters and evaluate the function there.
  2. Use the best values of the parameter to generate new candidates.

One simple way of generating such random parameters is to fit a normal distribution at every iteration: we choose a subset of “elite parameters” (a fraction of those we tried), calculate their mean and covariance, and use them to generate a new population of parameters.

This kind of optimization method is quite useful, for instance, in reinforcement learning.

For instance, suppose we want to maximize the function:

# the function we want to optimize
f <-  function(theta){
  reward = -sum((solution - theta)**2)
  return(reward)
}

solution <- c(0.5, 0.1, -0.3)

Then the cross-entropy algorithm in this case is:

library(MASS)

cem <- function(f, n_iter, theta_mean, theta_std, batch_size=25, elite_frac=0.2){
  # dim_theta (the dimension of the parameter vector) is taken from the calling environment
  for(it in 1:n_iter){
    # Sample a batch of parameter vectors from the current Gaussian
    thetas <- mvrnorm(n = batch_size, mu = theta_mean, Sigma = theta_std)
    rewards <- apply(thetas, 1, f)
    # Keep the elite fraction (the best-performing parameters)
    n_elite <- as.integer(batch_size * elite_frac)
    elite_inds <- sort(rewards, decreasing = TRUE, index.return = TRUE)$ix[1:n_elite]
    elite_thetas <- thetas[elite_inds, ]
    # Refit the Gaussian to the elite set (with a small regularization of the covariance)
    theta_mean <- apply(elite_thetas, 2, mean)
    theta_std <- 0.01*diag(dim_theta) + 0.99*cov(elite_thetas)
  }
  return(theta_mean)
}


We set up the parameters and call it like this:

dim_theta <- length(solution)
theta_mean <- matrix(0, dim_theta, 1)
theta_std <- diag(dim_theta)
batch_size <- 25   # number of samples per batch
elite_frac <- 0.2  # fraction of samples used as elite set

cem(f, 300, theta_mean=theta_mean, theta_std=theta_std,
    batch_size=batch_size, elite_frac=elite_frac)


Regret minimization in Python

Since my PhD (in game theory, but heavy on mathematical analysis and partial differential equations), I have been interested in nice, simple algorithms that beat fancy theoretical constructs. One of those happens to be regret minimization.

Suppose we want to play rock, paper, scissors for money. Each of us puts 100 EUR on the table, we play, and the winner takes the money. Let’s further assume that you play rock and I play paper (so I win) and thus take your 100 EUR from the table. In this case, your regret for not playing paper is 100 (because we would have tied), but your regret for not playing scissors is 200, because you would have won instead of lost.

Formally, if I, J denote the strategy sets of the players (here I=J, and both are equal to {rock, paper, scissors}), then the regret of player 1 for action i, given the last-round moves i^*, j^*, is:

 \max(u(i,j^*)-u(i^*,j^*),0).

How do we get a strategy out of this? We keep a cumulative regret counter for each action, starting at zero. We play randomly in the first round; from the second round on, we add the regrets of the last round to the counters and play a mixed strategy, choosing each action with probability proportional to its (positive) cumulative regret.

Let’s take a look at the code in Python:

import random
import numpy as np

# Base player that chooses a constant mixed strategy every time
class Player:
    def __init__(self):
        self.my_moves = []
        self.other_moves = []

    def move(self, strategy):
        # Input: a vector of probability distributions for actions
        # Output: a pure action
        r = random.uniform(0,1)
        n_actions = len(strategy)
        a = 0
        cumulative_proba = 0.0
        while a<n_actions-1:
            cumulative_proba += strategy[a] 
            if r < cumulative_proba: return a
            a +=1
        return a

class RegretPlayer(Player):
    def __init__(self):
        super(RegretPlayer, self).__init__()
        self.regret_sum = np.zeros(3)

    def regret(self):

        if len(self.my_moves)>0:
            my_action = self.my_moves[-1]
            his_action = self.other_moves[-1]
        else:
            return np.zeros(3)

        # Payoffs from my perspective
        my_payoff = np.zeros(3)

        # If we play the same, I don't get any payoff
        my_payoff[his_action] = 0.

        # I win when he plays scissors and I play rock,
        # or when I play the "next" action (Rock = 0, Paper = 1, Scissors = 2)
        my_payoff[0 if his_action == 2 else his_action + 1] = 1

        # I lose when I play scissors and he plays rock, 
        # or when I play the "previous" action         
        my_payoff[2 if his_action == 0 else his_action -1] = -1

        regrets = [my_payoff[a]-my_payoff[my_action] for a in range(3)]
        regrets = np.array(regrets)
        return regrets

    def get_regret_mixed_strategy(self):
        # Add the regret from the last round to the cumulative regrets first
        self.regret_sum += self.regret()
        # Play each action with probability proportional to its positive cumulative regret
        strategy = np.maximum(self.regret_sum, 0.0)
        normalize_sum = strategy.sum()
        if normalize_sum > 0:
            strategy = strategy / normalize_sum
        else:
            # If no action has positive cumulative regret, play uniformly at random
            strategy = np.ones(3)/3
        return strategy

 

We can simulate a match between the regret player and a player using a fixed mixed strategy:

 

import matplotlib.pyplot as plt

def payoff(my_action, his_action):
    if my_action == his_action: 
        return 0
    if his_action == 0 and my_action==2 or my_action == his_action-1:
        return -1
    return 1

def run(n_rounds = 10):
    p1 = RegretPlayer()
    p2 = Player() 
    total_p1 = 0.0
    total_p2 = 0.0
    rounds_won_p1 = 0.
    plt.ion()
    plt.axis([-0.1,n_rounds+0.1,-0.1,1.1])
    print("*"*100)
    print("The match begins")
    print("*"*100)
    for n in range(1,n_rounds):
        regret_strategy_p1 = p1.get_regret_mixed_strategy()
        m1 = p1.move(regret_strategy_p1)
        m2 = p2.move([0.4,0.3,0.3])
        # Players update the info of the moves
        p1.my_moves.append(m1)
        p1.other_moves.append(m2)    
        total_p1 += payoff(m1,m2)
        total_p2 += payoff(m2,m1)
        #### Show results
        moves_map = {0:"Rock", 1:"Paper", 2:"Scissors"}
        print('-'*50)
        print("Regret:")
        print(p1.regret_sum)
        print("Strategy:", regret_strategy_p1)
        print("My move: %s" % moves_map[m1])
        print("His move: %s" % moves_map[m2])
        print("Payoffs:")
        print(total_p1)
        print(total_p2)
        rounds_won_p1 += 1 if payoff(m1,m2) >0 else 0
        # Plot the moves
        plt.title("Percentage of rounds won using a regret strategy")
        plt.scatter(n,rounds_won_p1/n, color = "red")
        plt.show()
        plt.pause(0.1)
if __name__ == "__main__":
    run(n_rounds = 100)

Multi-armed bandits, part 1

The k armed bandit problem can be phrased as follows:

  • There are k possible actions, \{a_1, a_2, \ldots , a_k\}. Actions are sometimes called arms (hence k-armed).
  • At time t = 1, 2, \ldots, an action is chosen and a reward R_t is allocated.
  • Each action gives an unknown reward, sampled from an unknown probability distribution.
  • The reward depends only on the action.
  • Goal: maximize the total reward.

These are a simpler class of reinforcement learning problems, in which there is no influence from a state. Our actions have a noisy effect on the reward, and our goal is to learn from these noisy signals; the effect of our actions on the reward is not influenced by any other externalities.

Bandit algorithms are useful in a number of applications:

  • Personalized content/news.
  • Online advertising (A/B testing and similar). You have probably heard of A/B testing: it is a bandit algorithm with a pure exploration phase followed by pure exploitation. Don’t worry, this will become clear soon.
  • Clinical trials (experimental treatments for patients while minimizing losses).
  • Portfolio optimization.

Solving bandit algorithms

Let’s figure out how to deal with bandits before moving on to harder reinforcement learning problems.

As we said, for every action a_i the reward may still be random, but it depends only on the action. The value of an action is its expected reward:

 q_*(a_i) := \mathbb E \left[ R_t \ | \ A_t = a_i \right], \ \ i = 1, 2, \ldots ,k.

To maximize rewards, we create estimates Q_t(a_i) \approx q_*(a_i), \ i = 1, 2, \ldots, k of the q_* function above and use those estimates to choose the best actions. The q_* function gives the best possible expected return from each action, whereas Q_t represents our current estimate of q_*, based on what we have learned from the interaction so far.

One way to learn about q_* is to keep track of the average reward of each action:

 Q_t(a_i) := \frac{\text{total reward of } a_i \text{ before } t}{\text{number of times } a_i \text{ was played before } t} = \frac{ \sum_{j=1}^{t-1}R_j\cdot 1_{A_j=a_i}}{\sum_{j=1}^{t-1} 1_{A_j=a_i}}.

By the law of large numbers, if the action is taken infinitely many times,

 \lim_{N_t(a_i) \to +\infty} Q_t(a_i) = q_*(a_i).
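
As a minimal sketch (not part of the original text), we can maintain these sample averages incrementally for a toy 3-armed Bernoulli bandit with assumed success probabilities p = [0.2, 0.5, 0.7]:

import numpy as np

p = [0.2, 0.5, 0.7]   # assumed success probability of each arm
k = len(p)
Q = np.zeros(k)       # current estimates of q_*
N = np.zeros(k)       # how many times each arm was played

for t in range(10000):
    a = np.random.randint(k)                 # play uniformly at random here, just to estimate q_*
    reward = float(np.random.rand() < p[a])  # Bernoulli reward
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]           # running average: total reward / number of plays

print(Q)  # approaches [0.2, 0.5, 0.7] by the law of large numbers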

Exploration vs Exploitation

A dilemma that arises in many decision problems is the trade-off between exploration and exploitation. If we spend a lot of time trying things at random, we risk our long-term reward because we commit too little to well-performing actions. Conversely, committing to something too soon might backfire: imagine we try 2 out of the 10 arms of our bandit and just stick to the better of those two. We might be ignoring a much better arm if we don’t get adventurous and try it! When we always commit to the best action seen so far, we say we use a greedy policy.

  • The greedy policy at time t chooses actions A_t^* such that:
    A_t^* \ \in \ \mathrm{argmax}_{i = 1, 2, \ldots ,k} Q_t(a_i).
  • Choosing A_t = A_t^* at time t means exploiting the current knowledge, whereas choosing A_t \neq A_t^* means exploring.

One way to deal with the exploration-exploitation trade-off is to be \epsilon-greedy:

  • Create an action-value table (a Python dictionary) storing the
    history of the payoffs obtained by each action.
  • Calculate the average payoff of each action.
  • Choose with probability \epsilon a non-optimal action.
  • Choose with probability 1-\epsilon one of the greedy actions
    (those with highest average payoff).
  • Break ties randomly.

The choice of \epsilon is important, and we can observe different types of behaviour.

The figures below illustrate the rewards an agent gets for choosing different arms with different reward probabilities (shown in the third image from the right). The reward is 1 when you get it, and 0 otherwise. We see that the very conservative player (with a small \epsilon) gets stuck with a single action from the beginning.

At the opposite extreme, with a value of \epsilon=1, we see a player that does a bit better and explores a wider range of actions (in the second picture). However, the mean reward stays barely around the average, because it is just playing randomly.

Finally, we have a more balanced agent, which gets a much higher reward on average because it is able to identify the correct actions over time.

Interested? Check out Part 2! Or the book on Leanpub.

Multi-armed bandits, part 2

Implementation

There are two important parts to the implementation. On the one hand, we have to implement an environment that simulates the rewards of the arms. The skeleton of this class is given below:

class Arm(object):
    def __init__(self, params):
        ## passes the required parameters
        ## this could be the reward probability, or other parameter (mean, sigma) of a lottery
        ## that gives the rewards.
        ...

    def step(self):
        ## gives a reward
        ...

From the point of view of the agent, we need to implement the decision making mechanism. The skeleton of this class would be:

class Bandit(object):
    def __init__(self, *args, **kwargs):
        ...

    def reset(self):
        ...

    def choose_action(self):
        ## Chooses a pure action, i.e. an arm
        ...

    def update(self, action, reward):
        #### Updates the values of N and Q
        ...

As for the \epsilon-greedy part:

import random
import numpy as np

def argmax(array):
    # works for lists and numpy arrays alike
    return int(np.argmax(array))

class EpsilonGreedy:
    def __init__(self, epsilon, n_actions):
        self.Q = np.zeros(n_actions)
        self.N = np.zeros(n_actions)
        self.epsilon = epsilon     
        self.n_actions = n_actions

    def choose_action(self):
        if random.random()>self.epsilon:
            a = argmax(self.Q)
        else:
            a = random.randint(0,self.n_actions-1)
        return a
    def update(self, action, reward):
        self.N[action] += 1
        # incremental update of the sample-average estimate
        self.Q[action] = self.Q[action] + (reward - self.Q[action])/self.N[action]

In practice, we can also change the value of \epsilon at every iteration, that is, use \epsilon = \epsilon_t, to reduce the exploration rate over time. As the value of \epsilon goes to zero, we approach the greedy policy. Under a good \epsilon reduction schedule, we stop exploring once the good actions have been identified. Examples of cooling schedules are \epsilon_t = 1/t or \epsilon_t = \epsilon^t.
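
As a minimal sketch (not part of the original post), an annealed agent can reuse the EpsilonGreedy class above and simply shrink epsilon on every call, here with the assumed schedule \epsilon_t = \epsilon_0 / t:

class AnnealedEpsilonGreedy(EpsilonGreedy):
    def __init__(self, epsilon, n_actions):
        super().__init__(epsilon, n_actions)
        self.epsilon0 = epsilon  # initial exploration rate
        self.t = 0               # number of decisions taken so far

    def choose_action(self):
        self.t += 1
        self.epsilon = self.epsilon0 / self.t  # cooling schedule epsilon_t = epsilon_0 / t
        return super().choose_action()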

Other action selection methods

Besides \epsilon-greedy, other action selection methods can be used. We will briefly describe some of them.

Softmax action selection

It turns out that \epsilon-greedy has the (sometimes unpleasant) property of choosing a really bad action with the same probability as choosing some not-too-bad action. To avoid this, we can choose our action via a mixed strategy \pi_t (that is, randomizing), using the softmax function:

\pi_t(a_i):= \frac{e^{Q_t(a_i)/\tau}}{\sum^k_{m=1}e^{Q_t(a_m)/\tau}}.

Here \tau is a parameter called the temperature. In the limit, when \tau \to 0, softmax action selection approaches greedy action selection, while large values of \tau make the play closer to uniformly random.

For both softmax and \epsilon-greedy action selection, we can decrease the parameter (\tau or \epsilon) towards zero as the number of steps increases. This process is called annealing, and it makes the bandit algorithm explore less over time.
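
As a minimal sketch (not from the original post), softmax action selection over an array Q of current action-value estimates could look like this, with the temperature tau as a parameter:

import numpy as np

def softmax_action(Q, tau=0.5):
    prefs = (np.asarray(Q) - np.max(Q)) / tau      # subtract the max for numerical stability
    probs = np.exp(prefs) / np.sum(np.exp(prefs))  # softmax distribution over actions
    return np.random.choice(len(Q), p=probs)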

Upper Confidence Bound action selection

In this method, we want to be confident about the performance of an action before choosing it. Formally, confidence estimates are used to give us upper bounds on the true action values. The action selection mechanism is:

A_t := \mathrm{argmax}_{i = 1, \ldots ,k} \left[ Q_t(a_i) + 2 \cdot C\sqrt{\frac{\log t}{N_t(a_i)}} \right].

where C is an upper bound for the reward. This gives us a way to control the randomness, but it comes at a price: UCB suffers from back-pedaling, which roughly means “I know this is bad, but I’m not completely sure how bad it is”. You end up choosing bad actions because you are not completely sure of their true value. It also happens that UCB algorithms may not become strictly greedy.
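
As a minimal sketch (not from the original post) of the selection rule above, assuming Q and N are arrays with the current estimates and play counts, C the reward upper bound, and t the current time step:

import numpy as np

def ucb_action(Q, N, t, C=1.0):
    untried = np.where(N == 0)[0]
    if len(untried) > 0:
        return int(untried[0])               # play every arm once before using the bound
    bonus = 2 * C * np.sqrt(np.log(t) / N)   # exploration bonus from the formula above
    return int(np.argmax(Q + bonus))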

Reinforcement Comparison

Last but not least, a method closely related to actor-critic algorithms in reinforcement learning. The motivation behind it is simple: how much reward counts as a good reward? Once we decide on this, we can choose our actions accordingly. The algorithm works as follows:

  • Let p_t(a_i) be a learned preference for taking action a_i.
  • The policy \pi_t(a_i):= \frac{e^{p_t(a_i)}}{\sum^k_{m=1}e^{p_t(a_m)}} is a mixed strategy.
  • The preference is updated as:
    p_t(a_i) = p_{t-1}(a_i) + \alpha(R_{t-1}-\bar{R}_{t-1})(1_{A_{t-1}=a_i}-\pi_{t-1}(a_i)),

where

\bar{R}_{t-1} := \frac{1}{t-1}\sum_{j=1}^{t-1}R_j is the average reward observed so far. Note that this reduces to softmax selection for a particular choice of the preference function p; a small sketch of this update follows.
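
As a minimal sketch (not from the original post) of this preference update, assuming a step size alpha and keeping a running average of the rewards as the baseline:

import numpy as np

def softmax(p):
    z = np.exp(p - np.max(p))
    return z / np.sum(z)

def reinforcement_comparison_step(p, R_bar, t, action, reward, alpha=0.1):
    # p: preference vector, R_bar: average reward so far, t: current round (t >= 1)
    pi = softmax(p)
    one_hot = np.zeros_like(p)
    one_hot[action] = 1.0
    p = p + alpha * (reward - R_bar) * (one_hot - pi)  # preference update from above
    R_bar = R_bar + (reward - R_bar) / t               # update the reward baseline
    return p, R_bar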

Diagnosing a bandit algorithm

There are different metrics to see how a bandit algorithm is performing.

  • Method 1: Measure how often the best arm is chosen.
    • That sounds fine, but what if the arms have similar rewards? Then this metric does not tell us much.
  • Method 2: Average reward at each point in time (is it really getting better?).
    • However, this yields a poor average reward for algorithms that explore a lot, so we should keep it in mind when we tweak the hyperparameters of our algorithm (\epsilon in this case).
  • Method 3: Cumulative reward.
    • This is a “big-picture” metric: we can see whether the cost of exploration was worth it.

Examples

Let’s illustrate a couple of application domains.

Clinical trials

In clinical trials, the goal is to evaluate k possible treatments for a disease. Patients are allocated into k groups and the reward is 0 or 1, depending on the success of the treatment. This is a typical example where one would rather use softmax than \epsilon-greedy, because choosing a treatment at random is not the same as choosing the second most promising one. Annealing should also come into play: in later stages of the trial, a greater fraction of the subjects should be assigned to the better treatments.

Internet advertising

Each time you visit a website, the publisher must choose one of k ads to display, and the reward for the publisher is 0/1 (no click/click). However, not all users are alike, and the same user behaves differently over time: ads should match the context of the user’s activity. For this, a different class of algorithms, called contextual bandits, is used. A contextual bandit receives an external signal indicating the current context and finds, for each context, the best actions. This is almost a full reinforcement learning problem, except that the action has no influence on how the next context is generated.
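
As a minimal sketch (not from the original post), a contextual \epsilon-greedy bandit can simply keep one independent EpsilonGreedy learner (the class defined above) per context, where the contexts are assumed to be known labels such as user segments:

class ContextualEpsilonGreedy:
    def __init__(self, contexts, n_actions, epsilon=0.1):
        # one independent learner per context (e.g. per user segment)
        self.learners = {c: EpsilonGreedy(epsilon, n_actions) for c in contexts}

    def choose_action(self, context):
        return self.learners[context].choose_action()

    def update(self, context, action, reward):
        self.learners[context].update(action, reward)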

A major issue in real-life deployments is that reward functions are messy and ill-specified. An amusing example can be seen at https://blog.openai.com/faulty-reward-functions/, where an agent learns to maximize its reward function, defined as the score in a video game. The agent learns the periodicity with which some bonus-giving tokens regenerate on screen, and goes around in circles over and over instead of finishing the episode.

 

Interested? Check out Part 1! Or the book on Leanpub.

4 reasons to invest in open-source data science training

With the ever-changing technology landscape come lots of challenges for companies, especially around data science, where we are witnessing a constant expansion of technologies and tools. Every company is becoming increasingly data-driven, and the best way to remain relevant in the next five years is to be prepared.

What concrete advantages does open-source data science bring to your organisation?

1. Develop your own use cases.

As the dust settles and clear winners emerge (R & Python), organizations should get ready to jump into data science and identify relevant use cases for their businesses. The focus is moving away from technologies and, finally, onto bottom-line value. No one is better prepared than your own trusted and experienced employees to come up with the right use cases for data science in your business. You do not need an expensive consultancy: you already have the experts!

2. In-house customer support.

Although “open-source software” is understood as free, as in “free beer”, it comes at a cost. Yes, you don’t have to pay a hefty yearly license, but you do have to adapt the software to your infrastructure and needs. A large chunk of the revenue of open-source vendors comes from ongoing support. The way out? Train your own technology experts.

3. Increased productivity and reduced employee turnover.

A happy employee is an employee who stays with you. Open-source technologies are here to stay, and smart, valuable data scientists and data analysts want to work with them. Proprietary software limits their creativity, and that will eventually make them leave. It is not pleasant to be locked into the limited tooling that vendors offer, no matter how “easy” they make it sound.

4. Trained employees can be promoted.

Although there are signs that data science is starting to be commoditized, with data science and machine learning bootcamps and even master’s and PhD degrees everywhere, you cannot really expect any of these “instant experts” to take on a senior role. Investing in your employees now will save you money in the future, when you are fully committed to embracing advanced analytics in your business, as new joiners will find strong mentors and leaders within your team.

 

Interested in upgrading your analytics / open-source software in-house capabilities? Reach out!

Is it really worth learning AI/ML?

Every minute, a new data science course is born somewhere. Granted, I totally made this up, but I might not be completely wrong. Lots of research groups are making breakthroughs in data science and artificial intelligence every day, and yes, there is hype, but there is also optimism, strongly supported by evidence.

So, is it really time to jump in and drop one’s career and everything to switch to AI? Is it finally time to join Coursera/Udacity/Edx/your favorite MOOC/university?

Well, no.

Continue reading “Is it really worth learning AI/ML?”

Where is higher education going?

Having spent most of my twenties in academia before moving on to business, I have seen my fair share of both worlds, and the disconnect between them is worrying.

We are not teaching students the skills they really need. We send them to the workplace unprepared, much to the dismay of employers, who need to spend valuable time and resources training them. Why are we teaching them Lisp, when the probability of them using it is close to zero? I cannot believe that there are so many self-branded machine learning / artificial intelligence programs without a single course in R or Python, for instance.

I’m obviously speaking from my own trench (data science), but I see it happening in other fields too. For many specializations, including the humanities, a decent course in spreadsheet software is long overdue. Instead, the focus is lost on highly theoretical subjects and on memorizing tons of material that is now easily available online.

We should focus on problem solving skills. Why is it taking so long for universities to catch on?