If you haven’t yet read the reinforcement learning primer, go back and read it first. That article covers the key concepts in reinforcement learning, so you will be ready to compare the different types of machine learning.

Comparing reinforcement learning to other types of ML algorithms

You may have heard about other types of machine learning, such as supervised and unsupervised learning. Understanding how reinforcement learning (RL) differs from them is a good way to grasp the machine learning landscape.

A high-level breakdown of the three major categories of machine learning

Supervised Learning

The easiest type of ML to grasp is supervised learning: learning from human-provided labels. Image classification is one example. You train an algorithm on labeled images, and the system learns to classify an image as a cat or a dog. The algorithm learns from observing the training set and can then infer the subject of an image it has never seen.

Another good example of a supervised learning problem is a regression problem. In a regression problem, you take a bunch of parameters and estimate a real, continuous value based on those parameters. For example, you could take in information about a house (the number of rooms, square footage, number of windows, etc.) and output a price. We know what a lot of houses are worth and can feed those labeled examples into the algorithm. Then, when you present a new house to the system, it can come up with a good estimate for the price on its own. Problems like these are easy to frame as supervised learning problems.
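To make the house-price idea concrete, here is a minimal sketch that fits a straight line to a single feature (square footage) with ordinary least squares. The numbers are invented toy data, and `fit_line` and `predict_price` are hypothetical helper names, not part of any library. A real model would use many more features and examples.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up labeled examples: square footage and the known sale price.
square_feet = [800, 1200, 1500, 2000, 2400]
prices = [130_000, 170_000, 200_000, 250_000, 290_000]

slope, intercept = fit_line(square_feet, prices)


def predict_price(sqft):
    """Estimate the price of a house the model has not seen before."""
    return slope * sqft + intercept


print(predict_price(1800))
```

The labels (known prices) are what make this supervised: the fit is judged against answers we already have.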

Unsupervised Learning 

On the flip side, we have unsupervised learning: learning without labels. A good example of this is taking user purchase data and grouping your customers into categories with similar buying patterns. Your algorithm does the grouping, and you can suggest products to people within a certain category. We do not tell the algorithm what a label or a category name is; we simply hand it a bunch of data, and it creates groups based on patterns in the data. Unsupervised learning is also used extensively for visualizing large amounts of complex data, because condensing the data into one image makes it easier for a human to take in all the information at once.
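As a rough illustration of grouping without labels, the sketch below splits customers into two groups by monthly spend using a tiny one-dimensional version of k-means. The spend values are invented, and `split_into_two_groups` is a hypothetical helper written for this example. It assumes exactly two groups, which a real clustering job would not know up front.

```python
def split_into_two_groups(values, iterations=10):
    """Tiny 1-D k-means with k=2: no labels are given, only the raw numbers."""
    low_center, high_center = min(values), max(values)
    low_group, high_group = list(values), []
    for _ in range(iterations):
        # Assign every value to whichever center it is closer to.
        low_group = [v for v in values if abs(v - low_center) <= abs(v - high_center)]
        high_group = [v for v in values if abs(v - low_center) > abs(v - high_center)]
        if not high_group:  # all values identical: nothing to split
            break
        # Move each center to the mean of its group.
        low_center = sum(low_group) / len(low_group)
        high_center = sum(high_group) / len(high_group)
    return low_group, high_group


# Made-up monthly spend per customer.
monthly_spend = [12, 15, 14, 90, 95, 11, 88]
light_buyers, heavy_buyers = split_into_two_groups(monthly_spend)
print(light_buyers, heavy_buyers)
```

Note that the names `light_buyers` and `heavy_buyers` are ours. The algorithm only produces the groups; a human decides what to call them.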

Using unsupervised learning, we can find the underlying patterns in data

Reinforcement Learning

Reinforcement learning is frequently described as falling somewhere in between supervised and unsupervised learning. Instead of labels, the algorithm receives time-delayed feedback (rewards) as it learns to interact with an environment, and what it learns depends on how the learning problem is phrased. This is what makes reinforcement learning well suited to things like real-time decision making, video game AI, robot navigation, and other complex tasks. The key is giving the system the ability to understand which decisions are good and which ones are bad, based on the current state of the environment.


Applying these concepts 

In the previous article, we covered the basic concepts of reinforcement learning. Here is a little summary of what we have covered so far in the form of a concrete example: imagine a mouse in a basic maze. The mouse will be our agent.

To start, we will check the things our agent needs: 

  • Goal: the mouse has a goal of maximizing the amount of cheese it obtains 
  • Actions: the mouse can move in any of the four cardinal directions 
  • Senses: the mouse can observe the state of the environment it is in (start, nothing, small cheese, two small cheeses, big cheese, and death). For our simple example, this simple sense of the state of the environment is more than enough.
Our simple mouse agent exploring for cheese!

Furthermore, let’s look at the sub-elements of our problem and see how they measure up: 

  • The policy: in any given state, which of the four actions will our mouse take?
  • The reward signal: positive (a cheese was obtained; but how big of a cheese?), neutral (nothing state was reached), or negative (death state has ended our game).
  • The value function: this is something that our mouse will construct and maintain on the fly. It may be adjusted through the course of an iteration or over many runs through the maze.
  • The model: if we allow our mouse to be aware of the size of its environment, it can store a model of it in its memory. We can represent the world as a 2D grid (array), allowing the mouse to fill in whether there is positive, negative, or no reward in a given grid square as it runs through and observes the actual environment.
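The pieces above can be sketched in a few lines. This is only an illustration under stated assumptions: the maze layout, the numeric reward for each state, and the `step` helper are all invented here, since the article does not specify them. The mouse's model starts out as a grid of unknowns (`None`) and gets filled in as it observes rewards.

```python
# Hypothetical reward for each state the mouse can sense.
REWARDS = {
    "start": 0, "nothing": 0, "small cheese": 1,
    "two small cheeses": 2, "big cheese": 10, "death": -10,
}

# The real maze (invented layout). The mouse does not see this directly.
MAZE = [
    ["start",   "small cheese", "nothing"],
    ["nothing", "death",        "two small cheeses"],
    ["nothing", "nothing",      "big cheese"],
]

# The four cardinal actions as (row change, column change).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(position, action):
    """Apply an action; walls keep the mouse inside the grid."""
    row, col = position
    d_row, d_col = MOVES[action]
    new_row = min(max(row + d_row, 0), len(MAZE) - 1)
    new_col = min(max(col + d_col, 0), len(MAZE[0]) - 1)
    state = MAZE[new_row][new_col]
    return (new_row, new_col), REWARDS[state], state == "death"


# The mouse's model: same size as the maze, nothing known yet.
model = [[None] * len(MAZE[0]) for _ in MAZE]

position = (0, 0)
position, reward, done = step(position, "right")
model[position[0]][position[1]] = reward  # remember what was observed
print(position, reward, done)
```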

Let’s dissect a basic, greedy policy an agent might employ:

One such policy is a Q-table strategy. Q-table stands for ‘quality table’: a table of states and actions, along with the rewards associated with them. We could employ a basic strategy that says: when we encounter a state, choose the action that will give our agent the most reward. When our agent doesn’t know which action that is, choose one randomly.

A basic Q-table where the rows are potential states and the columns are the actions our agent can take

In the beginning, our mouse’s table is empty. It knows nothing. It chooses its actions randomly and may, for example, move right and receive a small amount of cheese. That is good, and our agent receives a reward signal! The table gets updated accordingly, and our agent keeps choosing actions until it has exhausted all possibilities or has died.
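Here is a minimal sketch of that greedy strategy, assuming the Q-table is a plain dictionary keyed by `(state, action)` and that unknown entries count as zero. The names `choose_action` and `record_reward` are hypothetical, and storing the raw reward is the simplest possible update, not a full learning rule.

```python
import random

ACTIONS = ["up", "down", "left", "right"]


def choose_action(q_table, state, rng=random):
    """Greedy choice: take the best-known action, picking randomly among ties."""
    values = [q_table.get((state, action), 0.0) for action in ACTIONS]
    best = max(values)
    best_actions = [a for a, v in zip(ACTIONS, values) if v == best]
    # With an empty table every action ties at 0.0, so this is a random pick.
    return rng.choice(best_actions)


def record_reward(q_table, state, action, reward):
    """Simplest update: remember the reward seen for this state and action."""
    q_table[(state, action)] = reward


q_table = {}
first_move = choose_action(q_table, "start")    # random, nothing is known yet
record_reward(q_table, "start", "right", 1)     # moving right found a small cheese
second_move = choose_action(q_table, "start")   # now always "right"
print(first_move, second_move)
```

Once a single positive entry exists, the greedy rule picks it every time, which is why the mouse keeps returning to the first cheese it found.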

Hurrah! Our mouse gets some cheese!

Already, you may see an issue cropping up: when we restart our maze, our agent is inclined to always move towards the small cheese, never opting for an unknown alternative. This is called the exploration-versus-exploitation tradeoff, but we will come back to that in a bit.

Updating our Q-table for the reward that we have received.

Now that we have visualized how these components work together, let’s take a dive into some of the things that are needed for any reinforcement learning problem that we wish to solve. 

Phrasing reinforcement learning with tasks   

One of the major things to look at in a reinforcement learning application is how the task is structured. Tasks are typically broken down into two categories: episodic or continuous.

Episodic Tasks

Episodic tasks have distinct start and end states. We can save these “episodes” and train on them “off-line.” A prime example would be our Mario levels from our previous article.  


Continuous Tasks

Continuous tasks have no end. An example is a decision-making algorithm that predicts when someone should buy or sell stocks in the stock market. The market is always evolving and changing, with a lot of environmental factors. There are no clear starting and stopping states that would let us section off an episode to train on, and any cut we made would risk fitting our algorithm too closely to a small segment of time.

The stock market is always changing. To cut it up into episodes is to ignore the linked continuity of how it evolves

How we formulate our agent’s goals and rewards is shaped by the type of task we are looking to complete, because it can change the nature of when we learn (something we will talk about next).  

When to learn

Timing is critical in how an agent will perform on a task. Perhaps an agent should be learning at every frame of gameplay, or maybe the agent learns in episodes. We could employ a Monte Carlo strategy: run through an entire episode, learn from the rewards collected along the way, and get better with each iteration. These options have different tradeoffs and may or may not be feasible depending on the type of task our agent is trying to complete (a continuous task cannot use the Monte Carlo strategy as described, since it requires a complete episode to train on, something that doesn’t exist for a continuous task!).
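To see why Monte Carlo learning needs a finished episode, consider this small sketch. It walks backwards through the rewards of one complete episode and computes the total future reward from each step. The discount factor `gamma` is an assumption we introduce here (it makes later rewards count a little less), and `discounted_returns` is a hypothetical helper name.

```python
def discounted_returns(rewards, gamma=0.9):
    """Total (discounted) future reward from each step of a finished episode."""
    returns = []
    running = 0.0
    # Work backwards: the return at a step is its reward plus the
    # discounted return of everything that came after it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# One made-up episode: two empty squares, then a cheese worth 1.
print(discounted_returns([0, 0, 1], gamma=0.5))  # [0.25, 0.5, 1.0]
```

The loop starts from the last reward, so nothing can be computed until the episode has actually ended. An agent that learns at every step has to estimate that future reward instead of waiting for it.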


Exploration vs. exploitation tradeoff

The exploration-versus-exploitation tradeoff is quickly encountered when an agent explores an environment. If an agent finds out early on that doing something simple earns a small amount of reward, it will likely keep doing that simple thing over and over again, accumulating small rewards over time. If it instead explores the unknown and tries to find new situations, it may gain an even larger reward.

In human terms, this is like asking: do you go to the restaurant that you always go to and that you know will be good? Or do you venture into the unknown and check out the place that you’ve never tried before that might be completely fantastic?

If you ask me, the new place looks pretty fantastic

How an agent’s policy is structured will determine what kind of actions it will learn to exploit and when it will decide to explore. Exploring early on may yield much higher long-term rewards. However, focusing too much on exploration may result in sub-optimal actions in states that we already know a lot about, and we end up with fewer rewards than we could have gotten.
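One common and very simple way to balance the two is an epsilon-greedy rule, sketched below. It is just one option among many, and the article does not prescribe it. The parameter `epsilon` (the chance of exploring) and the function name `epsilon_greedy` are assumptions made for this example.

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if rng.random() < epsilon:
        # Explore: try any action, including ones that look bad so far.
        return rng.choice(list(action_values))
    # Exploit: go back to the restaurant we already know is good.
    return max(action_values, key=action_values.get)


known_values = {"left": 0.1, "right": 1.0}
print(epsilon_greedy(known_values, epsilon=0.0))  # always "right"
print(epsilon_greedy(known_values, epsilon=1.0))  # a random action
```

A frequent design choice is to start with a large `epsilon` and shrink it over time, so the agent explores early and exploits later.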

The exploration-versus-exploitation tradeoff is still very much an open question and is a particularly interesting area of research, in my opinion.  


This brings us to another significant factor in making a reinforcement learning application. Is it value-based or policy-based?

Policy-Based Approach

We’ve mentioned before that an agent’s policy is how it decides which action to take based on the current state of the environment. An RL agent with a policy-based approach to learning will try to learn a complex policy with a decision structure that allows it to take the optimal action in any given situation.
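As a small sketch of what learning a policy directly can look like, the code below keeps a preference number per action, turns the preferences into probabilities with a softmax, and nudges them after each reward. This is a simplified policy-gradient style update for a single state, chosen by us for illustration. The names `softmax`, `sample_action`, `policy_gradient_step`, and the `step_size` parameter are all assumptions.

```python
import math
import random


def softmax(preferences):
    """Turn action preferences into probabilities that sum to 1."""
    highest = max(preferences.values())
    exps = {a: math.exp(p - highest) for a, p in preferences.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def sample_action(preferences, rng=random):
    """The policy itself: draw an action according to its probability."""
    probs = softmax(preferences)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions])[0]


def policy_gradient_step(preferences, action, reward, step_size=0.1):
    """Make the taken action more likely if the reward was positive."""
    probs = softmax(preferences)
    for a in preferences:
        indicator = 1.0 if a == action else 0.0
        preferences[a] += step_size * reward * (indicator - probs[a])


preferences = {"up": 0.0, "down": 0.0, "left": 0.0, "right": 0.0}
policy_gradient_step(preferences, "right", reward=1.0)
print(softmax(preferences))
```

Notice that no value estimate is stored anywhere. The agent only adjusts how likely each action is.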

Value-Based Approach

On the other end of the spectrum, we have our value-based RL applications. The value function is the current estimate of the long-term reward that our RL algorithm will accumulate. If we have a value-based agent, it will focus on optimizing based on that function. That includes learning better and better estimates for the long-term reward, as well as taking greedy actions to maximize that function at any given time. In a lot of ways, we can think of this as an agent learning an implicit greedy policy for taking actions.
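One well-known value-based method is Q-learning, and a single update step can be sketched as follows. The article does not say which update rule it has in mind, so treat this as one possible example. The `learning_rate` and `discount` parameters, the dictionary layout, and the state names are assumptions, and the sketch skips special handling for terminal states.

```python
ACTIONS = ["up", "down", "left", "right"]


def q_learning_update(q_table, state, action, reward, next_state,
                      learning_rate=0.5, discount=0.9):
    """Move the estimate for (state, action) toward reward + best future value."""
    # Greedy look-ahead: assume we act as well as we currently know how.
    best_next = max(q_table.get((next_state, a), 0.0) for a in ACTIONS)
    old_estimate = q_table.get((state, action), 0.0)
    target = reward + discount * best_next
    q_table[(state, action)] = old_estimate + learning_rate * (target - old_estimate)
    return q_table[(state, action)]


q_table = {}
# First visit: moving right from s0 earned a reward of 1.
first_estimate = q_learning_update(q_table, "s0", "right", 1.0, "s1")
print(first_estimate)  # 0.5
```

The `max` over the next state's actions is the implicit greedy policy mentioned above: the agent never stores a policy, it just reads one off the value estimates.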


Actor-Critic Approach

The decision between a value-based and a policy-based algorithm is a significant one in deciding what a reinforcement learning algorithm will look like. The intersection of these two lines of thinking is called the actor-critic approach. It keeps track of estimated future rewards (our value function) while also learning new, more complex policies that get our agent larger rewards over longer time scales. It quickly becomes a much harder problem, since the algorithm now optimizes two functions at once.
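To show the two functions being updated side by side, here is a rough one-step actor-critic sketch in table form. It is a textbook-style simplification written for this article, not A3C and not any particular library's implementation. The learning rates, the `discount` parameter, and all function and state names are assumptions.

```python
import math


def softmax(preferences):
    """Turn action preferences into probabilities that sum to 1."""
    highest = max(preferences.values())
    exps = {a: math.exp(p - highest) for a, p in preferences.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def actor_critic_step(values, preferences, state, action, reward, next_state,
                      discount=0.9, critic_rate=0.5, actor_rate=0.1):
    """Update the critic (state values) and the actor (action preferences)."""
    # Critic: how much better or worse was this step than expected?
    td_error = (reward + discount * values.get(next_state, 0.0)
                - values.get(state, 0.0))
    values[state] = values.get(state, 0.0) + critic_rate * td_error

    # Actor: make the taken action more likely if the surprise was positive.
    probs = softmax(preferences[state])
    for a in preferences[state]:
        indicator = 1.0 if a == action else 0.0
        preferences[state][a] += actor_rate * td_error * (indicator - probs[a])
    return td_error


values = {}
preferences = {"s0": {"up": 0.0, "down": 0.0, "left": 0.0, "right": 0.0}}
td_error = actor_critic_step(values, preferences, "s0", "right", 1.0, "s1")
print(td_error, values["s0"], preferences["s0"])
```

The critic's error signal replaces the raw reward in the actor's update, which is the link between the value-based and policy-based halves.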

There is a lot of research focus on actor-critic methods, and many cool algorithms have come out of it. Google’s asynchronous advantage actor-critic (A3C) is a prime example of a cool actor-critic algorithm that has shown a lot of good results.


Over the last two articles, we have covered the basic terminology as well as some of the more complicated concepts around a reinforcement learning problem. Hopefully, with these two components, you feel that you have a good grasp on what reinforcement learning is and some of the considerations that go into writing an algorithm using it. 

Right now, you might be feeling super excited about RL. You may be wondering how you can get started on systems that work on RL. In the part 3 article, we will dive into where reinforcement learning is excelling, what the major open questions are, and some resources on learning to write RL algorithms yourself!
