
Imagine a world where every computer system is customized specifically to your own personality. It learns the nuances of how you communicate and how you wish to be communicated with. Interacting with a computer system becomes more intuitive than ever and technological literacy skyrockets. These are the potential outcomes you could see in a future where reinforcement learning is the norm.

In this article, we are going to break down reinforcement learning and dissect some of the components that come together to make up a reinforcement learning system.

What is reinforcement learning?

If you’ve never heard of reinforcement learning (RL) before, don’t fret! The concept is very straightforward. At a very high level, reinforcement learning is simply an agent learning to interact with an environment based on feedback signals it receives from the environment. This makes it different from other machine learning approaches where a learning agent might see a correct answer during training. In reinforcement learning, we can think of our learning agent as getting a grade or a score to let it know about its performance.

Let’s frame this idea in terms of a video game. Say we have a computer program that plays the game Mario. It learns to control the character, receiving feedback from the environment in the form of a changing screen. Based on the successes (or failures) of our algorithm, it can learn to interact with the environment and improve by using the feedback it receives.

To learn about the environment, we need to explore! The only way we can find out that Goombas are bad and power-ups are good is through trial and error and feedback.

Reinforcement learning tries to imitate the way a human or other intelligent being might interact with a new environment: trial and error. It is born out of the culmination of research in many fields such as computer science, psychology, neuroscience, mathematics, and more. Though it is uncommon to see RL in industry today, its potential for impact is huge.


 Reinforcement learning really is the culmination of many fields and has a rich history in optimization and behavioral psychology.

This potential is what I am aiming to unpack for you.  

Reinforcement learning vocabulary as a Mario Bros. game

We have already touched upon the classic example of an RL agent playing a video game. Now let’s continue to use our Mario example while we dig a little deeper into that idea and the vocabulary around the concept.

The Agent: Mario  

To start, we have our agent. Our agent is our algorithm and our program. It is the brains of the operation. It is what is going to interact with our environment. In this context, our agent is Mario and he will call all the shots. 


Our agent: Mario

The Environment: Game Level  

The agent exists within the scope of an environment. The environment is the level of Mario we are playing. It is the enemies on the screen and the blocks that make up the world. It is the clock that is ticking down and the score that is going up (or so we hope!). Our agent’s goal is to interact with the environment in such a way that it gains a reward.


Our environment: a simple level

The Action: Jump, duck, move forward 

What is a reward and how does our agent receive it? Well, our agent has to interact with the environment. It can do so by choosing an action from a list of potential actions that are possible for it to take. Maybe our agent, Mario, decides to jump upwards. Or move to the right or the left. Perhaps it has a fireball power-up and so decides to fire one off. The point is, each of these actions will alter the environment and will result in a change. Our agent can observe this change, use it as a feedback signal, and learn from it.


The interface a human might use to execute actions that affect the environment

The State: Mario + Action + Environment = State  

These changes that our agent observes are changes to the state of the environment. The new state that our agent observes may generate a “reward” signal. By combining the action it took, the change in state, and the potential reward received from that change in state, the agent begins to build a working model of the environment it is exploring.

 The state holds all the information about what’s going on in the environment from what we can observe. Things like where our character is, our current score, and enemies on the screen all play into what the state of our environment is currently. 
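To make the idea of a state a little more concrete, here is a minimal sketch in Python. The `GameState` class and all of its fields are hypothetical names chosen for illustration; a real Mario agent would more likely observe raw screen pixels. The sketch only shows that a state is a snapshot of what we can observe, and that a reward can be computed from how the state changes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GameState:
    """Hypothetical snapshot of everything the agent can observe."""
    mario_position: Tuple[int, int]               # (x, y) tile of our character
    score: int                                    # current score
    time_left: int                                # the clock ticking down
    enemy_positions: Tuple[Tuple[int, int], ...]  # enemies on the screen


# State before the agent acts.
before = GameState(mario_position=(3, 0), score=100, time_left=300,
                   enemy_positions=((5, 0),))

# State after an assumed jump that lands on the enemy:
# Mario has moved, the score went up, and the enemy is gone.
after = GameState(mario_position=(5, 0), score=200, time_left=299,
                  enemy_positions=())

# One simple (assumed) way to turn a change in state into a reward signal.
reward = after.score - before.score
```

Making the state immutable (`frozen=True`) is a design choice for this sketch: it lets us use states as dictionary keys later, for example when storing a value estimate per state.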

The Reward: Points + Staying Alive   

If the agent learns that when it jumps and lands on an enemy, it gets a point boost and can no longer get killed by said enemy, that is a good thing to learn! It also might learn that if Mario falls down into a hole, the game is over and there is no future opportunity to gain any more points or to win the level. These are things that the agent may learn over time: the more it interacts with the environment, the more it learns.


In Mario, a good way to measure reward might be the score!

That covers all the major components that play into a reinforcement learning problem. The important things to retain from this section are the agent, environment, actions, state, and rewards. Try to keep a working definition of each in your head.

This image pulls these all together very nicely if you’re more of a visual learner.



All of the components coming together to make up how an agent learns from its environment!
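If you prefer code to pictures, the same loop can be sketched in a few lines of Python. Everything here is an assumption made for illustration: `CorridorEnv` is a made-up one-dimensional "level" rather than a real Mario emulator, and the two agents are deliberately trivial. What matters is the cycle: the agent picks an action, the environment returns a new state and a reward, and the loop repeats.

```python
import random


class CorridorEnv:
    """Hypothetical 1-D level: start at tile 0, the goal is the last tile."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        """Start a new episode and return the initial state."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply "left" or "right" and return (state, reward, done)."""
        move = 1 if action == "right" else -1
        # Walls at both ends keep the agent inside the corridor.
        self.position = max(0, min(self.length - 1, self.position + move))
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done


def run_episode(env, choose_action, max_steps=100):
    """Run one episode and return the total reward the agent collected."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = choose_action(state)           # the agent acts
        state, reward, done = env.step(action)  # the environment responds
        total_reward += reward                  # feedback signal
        if done:
            break
    return total_reward


rng = random.Random(0)


def random_agent(state):
    """An agent that explores blindly."""
    return rng.choice(["left", "right"])


def always_right(state):
    """An agent that has already figured out where the goal is."""
    return "right"


print(run_episode(CorridorEnv(), always_right))
```

Neither agent learns anything yet. Learning means using the rewards collected in this loop to change how `choose_action` behaves over time.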

How does it work?

Now that we understand some of the basic vocabulary, we can apply it to learn how an agent operates. How does an agent decide which actions to take in order to maximize the reward it will receive?

There are two main threads we need to dissect: what an RL agent needs, and the sub-elements of an RL system.

Reinforcement Learning Needs  

RL agents must learn to decide what is a good action to take in an environment that is filled with uncertainty. Feedback arrives as the observed change in state and the reward that can be calculated from it, and that reward signal is often time-delayed. The agent must be able to explore this uncertainty and to reason about why a reward was given. To do this, the agent needs to have three simple things: actions, goals, and senses.


Actions are the list of manipulations to the environment that an agent can take at any given moment. By exercising an action, an agent impacts its environment and changes its state. Without being able to do this, an agent can never actively influence the state, receive any interpretable reward from how its actions positively or negatively influenced the environment, or even learn to take better actions in the future.



A list of actions someone might take with an Atari controller.


Goals are how we define the reward signal. Do we reward based on points in a video game? Completion of a level? What are good and bad actions? These are the questions that we must think about when defining a goal in an RL context. This is how we motivate an agent to complete a task.

 A simple setup of a goal. How can one get from start to finish?
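The questions above translate directly into code. The sketch below compares two assumed ways of defining the goal for the same episode: rewarding points scored, and rewarding only level completion. The function names and the toy episode data are hypothetical; they simply show that the same behavior can earn very different reward signals depending on how we define the goal.

```python
def points_reward(score_before, score_after):
    """Reward every point gained during a step."""
    return float(score_after - score_before)


def completion_reward(level_finished):
    """Reward only the step on which the level is completed."""
    return 1.0 if level_finished else 0.0


# A made-up episode: one (score_before, score_after, level_finished) per step.
episode = [
    (0, 100, False),    # stomped an enemy
    (100, 100, False),  # just walked forward
    (100, 300, True),   # grabbed a bonus and finished the level
]

points_total = sum(points_reward(before, after) for before, after, _ in episode)
completion_total = sum(completion_reward(done) for _, _, done in episode)

print(points_total, completion_total)
```

A points-based goal gives the agent frequent feedback, while a completion-based goal stays silent until the very end. Which one motivates the behavior we actually want is exactly the design question this section raises.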


Senses are what an agent uses to observe an environment. In a video game setting, it might be useful to use computer vision techniques to observe objects on the screen and how they change when our agent takes actions. Maybe we use optical character recognition to observe a point value. The point is, if an agent cannot sense an environment, it cannot reason about how its actions affect it. Therefore, we need senses in order to monitor the environment we are interacting with.

Sub-Elements of a Reinforcement Learning System  

Now, we can transition into the sub-elements of an RL system: the policy, the reward signal, the value function, and the optional model of the environment.

The Policy  

A policy is the heart of our RL agent. It is the way our agent behaves given the current state of the environment. It is the actions our agent will take given the state. In biology, we might see a policy as how an organism reacts based on stimuli it receives. Our agent observes the state of the environment and the policy is what it has learned to do. A good policy would result in a positive outcome.

 Our policy will dictate what an agent will do given a state of the environment. We can see here a policy might be that given a certain tile, our agent moves in a certain direction.
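In its simplest form, a policy like this is just a lookup table from state to action. Here is a minimal sketch, assuming a tiny 2x3 grid of tiles with the goal in the bottom-right corner. The grid layout, the `MOVES` table, and the `follow_policy` helper are all hypothetical, and this particular policy is written by hand rather than learned.

```python
# Each action moves the agent by (row, column).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

GOAL = (1, 2)  # assumed goal tile on a 2x3 grid

# The policy: for every non-goal tile, which direction should the agent move?
policy = {
    (0, 0): "right",
    (0, 1): "right",
    (0, 2): "down",
    (1, 0): "right",
    (1, 1): "right",
}


def follow_policy(policy, start, goal, max_steps=20):
    """Follow the policy from `start` and return the list of tiles visited."""
    tile = start
    path = [tile]
    while tile != goal and len(path) <= max_steps:
        d_row, d_col = MOVES[policy[tile]]
        tile = (tile[0] + d_row, tile[1] + d_col)
        path.append(tile)
    return path


path = follow_policy(policy, start=(0, 0), goal=GOAL)
print(path)
```

An RL agent starts with a poor policy and gradually improves the entries in a table like this one (or the weights of a function that plays the same role) based on the rewards it receives.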

The Reward Signal  

The reward signal is how we measure the success of our agent. It is our numerical measure of how well we are succeeding at our goal. A reward signal can be positive or negative, thus allowing our agent to measure if an action was good, bad, or neutral. These can be point values in a video game or whether or not our agent is still alive. The point is that our agent takes in these reward signals, measures how well it is currently performing on the goal, and shapes its policy based on this feedback so that it can alter the environment in ways that maximize the future reward it may receive.

We can think of this as the hidden reward mapping from the previous goal image. Only by exploring the environment can the agent learn that stepping on the goal tile yields a reward of 1!

The Value Function  

We can think of the reward signal as an immediate indicator of whether an action was good or bad. However, reinforcement learning is about more than immediate positive or negative results. It is about long-term planning to maximize success at a task. To model this long-term performance, we introduce a concept called the value function. A value function is an estimate of how much reward our agent can expect to collect in the long run, starting from a given state. This is far harder to estimate and measure than the immediate reward, yet it is one of the most critical components of our RL problem! In an uncertain environment, our agent will constantly revise its estimates of value over many iterations, learning to better shape its policy and the actions it takes over long sequences of actions and states.


A visualization of a value function being shaped by an agent. As it becomes more and more certain about its potential long term reward given its state, it can come up with solutions to this challenge.
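Here is a small sketch of how value estimates can take shape over repeated sweeps. It rests on several assumptions that are not part of the discussion above: a five-tile corridor, an agent that always moves right, a reward of 1 for reaching the last tile, and a discount factor `GAMMA` that makes rewards further in the future count a little less. Each sweep nudges a tile's value toward "the immediate reward plus the discounted value of the next tile".

```python
GAMMA = 0.9        # assumed discount factor for future rewards
N_TILES = 5        # assumed corridor length
GOAL = N_TILES - 1


def step(tile):
    """Deterministic "move right" transition: returns (next_tile, reward)."""
    next_tile = tile + 1
    reward = 1.0 if next_tile == GOAL else 0.0
    return next_tile, reward


def evaluate(sweeps=50):
    """Estimate the value of every tile under the always-move-right policy."""
    values = [0.0] * N_TILES  # the goal is terminal, so its value stays 0
    for _ in range(sweeps):
        for tile in range(GOAL):
            next_tile, reward = step(tile)
            # Value = immediate reward + discounted value of where we land.
            values[tile] = reward + GAMMA * values[next_tile]
    return values


values = evaluate()
print([round(v, 3) for v in values])
```

Notice that only the step onto the goal tile pays an immediate reward, yet every tile ends up with a positive value, and tiles closer to the goal are worth more. That is the difference between the reward signal and the value function.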

The Optional Model of the Environment  

Finally, our RL system may model the environment. I say may because not all RL agents will model an environment. Some agents simply learn by trial and error, constructing a somewhat implicit model of the environment by way of a good value function and policy combination. Other agents may explicitly create an internal model of an environment, allowing the agent to predict the resulting states and rewards of actions before it takes them. This seems like it would be a very good approach, but in highly complex environments it is extremely hard to build such an internal model, so agents often do not opt for this strategy.


As an agent explores an environment, they could build a 3D interpretation of the world around them to help them reason about the actions they might take in the future.
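For a very small and fully predictable environment, an explicit model can be as simple as a table of remembered outcomes. The sketch below is one assumed way to do it: `TabularModel`, its methods, and the state names are all hypothetical, and it assumes that the same action in the same state always leads to the same result. The agent records what happened when it acted, then "imagines" the outcome of a plan without touching the real environment.

```python
class TabularModel:
    """Hypothetical model: remembers the outcome of each (state, action) pair."""

    def __init__(self):
        self.transitions = {}

    def record(self, state, action, next_state, reward):
        """Store what the environment did when the agent acted."""
        self.transitions[(state, action)] = (next_state, reward)

    def predict(self, state, action):
        """Return (next_state, reward), or None if this was never tried."""
        return self.transitions.get((state, action))

    def imagine(self, state, actions):
        """Predict the total reward of a plan, or None if the model has a gap."""
        total = 0.0
        for action in actions:
            outcome = self.predict(state, action)
            if outcome is None:
                return None
            state, reward = outcome
            total += reward
        return total


# Experience gathered from (made-up) earlier exploration.
model = TabularModel()
model.record("start", "right", "ledge", 0.0)
model.record("ledge", "jump", "flag", 1.0)
model.record("ledge", "right", "pit", -1.0)

# Compare two plans inside the model before committing to either one.
print(model.imagine("start", ["right", "jump"]))
print(model.imagine("start", ["right", "right"]))
```

The table grows with every distinct state and action, which hints at why this strategy becomes so hard in complex environments like a full Mario level.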


With these basic concepts, we can begin to see this future, where computer systems learn based on our actions and reactions, tuning specifically to our personalities. Building on the Mario agent from our example above, we can envision futuristic computer systems that read our actions and reactions the way Mario reads his environment. Such a system gets more reward signal the happier it makes us and the quicker we achieve our goals. It is easy to see how this future outcome could be within our reach.

Part 2 and Part 3 available!

All of this taken together gives us a basic overview of how a reinforcement learning system performs and operates.

This high-level primer will be helpful for our part 2 article where we talk about how Reinforcement Learning compares to other types of machine learning and some factors we consider in formulating a Reinforcement Learning problem, and for our part 3 article where we look at some of the recent accomplishments and open research questions in the field of Reinforcement Learning.

Bonus Content

Watch a Mario Game in Action! See if you can identify all the elements that you would need in a reinforcement learning scenario.  
