Solving MDPs with Reinforcement Learning

Robin Jackson
10 min read · Jul 20, 2021

In this article we will work through an example of solving a Markov Decision Process (MDP). MDPs are suitable models for many real-world problems. We will cover the basics first, then dive into some of the practical steps involved in solving an MDP.

A Markov Decision Process (MDP) is a mathematical model of the environment.

Reinforcement learning (RL) is a trial-and-error approach to decision-making. A key concept in RL is the reward, a signal that tells the agent (the learner in reinforcement learning) how good its actions are in the current state of the environment.

The agent receives a reward when it does something well, for example after completing a task; similarly, you might be rewarded for getting good grades. Based on the rewards it observes, the agent chooses actions in the environment that maximise the reward it expects to collect.

The agent has to decide whether to explore the environment (try new options) or to exploit it (stick with options that are already known to work). Exploring is risky, as it may yield little or no reward.

Exploiting an option is safer, as it is already known to be productive. An MDP is a model of how an agent interacts with its environment: the environment is described by the MDP, and the agent acts within it. An MDP has a state space, which describes the possible configurations of the environment, and the agent follows a policy, a rule for choosing an action in each state of the MDP.

The agent's actions determine, together with the environment's dynamics, the next state of the MDP. A transition from one state to another therefore depends on the policy, which determines the next action to take. This process is called action selection, and it is performed by the agent.
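To make this loop concrete, here is a minimal sketch of the agent-environment interaction in Python. The `policy` and `env_step` callables are placeholders I am assuming for illustration; they are not part of any particular library.

```python
# A minimal agent-environment interaction loop (illustrative sketch).
# `policy` maps a state to an action; `env_step` returns (next_state, reward, done).
def run_episode(env_step, policy, start_state, max_steps=100):
    state = start_state
    trajectory = []
    for _ in range(max_steps):
        action = policy(state)                               # action selection by the agent
        next_state, reward, done = env_step(state, action)   # environment transition
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory
```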

The MDP builds on the concept of a state in a Markov chain or Markov process, where the next state is drawn from a probability distribution over the set of states. Formally, an infinite-horizon Markov decision process is a tuple (S, A, P, R, γ) of states, actions, transition probabilities, rewards and a discount factor.
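As a rough sketch, the components of such a tuple could be stored like this in Python; the field names and example labels are my own, chosen purely for illustration.

```python
from dataclasses import dataclass

# Illustrative container for a finite MDP (S, A, P, R, gamma).
@dataclass
class MDP:
    states: list          # e.g. ["s0", "s1", "s2", "s3"]
    actions: list         # e.g. ["stay", "move"]
    transitions: dict     # (state, action) -> {next_state: probability}
    rewards: dict         # (state, action, next_state) -> reward
    gamma: float = 0.99   # discount factor, 0 <= gamma < 1
```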

The Problem

I want to solve a problem whose exact details I don't know; I only know what happens in each state and the probability of transitioning from one state to another. I also know that I should act so as to reach my goal. To solve this problem I need to learn two things. First, I should find out what to do in each state, provided the cost of my actions is not too high.

Gathering some experience is necessary so that we can find out the minimum cost of reaching the goal. Second, I should learn what to do in each state based on that previous experience. This means building a model of my environment: for every state, I need to know what to do in that state and how to reach the goal from it. A good example is a robot trying to complete a task.

A reward is given when the goal has been reached, but in some cases there is no reward, which means the robot still has to act in the environment in order to finish the task. We will use an example whose exact details we cannot know, but we are sure that some states carry rewards, and where there is no reward we will impose a penalty. In this example we will assume a cost of 0.01 J for being in a state with a reward and 0.5 J for being in a state without a reward, and we will assume a state space of four states.

If we are in a state without a reward, we would like to reach the goal. We can only reach the goal from a state with a reward, and in that case we would like the reward to be greater than 0.5 J; we treat the 0.5 J cost as a penalty to overcome. If we are not in a state with a reward, we want to reach a state with a reward in as few steps as possible.
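The text does not pin down the transition structure of the four states, so the encoding below is a hypothetical one: a simple chain with a single "move" action, using the 0.01 J and 0.5 J costs above as negative rewards.

```python
# Hypothetical four-state chain, purely for illustration.
states = ["s0", "s1", "s2", "goal"]       # "goal" is the state that carries the reward
actions = ["move"]

# Deterministic chain: each state moves to the next one; "goal" is absorbing.
transitions = {
    ("s0", "move"): {"s1": 1.0},
    ("s1", "move"): {"s2": 1.0},
    ("s2", "move"): {"goal": 1.0},
}

# Step costs from the text, expressed as negative rewards:
# 0.5 J in states without a reward, 0.01 J when we land in the rewarding state.
def reward(state, action, next_state):
    return -0.01 if next_state == "goal" else -0.5
```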

Calculating the value function

To solve this problem, we need a value function that tells us the expected value of each state. In this case the value function is the expected (discounted) sum of rewards collected from that state onwards.
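As a small sketch, the discounted sum of rewards along one trajectory can be computed like this; the reward sequence in the example call is made up.

```python
# G = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# Example with an arbitrary reward sequence:
print(discounted_return([-0.5, -0.5, -0.01], gamma=0.99))   # about -1.005
```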

If there is a state with a reward greater than 0.5 J, then the value function of that state will be greater than 0.5 J. If, on the other hand, we are in a state without a reward, we incur a penalty of 0.01 J in the first state, then a penalty of 0.5 J in the second state, and then we stop, since we would not be able to improve any further.

The two cases to distinguish are a state whose reward is less than 0.5 J and one whose reward is greater than 0.5 J. In this example the expected value of a state comes out to 0.01 J, which means that we can reach a reward of 0.5 J if we enter a state with a reward of 0.5 J.

Finishing a task

The only state we need to consider is the state with a reward of 0.5 J, since we can finish the task once we are in such a state. If we simulate the game for more than 100 iterations, we will find that this is the state we end up in for most of the iterations.
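One rough way to check this is to roll out the chain for a number of steps and count how often each state is visited. The sketch below reuses the hypothetical `transitions` dictionary from earlier and treats states with no outgoing transitions as absorbing.

```python
import random
from collections import Counter

def visit_counts(transitions, start="s0", n_steps=200, seed=0):
    """Roll out the single-action chain and count visits per state."""
    rng = random.Random(seed)
    counts = Counter()
    state = start
    for _ in range(n_steps):
        counts[state] += 1
        dist = transitions.get((state, "move"))
        if dist:   # sample the next state; otherwise the state is absorbing
            state = rng.choices(list(dist.keys()), weights=list(dist.values()))[0]
    return counts

# With the chain above, the absorbing "goal" state dominates the visit counts.
```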

To solve the problem, we need to calculate the value function and, from this value function, we need to know what action would get us to the goal state.

As we have seen, the goal state has a reward of 0.5 J, and the value of the neighbouring state follows from the definition of the value function. When we calculate the value function, we have to remember to apply the discount factor γ to future rewards. Since the goal state is just one iteration away, the discounting has little effect here.
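The backup this calculation relies on is usually written as the Bellman optimality equation. Below is a generic value-iteration sketch over a finite MDP; this is the standard textbook procedure rather than anything specific to this article.

```python
# Value iteration with the Bellman optimality backup:
#   V(s) <- max_a  sum_{s'} P(s'|s,a) * ( R(s,a,s') + gamma * V(s') )
def value_iteration(states, actions, transitions, reward, gamma=0.99, tol=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            backups = []
            for a in actions:
                dist = transitions.get((s, a), {})
                if not dist:
                    continue   # no outgoing transitions: treat the state as terminal
                q = sum(p * (reward(s, a, s2) + gamma * V[s2]) for s2, p in dist.items())
                backups.append(q)
            if backups:
                new_v = max(backups)
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
        if delta < tol:
            return V
```

On the hypothetical chain from earlier, this gives roughly V = {goal: 0, s2: -0.01, s1: -0.51, s0: -1.0}: the further a state is from the rewarding one, the lower its value.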

Solving the MDP

Finally, let's calculate the value of the action we need to take, which in this case is to move to the goal state. Since we are in a state with a reward of 0.5 J and the goal state also has a reward of 0.5 J, moving to the goal state yields a total of 1 J, so the value of this action is 1 J. Once we have the value of this action, we can calculate the expected value of the states that follow, and that tells us which action is best to perform.
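The value of a specific action in a state is usually called a Q-value; a minimal sketch, given a value function V computed as above:

```python
# Q(s, a) = sum_{s'} P(s'|s,a) * ( R(s,a,s') + gamma * V(s') )
def q_value(state, action, transitions, reward, V, gamma=0.99):
    dist = transitions.get((state, action), {})
    return sum(p * (reward(state, action, s2) + gamma * V[s2]) for s2, p in dist.items())

# The best action in a state is the one with the highest Q-value, e.g.:
# best = max(actions, key=lambda a: q_value(s, a, transitions, reward, V))
```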

When we calculate the value of the state with a reward of 1 J, we need to check whether the discount factor is lower than 0.99. If it is lower, the value comes out to about 1, which means that we should take the action that gets us to the goal state.

If the discount factor is higher, then we need to calculate the probability of reaching the goal state after taking this action and, from that, the expected return. From there we can estimate the number of steps needed to reach the goal state.

After the game has finished, we can compute the (expected) return for reaching the goal state, which is the average of all the values we would obtain by taking an action. Since we are in a discounted setting and we keep playing until we reach the goal state, the contribution of the remaining steps goes to zero, which means that the expected return for the first steps will be about 1.

For the following steps, it will be reduced, as we are getting closer to the goal state. At some point, we will have a very small value because we are so close to the goal state.
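Averaging returns over many simulated games is a Monte Carlo estimate. Here is a sketch under the same assumed chain and single "move" action as before:

```python
import random

def estimate_return(transitions, reward, start="s0", gamma=0.99,
                    n_episodes=1000, max_steps=100, seed=0):
    """Monte Carlo estimate of the expected discounted return from `start`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_episodes):
        state, g, discount = start, 0.0, 1.0
        for _ in range(max_steps):
            dist = transitions.get((state, "move"))
            if not dist:          # absorbing goal state: episode over
                break
            nxt = rng.choices(list(dist.keys()), weights=list(dist.values()))[0]
            g += discount * reward(state, "move", nxt)
            discount *= gamma
            state = nxt
        total += g
    return total / n_episodes
```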

So, for the final discounted value we sum the discounted rewards from step 0 to infinity; for a constant reward r per step this gives the geometric series G = r + γ·r + γ²·r + … = r / (1 − γ). The infinite tail is the same for every step, so the final value does not depend on the number of steps already taken.

This is the final value we get when the discount is applied at every step. If we write the same sum in the standard, undiscounted setting, with a discount factor of 1, we can see that the value would be infinite.
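The sum the text refers to is a geometric series; a quick numeric check, assuming a constant reward per step:

```python
# For a constant reward r per step, the infinite discounted sum is
#   r + gamma*r + gamma^2*r + ... = r / (1 - gamma),   which only converges for gamma < 1.
def infinite_horizon_value(r, gamma):
    if gamma >= 1.0:
        return float("inf")   # with gamma = 1 the sum of a constant positive reward diverges
    return r / (1.0 - gamma)

print(infinite_horizon_value(1.0, 0.99))   # about 100
print(infinite_horizon_value(1.0, 0.9))    # about 10  (smaller gamma -> smaller value)
print(infinite_horizon_value(1.0, 1.0))    # inf
```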

So, if we used a discount factor of 1 we would be playing the game forever and the value would grow without bound. If we decrease the discount factor, the final value shrinks. Clearly, then, we get a different value depending on the discount factor.

A lower discount factor gives a smaller value, as it shrinks future rewards more quickly. The discount factor is important because it is the weight we apply in the Bellman equations to the value at the next step. With a reward of 1 for taking one step the immediate contribution is 1, and with a reward of 0.99 it is 0.99. The factor 1 / (1 − gamma) is what appears when a constant reward is summed over an infinite horizon, as in the geometric series above.

What about the gamma value?

The gamma value is important because it models diminishing returns: rewards further in the future count for less. A common choice in the standard setting is about 0.995, and that is the value we will use here. Now, how can we come up with the value in the episodic setting?

Let us consider the same example, where the agent goes from state S0 to S5, S4, S3, S2, S1 and back to S0, taking action A1 or A3. There is one very important difference: there is a time delay. When we reach state S5, the agent takes some time to complete the task and then returns to the starting state. In the standard setting, the rewards are 1 for going from S0 to S5, 0.9 from S5 to S4, 0.7 from S4 to S3, 0.4 from S3 to S2, 0.3 from S2 to S1, 0.1 from S1 to S0, and finally 1 again from S0 to S5.
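Before turning to the delay, here is a quick worked example of the discounted return of that standard-setting reward sequence over one full cycle; γ = 0.995 follows the value quoted earlier, and the ordering of the rewards is the one listed above.

```python
# Rewards along one cycle S0 -> S5 -> S4 -> S3 -> S2 -> S1 -> S0, as listed above.
rewards_along_trajectory = [1.0, 0.9, 0.7, 0.4, 0.3, 0.1]
gamma = 0.995

g = sum((gamma ** t) * r for t, r in enumerate(rewards_along_trajectory))
print(round(g, 3))   # about 3.374 with these numbers
```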

However, now that we have the time delay, the values we get in each state are 0.99, 0.79, 0.4, 0.1, 0.1 and 1 respectively. With the time delay we receive a reward of 1 when we reach the goal and a reward of 0 while we delay, and the maximum reward is obtained when we delay for a time that corresponds to the reward we would have received without the time delay.

Thus, the immediate reward when we take the action A3 is 0.1 (if we didn’t have a time delay), and the immediate reward of taking action A1 is 0.9. Therefore, the total reward that we get from taking action A1 is 0.9 and the total reward that we get from taking action A3 is 0.1. In the standard setting, the expected return that we get from going from the starting state is 1, because when we reach the goal, we get a reward of 1 and when we delay, we get a reward of 0.99.

However, when we have a time delay, the expected return is 1 when we delay and 0 when we act immediately. The expected return of taking action A3 is then 1, while the expected return of taking action A1 is 0.5, so the overall expected return of taking action A1 is 0.75, which is lower than that of taking action A3.

Therefore, when we have a time delay, it is better to take the action that leads to the larger reward. If we have a time delay and we have already taken A3, then the expected return is 1 if we take action A4 and 0.99 if we delay. In that case it is better to take the action, even though we have already acted, because the expected return after taking A4 is higher than the expected return after delaying. If we have a time delay and we have not taken an action yet, the question becomes what to do next. Obviously, if the action is available and its expected return is higher than 1, then it is better to take the action, even if the reward after taking it is not greater than what we would get from delaying.

However, if the action is not available, then the question is not as straightforward. This is due to the fact that the expected return from taking the action is less than the expected return from delaying.

Conclusion

In conclusion, we have shown that a time delay can increase the expected utility. If the reward is non-informative, then delaying is optimal, because it increases the expected utility.

When the reward is informative, we have shown that a time delay is sub-optimal whenever the expected utility of delaying is less than that of taking the action; in that case the delay should be avoided, and otherwise delaying remains worthwhile. This was shown by examining the optimal actions and the values of the expected return when the reward is informative.

For more in-depth information about reinforcement learning and MDPs, see the Stanford CS229 course lecture series on this topic.

Thanks for reading.
