## Markov decision process

exist At the beginning of the 20th century, mathematician Andrei Markov studied memoryless random processes called Markov chain Such a process has a fixed number of states and randomly evolves from one state to another at each step. The probability that it evolves from state s to state s  ' is fixed and it depends only on the pair ( s  ,   '), not on past states (that's why we say the system has no memory).

Figure 18-7 shows an example of a Markov chain with four states.

Assuming that the process from the state S  starts , and has a 70% chance in the next state is maintained. Eventually it must leave that state and never come back because no other state points to s  If it goes to state s  1 , then it most likely goes to state s  2 (90% probability), and then immediately returns to state s  (100% probability). It may alternate between these two states many times, but will eventually fall into state s  3 and remainexists forever (it's a terminal state ). Markov chains can have very different kinetics, and they are used heavily in thermodynamics, chemistry, statistics, and more.

The Markov decision process was first described by Richard Bellman in the 1950s 12 They are similar to Markov chains with one difference: at each step, the agent can choose one of several possible actions, and the transition probability depends on the chosen action. Also, some state transitions return some reward (positive or negative), and the agent's goal is to find a policy that maximizes the reward over time.

For example, the MDP represented in Figure 18-8 has three states (represented by circles) and up to three possible discrete actions per step (represented by diamonds).

If it starts from state s0  , the agent can choose  between actions a0 , a1  , or  a2 If it chooses action a  , it just stays in state s  deterministically and without any reward. So it can decide to stay there forever if it wants to. But if it chooses action a  0 , it has a 70% probability of getting a +10 reward and staying in state s  It can then try again and again to get as many rewards as possible, but at some point it will eventually change to state s  In state s1 it has only two possible actions: 0 or It can choose to select action a  0 by repeatingto stay put , or choose to move on to state s  and get a negative –50 bonus (ouch). In state s  2 it has no choice but to take action a  1 , which will likely lead it back to state s  with a +40 bonus along the way. you got the picture. By looking at this MDP, can you guess which strategy will yield the most returns over time? In state s  0 it is clear that action a  0 is the best choice, and in state 2 the agent has no choice but to take action 1 , but in state 1 it is not obvious whether the agent should stay in place ( 0 ) or by fire ( 2 ).

BellmanFind the best state value for any state s in an estimated way noting  *( ), which is the sum of all discounted future rewards the agent can expect on average after it reaches the state , assuming it works optimally . He showed that if the agent takes the best action, thenThe Bellman optimality equation applies (see Equation 18-1 ). This recursive equation states that if the agent takes the optimal action, then the optimal value of the current state is equal to the reward it receives on average after taking an optimal action, plus the expected maximum value of all possible next states that the action might lead to Excellent value to.

##### Equation 18-1. Bellman Optimal Equation

${\mathrm{Volt}}^{*}\left(\mathrm{second}\right)=\underset{\mathrm{A sort of}}{maximum}\sum _{\mathrm{second}\text{'}}\mathrm{Ton}\left(\mathrm{second},\mathrm{A sort of},\mathrm{second}\text{'}\right)\left[\mathrm{resistance}\left(\mathrm{second},\mathrm{A sort of},\mathrm{second}\text{'}\right)+\gamma ·{\mathrm{Volt}}^{*}\left(\mathrm{second}\text{'}\right)\right]\phantom{\rule{1.em}{0ex}}\text{for}\phantom{\rule{4.pt}{0ex}}\text{all}\phantom{\rule{4.pt}{0ex}}\mathrm{second}$

In this equation:

•  (   ,   ,   ') is the transition probability from state to state  ' , assuming the agent chooses action a For example, in Figure 18-8 ,  (  s  2  ,  a  1  ,  s  0  ) = 0.8.

•  (   ,   ,   ') is the reward the agent gets when it goes from state s to state  ' , assuming the agent chooses action a For example, in Figure 18-8 ,  (  s  2  ,  a  1  ,  s  0  ) = +40.

• γ is the discount factor.

This equation leads directly to an algorithm that accurately estimates the optimal state value for each possible state: first initialize all state value estimates to zero, then update iterativelyThey use a value iteration algorithm (see Equation 18-2 ). A notable result is that, given enough time, these estimates are guaranteed to converge to the optimal state value corresponding to the optimal policy .

##### Equation 18-2. Value Iteration Algorithm

${\mathrm{Volt}}_{\mathrm{gram}+1}\left(\mathrm{second}\right)←\underset{\mathrm{A sort of}}{maximum}\sum _{\mathrm{second}\text{'}}\mathrm{Ton}\left(\mathrm{second},\mathrm{A sort of},\mathrm{second}\text{'}\right)\left[\mathrm{resistance}\left(\mathrm{second},\mathrm{A sort of},\mathrm{second}\text{'}\right)+\gamma ·{\mathrm{Volt}}_{\mathrm{gram}}\left(\mathrm{second}\text{'}\right)\right]\phantom{\rule{1.em}{0ex}}\text{for}\phantom{\rule{4.pt}{0ex}}\text{all}\phantom{\rule{4.pt}{0ex}}\mathrm{second}$

In this equation, V  k ( ) is the estimated value of the state of s in the iterations of the algorithm .

###### NOTES

thisAn algorithm is an example of dynamic programming , which decomposes a complex problem into tractable sub-problems that can be solved iteratively.

knowingThe optimal state value can be useful, especially when evaluating policies, but it will not give us the best policy for the agent. Fortunately, Bellman found a very similar algorithm for estimating optimal state-action values , commonly referred to as Q-Values The optimal Q-value for a state-action pair (  s  ,   ) , denoted as Q  *(   ,   ), is the sum of discounted future rewards the agent can expect on average after reaching state s and choosing action , but at it Before seeing the result of this action, assume it took the best action after that action.

hereHow it works: Again, you first initialize all Q-value estimates to zero, then update them using the Q-value iterative algorithm (see Equation 18-3 ).

##### Equation 18-3. Q-value iterative algorithm

${\mathrm{ask}}_{\mathrm{gram}+1}\left(\mathrm{second},\mathrm{A sort of}\right)←\sum _{\mathrm{second}\text{'}}\mathrm{Ton}\left(\mathrm{second},\mathrm{A sort of},\mathrm{second}\text{'}\right)\left[\mathrm{resistance}\left(\mathrm{second},\mathrm{A sort of},\mathrm{second}\text{'}\right)+\gamma ·\underset{\mathrm{A sort of}\text{'}}{maximum}\phantom{\rule{0.166667em}{0ex}}{\mathrm{ask}}_{\mathrm{gram}}\left(\mathrm{second}\text{'},\mathrm{A sort of}\text{'}\right)\right]\phantom{\rule{1.em}{0ex}}\text{for}\phantom{\rule{4.pt}{0ex}}\text{all}\phantom{\rule{4.pt}{0ex}}\left(\mathrm{second},\mathrm{A sort of}\right)$

Once you have the optimal Q-value, defining the optimal policy, denoted by π  *(   ), is trivial: when the agent is in state , it should choose the action with the highest Q-value for that state :${\mathrm{pi}}^{*}\left(\mathrm{second}\right)=\underset{\mathrm{A sort of}}{Maximum parameter} {\mathrm{ask}}^{*}\left(\mathrm{second},\mathrm{A sort of}\right)$.

Let's apply this algorithm to the MDP represented  in Figure 18-8 First, we need to define the MDP:

transition_probabilities = [ # shape=[s, a, s']        [[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],        [[0.0, 1.0, 0.0], None, [0.0, 0.0, 1.0]],        [None, [0.8, 0.1, 0.1], None]]rewards = [ # shape=[s, a, s']        [[+10, 0, 0], [0, 0, 0], [0, 0, 0]],        [[0, 0, 0], [0, 0, 0], [0, 0, -50]],        [[0, 0, 0], [+40, 0, 0], [0, 0, 0]]]possible_actions = [[0, 1, 2], [0, 2], ]

For example, to know the probability of transition from s  2 to s  after taking action a  1 , we would look for (i.e. 0.8). Again, to get the corresponding reward, we look up (ie +40). To get a list of possible actions in s2  , we will look up ( in this case only action a1  is possible). Next, we have to initialize all Q-values to 0 (except for the impossible operation, where we set Q-values to –∞):transition_probabilitiesrewardspossible_actions

Q_values = np.full((3, 3), -np.inf) # -np.inf for impossible actionsfor state, actions in enumerate(possible_actions):    Q_values[state, actions] = 0.0  # for all possible actions

Now let's run the Q-value iterative algorithm. It applies Equation 18-3 repeatedly to all Q values, for each state and each possible action:

gamma = 0.90 # the discount factorfor iteration in range(50):    Q_prev = Q_values.copy()    for s in range(3):        for a in possible_actions[s]:            Q_values[s, a] = np.sum([                    transition_probabilities[s][a][sp]                    * (rewards[s][a][sp] + gamma * np.max(Q_prev[sp]))                for sp in range(3)])

That's it! The resulting Q-values are as follows:

>>> Q_valuesarray([[18.91891892, 17.02702702, 13.62162162],       [ 0.        ,        -inf, -4.87971488],       [       -inf, 50.13365013,        -inf]])

For example, when the agent is in state s  0 and it chooses action a  1 , the expected sum of discounted future rewards is about 17.0.

For each state, let's look at the action with the highest Q value:

>>> np.argmax(Q_values, axis=1) # optimal action for each statearray([0, 0, 1])

When a discount factor of 0.90, which provides us with the best strategy for this MDP: state s  0 of the selected operation A  state s  in a select operation of A  (i.e. stagnant); and state Action a  (the only possible action) is selected in s  Interestingly, if we increase the discount coefficient of 0.95, the optimal policy changes: the state of the trumpet 1 becomes the best action (through the fire!). This makes sense, because the more you value future rewards, the more willing you are to endure some pain now for future happiness.

#### Technical otaku

Sought technology together

#### Related Topic

##### Lombok uses @With annotation 