As mentioned earlier, an MDP consists of a set of states, a set of actions available in each state, a state transition probability function \(T\left({s}^{\mathrm{^{\prime}}}|s,a\right)\), a reward function \(R(s)\), and a discount factor \(\gamma\) that is a number between zero and one. The discount factor expresses the preference for current rewards over future rewards [16]. Consequently, the MDP model is formulated as:

$$\mathrm{MDP}=\left\{S,A,T\left({s}^{\mathrm{^{\prime}}}|s,a\right),R\left(s\right),\gamma \right\}$$

(1)

where \(S\) is the set of states, \(A\) is the set of all possible actions, and \(T\left({s}^{\mathrm{^{\prime}}}|s,a\right)\) is the transition probability function expressing the probability of reaching state \({s}^{\mathrm{^{\prime}}}\in S\) when action \(a\in A\) is taken in state \(s\in S\). \(R\left(s\right)\) is the reward function that defines the immediate reward for a state \(s\in S\). Each ingredient of the model for stochastic control of the MPPT of a PV energy system is presented below.
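The tuple in (1) can be sketched as a small container type. This is an illustrative skeleton only; the field names and the callable signatures for \(T\) and \(R\) are assumptions, not part of the paper.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Minimal sketch of the MDP tuple in Eq. (1): {S, A, T(s'|s,a), R(s), gamma}.
@dataclass
class MDP:
    states: Sequence                 # S: set of states
    actions: Sequence                # A: set of actions
    transition: Callable             # T(s_next, s, a) -> probability
    reward: Callable                 # R(s) -> immediate reward
    gamma: float                     # discount factor in (0, 1)

    def __post_init__(self):
        # The discount factor must lie strictly between zero and one.
        assert 0.0 < self.gamma < 1.0, "gamma must be in (0, 1)"
```

A concrete solver (value iteration, policy iteration) would then operate on an instance of this type.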

### 3.1 States

The state space is the collection of information relevant to the MPPT. Each state variable has its own domain, and the Cartesian product of the domains of all state variables constitutes the state space. After careful deliberation and a literature survey, the states and state variables are defined as follows:

$$\begin{gathered} S = \left\{ {s_{1} ,s_{2} , \ldots ,s_{N} } \right\} \hfill \\ s_{i} = \left\{ {e_{i} ,Irr_{i} ,V_{i} ,V_{i}^{*} ,\alpha _{i} } \right\} \hfill \\ \alpha _{i} \in \left\{ {0,1, \ldots ,m} \right\} \hfill \\ V_{i} ,V_{i}^{*} \in \left\{ {V_{{min}} ,V_{{min}} + \delta _{1} ,V_{{min}} + 2\delta _{1} , \ldots ,V_{{max}} } \right\} \hfill \\ Irr_{i} = \left\{ {Irr_{{i,1}} ,Irr_{{i,2}} , \ldots ,Irr_{{i,q}} } \right\},\; q \in \mathbb{N} \hfill \\ Irr_{{i,j}} \in \left\{ {Irr_{{min}} ,Irr_{{min}} + \delta _{2} ,Irr_{{min}} + 2\delta _{2} , \ldots ,Irr_{{max}} } \right\} \hfill \\ e_{i} \in \left\{ {e_{{min}} ,e_{{min}} + \delta _{3} ,e_{{min}} + 2\delta _{3} , \ldots ,e_{{max}} } \right\} \hfill \\ \end{gathered}$$

(2)

In (2), there are a total of \(N\) states, and each state carries five variables and hence five types of information. Variable \({e}_{i}\) is the error between the maximum expected power at a given voltage and the actual power received at that voltage; the maximum expected power at each voltage is assumed known. The error variable ranges from \({e}_{min}\) to \({e}_{max}\), and the interval between any two consecutive values of \({e}_{i}\) is \({\delta }_{3}\), a positive constant chosen to suit the problem: a small \({\delta }_{3}\) results in a large state space, while a large \({\delta }_{3}\) results in a small one. \(Irr_{i}\) is the irradiance vector containing the irradiance of every panel in state \(i\). Each panel may have a different irradiance, defined in (2) as \(Irr_{i,j}\), where \(j\) ranges from 1 to \(q\) and \(q\) is the total number of solar panels in the system. \({V}_{i}\) is the current PV output voltage determined by the selection of the duty cycle, while \({V}_{i}^{*}\) is the voltage, among those visited in the last \(m\) actions, that achieved the minimum value of \(e\). Variable \({\alpha }_{i}\) counts the number of actions executed while searching for the MPP (\({\alpha }_{i}\) ranges from 0 to \(m\)).

With the above selection of state space, it is possible to determine whether the current voltage is achieving maximum expected power or not, while the irradiation level at each panel (or group of panels) is also determined. In practice, irradiation measurement sensors are required to provide such information. In addition, information about the best possible voltage (\({V}^{*}\)) within the last \(m\) voltage values that yields lowest error is also obtained. This information is incorporated in the state space because, in practice, the irradiation measurement may not be accurate or there may be some anomaly causing the MPP to shift from the expected value. Under such circumstances, it would be impossible to make the error (\(e\)) equal to zero. Therefore, \({V}^{*}\) is the voltage that may be adopted by the system if the error remains nonzero for more than *m* consecutive steps (or decision epochs). Once the error becomes zero, the value of \({\alpha }_{i}\) is reset to zero.
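The discretized domains in (2) can be built directly as uniform grids. All numeric bounds and step sizes below (\(V_{min}\), \(V_{max}\), \(\delta_1\), etc.), as well as the choices of \(m\) and \(q\), are example values for illustration, not taken from the paper.

```python
import numpy as np

def grid(lo, hi, step):
    """Uniform grid lo, lo+step, lo+2*step, ..., hi, as in Eq. (2)."""
    return np.arange(lo, hi + step / 2, step)

# Example domains for the state variables (all values illustrative):
V_grid   = grid(10.0, 30.0, 0.5)        # domain shared by V_i and V_i*
Irr_grid = grid(100.0, 1000.0, 100.0)   # per-panel irradiance levels
e_grid   = grid(-50.0, 50.0, 5.0)       # power error, step delta_3 = 5

# One state s_i = {e_i, Irr_i, V_i, V_i*, alpha_i}; alpha counts 0..m.
m, q = 5, 3                              # m search actions, q panels
state = {"e": float(e_grid[10]),
         "Irr": Irr_grid[:q].tolist(),   # irradiance vector, one entry per panel
         "V": float(V_grid[0]),
         "V_star": float(V_grid[0]),
         "alpha": 0}
```

Shrinking the step sizes refines the grids and enlarges the state space, matching the trade-off noted for \(\delta_3\) above.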

### 3.2 Actions

Actions refer to all the decisions available in each state for the MPPT of the PV system. Based on the selected state variables, the actions consist of selecting a voltage *V* for the PV system. If no change in the current value of *V* is desirable, a no-operation (*NOOP*) can be selected as the action. This typically happens when the value of *e* is either zero or smaller than a threshold.

$$A=\left\{{a}_{1},{a}_{2},\dots ,{a}_{x},NOOP\right\},x\in {\mathbb{N}}$$

(3)

Each action in (3) (except for *NOOP*) corresponds to a particular value of \(V\) (assuming there are \(x\) possible values for \(V\)). Once a value of \(V\) is selected, the duty cycle is then adjusted accordingly to achieve the value of \(V\). Note that the subscript \(i\) is used with \(V\) when it refers to the value of the voltage in a particular state \({s}_{i}\). Otherwise, the subscript is not used.
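The action set in (3) can be sketched as a mapping from action indices to voltage levels, with *NOOP* leaving the current voltage unchanged. The particular voltage levels and the value of \(x\) are illustrative assumptions.

```python
# Example action set per Eq. (3): x voltage-selection actions plus NOOP.
V_levels = [12.0, 14.0, 16.0, 18.0, 20.0]   # x = 5 example voltage levels
NOOP = "NOOP"
actions = list(range(len(V_levels))) + [NOOP]

def apply_action(a, current_V):
    """Return the voltage selected by action a; NOOP keeps the current voltage.
    In the real system, the duty cycle would then be adjusted to realize V."""
    return current_V if a == NOOP else V_levels[a]
```

After a voltage is selected, the converter's duty cycle is adjusted to realize it, as described above.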

### 3.3 Reward function

The reward function is a measure of how good or bad a state is, and can be regarded as the negative of a cost function. In the MPPT problem, it is preferable to drive the error to zero, so the reward must decrease as the error grows. In addition, it is desirable to make the value of \(V\) equal to \({V}^{*}\), especially when it is not possible to reduce the error to zero. Therefore, a reward for making \(V\) equal to \({V}^{*}\) is required, and it should be inversely proportional to the value of \({\alpha }_{i}\).

Based on the above discussion, the reward function is given as:

$$\begin{gathered} R\left( {s_{i} } \right) = r_{1} (s_{i} ) + r_{2} (s_{i} ) \hfill \\ r_{1} (s_{i} ) = \left\{ {\begin{array}{ll} {\left( {\lambda _{1} - e_{i} } \right) - \beta } & {\text{if } e_{i} < threshold} \\ {\left( {\lambda _{2} - e_{i} } \right)} & {\text{if } e_{i} \ge threshold} \\ \end{array} } \right. \hfill \\ r_{2} (s_{i} ) = \left\{ {\begin{array}{ll} {\left( {\dfrac{1}{{1 + \alpha _{i} }}} \right)} & {\text{if } V_{i} = V_{i}^{*} } \\ 0 & {\text{otherwise}} \\ \end{array} } \right. \hfill \\ \end{gathered}$$

(4)

In (4), the first term (\({r}_{1}\)) enforces that the delivered power should equal the maximum expected power: away from the MPP this term carries a penalty, while at the MPP \(e\) is zero or close to zero and a high reward is obtained. Note that \(\beta\), \({\lambda }_{1}\), and \({\lambda }_{2}\) are positive constants to be selected by the user. The second term (\({r}_{2}\)) in (4) encourages the system to operate at the best possible voltage and discourages unnecessary further actions: if the system is not operating at the best possible voltage, this term is zero, so further actions taken to reach it incur no reduction in reward. In the second term, \({\alpha }_{i}\) is the number of actions executed since the last time \({V}_{i}={V}_{i}^{*}\).
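The reward in (4) translates directly into a function. The constants \(\lambda_1\), \(\lambda_2\), \(\beta\), and the threshold are user-chosen positives; the default values below are illustrative only.

```python
def reward(e_i, V_i, V_star, alpha_i,
           lam1=10.0, lam2=5.0, beta=1.0, threshold=2.0):
    """Reward R(s_i) = r1(s_i) + r2(s_i) per Eq. (4); constants illustrative."""
    # r1: high when the error is below the threshold, reduced otherwise.
    if e_i < threshold:
        r1 = (lam1 - e_i) - beta
    else:
        r1 = lam2 - e_i
    # r2: bonus for sitting at the best recent voltage, decaying with alpha_i.
    r2 = 1.0 / (1.0 + alpha_i) if V_i == V_star else 0.0
    return r1 + r2
```

For example, at zero error with \(V_i = V_i^*\) and \(\alpha_i = 0\), the reward is \((\lambda_1 - 0) - \beta + 1\), the largest attainable value under these constants.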

### 3.4 Transition probabilities

When modeling real-world decision problems in the MDP framework, it is often impossible to obtain a completely accurate estimate of the transition probabilities. The MPPT problem consists of multiple states, and each state is an assignment to multiple state variables, so a joint probability distribution is involved in determining the full state transition mapping. In this regard, a Bayesian Network (Bayes Net) is used, which helps simplify the probabilistic representation and can capture uncertain knowledge in a natural and efficient way. Independence and conditional independence relationships among variables in a Bayes Net can greatly reduce the number of probabilities that need to be specified to define the full joint distribution. A Bayes Net represents dependencies among variables as a directed graph in which each node is annotated with quantitative probability information [10]. The Bayes Net of the proposed stochastic control for MPPT of a PV energy system is shown in Fig. 4.

The only random variable is the error (\(e\)) in the power; it is random because shading is random. Its value depends upon the irradiance received by the panels and upon its own previous value.

The expression for the state transition probability is presented as:

$$T\left({s}_{j}|{s}_{i},{a}_{k}\right)=P\left({e}_{j}|{e}_{i},{{I}_{rr}}_{i,1},{{I}_{rr}}_{i,2},\dots ,{{I}_{rr}}_{i,q},{a}_{k}\right)$$

(5)

where \({e}_{j}\) is the value of the error in state \({s}_{j}\) and \(Irr_{i,*}\) are the values of irradiation in state \({s}_{i}\). Note that besides the stochastic transitions, there are also deterministic transitions in the state variables. For example, the values of \({\alpha }_{i}\) and \({V}_{i}\) change deterministically depending upon which action is taken. Also, the value of \({V}^{*}\) either remains the same or becomes equal to the value of \({V}_{j}\), depending upon the value of \({e}_{j}\).
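The deterministic part of the transition can be sketched as a state update. The exact bookkeeping (capping \(\alpha\) at \(m\), tracking the best error seen via a `best_e` field) is an assumption about implementation details the paper leaves open.

```python
def deterministic_update(state, new_V, new_e, m):
    """Deterministic state-variable transitions: V follows the chosen action,
    alpha counts actions since the error was last zero (capped at m), and
    V* is updated when the new error improves on the best seen so far.
    The `best_e` field is an assumed bookkeeping variable."""
    s = dict(state)
    s["V"] = new_V
    if new_e == 0:
        s["alpha"] = 0                       # error eliminated: reset counter
    else:
        s["alpha"] = min(s["alpha"] + 1, m)  # one more search action taken
    if abs(new_e) < abs(s["best_e"]):
        s["V_star"], s["best_e"] = new_V, new_e   # new best voltage found
    s["e"] = new_e
    return s
```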

To facilitate the calculation of the transition probabilities, a function-based approach is proposed. The intuition behind it is that the error (\({e}_{i}\)) can either increase, decrease, or stay the same. Therefore, at each state, three probability values need to be characterized, i.e., the probabilities of an increase, a decrease, and no change in \({e}_{i}\). The functional form can be written as:

$$P\left({e}_{j}={e}_{i}+{\delta }_{3}|{e}_{i},{{I}_{rr}}_{i,1},{{I}_{rr}}_{i,2},\dots ,{{I}_{rr}}_{i,q},{a}_{k}\right)=\left\{\begin{array}{cc}0& if {e}_{i}={e}_{max}\\ {p}_{1,k}& otherwise\end{array}\right.$$

(6)

Similarly,

$$P\left({e}_{j}={e}_{i}-{\delta }_{3}|{e}_{i},{{I}_{rr}}_{i,1},{{I}_{rr}}_{i,2},\dots ,{{I}_{rr}}_{i,q},{a}_{k}\right)=\left\{\begin{array}{cc}0& if {e}_{i}={e}_{min}\\ {p}_{2,k}& otherwise\end{array}\right.$$

(7)

Also,

$$P\left({e}_{j}={e}_{i}|{e}_{i},{{I}_{rr}}_{i,1},{{I}_{rr}}_{i,2},\dots ,{{I}_{rr}}_{i,q},{a}_{k}\right)=\left\{\begin{array}{ll}1-{p}_{1,k}-{p}_{2,k}& if\; {e}_{min}<{e}_{i}<{e}_{max}\\ 1-{p}_{1,k}& if\; {e}_{i}={e}_{min}\\ 1-{p}_{2,k}& if\; {e}_{i}={e}_{max}\end{array}\right.$$

(8)

Here \({p}_{1,k},{p}_{2,k}\in \left(0,1\right]\) are the probability values that depend upon the action to be executed. For example, the probability of an increase in error would be large if an action results in \({V}_{j}\) being far away from \({V}_{i}^{*}\). Note that the above equations account only for a one-unit change in the error; the probabilities for changes of two or more units can be defined in a similar manner.
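Equations (6)–(8) together define a proper distribution over the one-unit change in the error. The sketch below returns that distribution for a given action's \(p_{1,k}\) and \(p_{2,k}\), handling the boundary cases at \(e_{min}\) and \(e_{max}\) as in (8); the numeric values used are illustrative.

```python
def error_transition_probs(e_i, e_min, e_max, p1, p2):
    """Distribution over the next-step change in error (in units of delta_3):
    +1 with prob p1 per Eq. (6), -1 with prob p2 per Eq. (7), and 0 with the
    remaining mass per Eq. (8). Boundary states drop the impossible move."""
    probs = {}
    if e_i < e_max:
        probs[+1] = p1                        # one-unit increase, Eq. (6)
    if e_i > e_min:
        probs[-1] = p2                        # one-unit decrease, Eq. (7)
    probs[0] = 1.0 - sum(probs.values())      # no change, Eq. (8)
    return probs
```

At an interior state the three cases sum to one; at \(e_i = e_{min}\) only the increase and no-change cases remain, which recovers the second branch of (8).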

### 3.5 Discount factor

The discount factor (\(\gamma\)) is a number between zero and one indicating the depreciation in the value of reward with respect to the decision epochs. If \(\gamma\) is close to zero, only the immediate rewards have significant value and the rewards in the distant future have insignificant value. On the other hand, if \(\gamma\) is selected to be close to one, the distant rewards have almost the same value as that of the immediate rewards. For the MPPT problem, the discount factor is selected to be close to one so that the resulting optimal control policy is ‘far sighted’, i.e., distant future states are given significant value (and not just the near future states) while calculating optimal action.
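The effect of \(\gamma\) on far-sightedness can be made concrete: with a constant reward of 1 per epoch, the discounted return is \(\sum_t \gamma^t = 1/(1-\gamma)\), so a \(\gamma\) near one weights distant epochs almost as heavily as immediate ones. The snippet below is purely illustrative.

```python
def discounted_return(gamma, horizon=10_000):
    """Discounted sum of a constant unit reward over `horizon` epochs.
    For gamma < 1 and a long horizon this approaches 1 / (1 - gamma)."""
    return sum(gamma ** t for t in range(horizon))

# A far-sighted gamma (e.g. 0.99) yields a much larger return than a
# myopic gamma (e.g. 0.5), reflecting the weight given to future states.
```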