Extreme Learning Machine (ELM)
ELM is used here to predict wind power. It tends to provide good generalization performance at an extremely fast learning speed, both in theory and in practical applications. ELM has the following advantages:

1)
The parameters of ELM are easy to set: ELM can achieve good performance with only a suitable configuration of the hidden layer.

2)
ELM is computationally efficient: it does not need as many iterations as a conventional Neural Network (NN), and it avoids the complexity of a Support Vector Machine (SVM), which requires solving a quadratic optimization problem.

3)
ELM has good generalization performance. Experimental results show that ELM achieves good generalization performance in most cases and learns faster than conventionally trained feedforward neural networks [21].
ELM is now widely used in fields such as face recognition, image classification, and short-term wind prediction. Wind power forecasting is well suited to ELM because factors such as wind speed, air conditions, temperature, humidity, and wind turbine arrangement all influence wind power production, yet exactly how they affect it is not clearly known [22]. An ELM model can therefore be built from example data and used to predict the short-term power curve.
The ELM model is based on a single-hidden-layer feedforward neural network (SLFN). Its main advantage is that the weights and thresholds between the input layer and the hidden layer are assigned randomly and are not adjusted during the learning process, so training completes extremely fast. For this reason, ELM is chosen as the predictor for day-ahead wind power in the short-term time scale. The structure of a standard ELM network is shown in Fig. 1.
The main parameters of ELM are described as follows:
$$ \boldsymbol{\omega} ={\left[\begin{array}{cccc}\hfill {\omega}_{11}\hfill & \hfill {\omega}_{12}\hfill & \hfill \cdots \hfill & \hfill {\omega}_{1n}\hfill \\ {}\hfill {\omega}_{21}\hfill & \hfill {\omega}_{22}\hfill & \hfill \cdots \hfill & \hfill {\omega}_{2n}\hfill \\ {}\hfill \vdots \hfill & \hfill \vdots \hfill & \hfill \vdots \hfill & \hfill \vdots \hfill \\ {}\hfill {\omega}_{l1}\hfill & \hfill {\omega}_{l2}\hfill & \hfill \cdots \hfill & \hfill {\omega}_{ln}\hfill \end{array}\right]}_{l\times n} $$
(1)
where \( \boldsymbol{\omega} \) is the weight matrix between the input layer and the hidden layer, and \( {\omega}_{ij} \) is the weight connecting the i-th hidden node to the j-th input node, so that each row of \( \boldsymbol{\omega} \) holds the input weights of one hidden node, as used in Eqs. (6)–(8). Here l is the number of hidden nodes and n is the number of input nodes.
$$ \boldsymbol{\beta} ={\left[\begin{array}{cccc}\hfill {\beta}_{11}\hfill & \hfill {\beta}_{12}\hfill & \hfill \cdots \hfill & \hfill {\beta}_{1m}\hfill \\ {}\hfill {\beta}_{21}\hfill & \hfill {\beta}_{22}\hfill & \hfill \cdots \hfill & \hfill {\beta}_{2m}\hfill \\ {}\hfill \vdots \hfill & \hfill \vdots \hfill & \hfill \vdots \hfill & \hfill \vdots \hfill \\ {}\hfill {\beta}_{l1}\hfill & \hfill {\beta}_{l2}\hfill & \hfill \cdots \hfill & \hfill {\beta}_{lm}\hfill \end{array}\right]}_{l\times m} $$
(2)
where \( \boldsymbol{\beta} \) is the weight matrix between the hidden layer and the output layer, \( {\beta}_{ij} \) is the weight between the i-th hidden node and the j-th output node, and m is the number of output nodes.
$$ b={\left[\begin{array}{cccc}\hfill {b}_1\hfill & \hfill {b}_2\hfill & \hfill \cdots \hfill & \hfill {b}_l\hfill \end{array}\right]}_{l\times 1}^{T} $$
(3)
where b is the threshold vector of the hidden layer, with one entry per hidden node.
Let X denote the input matrix, with n input variables and p samples. The historical data X are used to train the ELM network.
$$ \boldsymbol{X}={\left[\begin{array}{cccc}\hfill {x}_{11}\hfill & \hfill {x}_{12}\hfill & \hfill \cdots \hfill & \hfill {x}_{1p}\hfill \\ {}\hfill {x}_{21}\hfill & \hfill {x}_{22}\hfill & \hfill \cdots \hfill & \hfill {x}_{2p}\hfill \\ {}\hfill \vdots \hfill & \hfill \vdots \hfill & \hfill \hfill & \hfill \vdots \hfill \\ {}\hfill {x}_{n1}\hfill & \hfill {x}_{n2}\hfill & \hfill \cdots \hfill & \hfill {x}_{np}\hfill \end{array}\right]}_{n\times p} $$
(4)
The actual output matrix of the ELM network is defined as:
$$ \boldsymbol{T}={\left[\begin{array}{cccc}\hfill {\boldsymbol{t}}_1\hfill & \hfill {\boldsymbol{t}}_2\hfill & \hfill \cdots \hfill & \hfill {\boldsymbol{t}}_p\hfill \end{array}\right]}_{m\times p} $$
(5)
Based on Eqs. (1)–(4), each column \( {\boldsymbol{t}}_j \) of the output matrix is given by:
$$ {\boldsymbol{t}}_{\boldsymbol{j}}={\left[\begin{array}{c}\hfill {t}_{1j}\hfill \\ {}\hfill {t}_{2j}\hfill \\ {}\hfill \vdots \hfill \\ {}\hfill {t}_{mj}\hfill \end{array}\right]}_{m\times 1}={\left[\begin{array}{c}\hfill {\displaystyle \sum_{i=1}^l{\beta}_{i1}g\left({\boldsymbol{\omega}}_{\boldsymbol{i}}{\boldsymbol{X}}_{\boldsymbol{j}}+{b}_i\right)}\hfill \\ {}\hfill {\displaystyle \sum_{i=1}^l{\beta}_{i2}g\left({\boldsymbol{\omega}}_{\boldsymbol{i}}{\boldsymbol{X}}_{\boldsymbol{j}}+{b}_i\right)}\hfill \\ {}\hfill \vdots \hfill \\ {}\hfill {\displaystyle \sum_{i=1}^l{\beta}_{im}g\left({\boldsymbol{\omega}}_{\boldsymbol{i}}{\boldsymbol{X}}_{\boldsymbol{j}}+{b}_i\right)}\hfill \end{array}\right]}_{m\times 1},\left(j=1,2,3,......,p\right) $$
(6)
where
$$ {\boldsymbol{\omega}}_{\boldsymbol{i}}=\left[\begin{array}{cccc}\hfill {\omega}_{i1}\hfill & \hfill {\omega}_{i2}\hfill & \hfill \cdots \hfill & \hfill {\omega}_{in}\hfill \end{array}\right] $$
(7)
$$ {\boldsymbol{X}}_{\boldsymbol{j}}={\left[\begin{array}{cccc}\hfill {x}_{1j}\hfill & \hfill {x}_{2j}\hfill & \hfill \cdots \hfill & \hfill {x}_{nj}\hfill \end{array}\right]}^T $$
(8)
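and g(x) is the activation function of the hidden layer.
Stacking Eq. (6) over all p samples gives \( \boldsymbol{H}\boldsymbol{\beta} ={\boldsymbol{T}}^T \), so the output weights can be estimated from Eqs. (5)–(8) as:
$$ \widehat{\boldsymbol{\beta}}={\boldsymbol{H}}^{-1}{\boldsymbol{T}}^T $$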
(9)
where
$$ \begin{array}{l}\boldsymbol{H}\left({\boldsymbol{\omega}}_1,{\boldsymbol{\omega}}_2,\cdots, {\boldsymbol{\omega}}_l,{b}_1,{b}_2,\cdots, {b}_l,{\boldsymbol{X}}_1,{\boldsymbol{X}}_2,\cdots, {\boldsymbol{X}}_p\right)\\ {}=\left[\begin{array}{cccc}\hfill g\left({\boldsymbol{\omega}}_1\cdot {\boldsymbol{X}}_1+{b}_1\right)\hfill & \hfill g\left({\boldsymbol{\omega}}_2\cdot {\boldsymbol{X}}_1+{b}_2\right)\hfill & \hfill \cdots \hfill & \hfill g\left({\boldsymbol{\omega}}_l\cdot {\boldsymbol{X}}_1+{b}_l\right)\hfill \\ {}\hfill g\left({\boldsymbol{\omega}}_1\cdot {\boldsymbol{X}}_2+{b}_1\right)\hfill & \hfill g\left({\boldsymbol{\omega}}_2\cdot {\boldsymbol{X}}_2+{b}_2\right)\hfill & \hfill \cdots \hfill & \hfill g\left({\boldsymbol{\omega}}_l\cdot {\boldsymbol{X}}_2+{b}_l\right)\hfill \\ {}\hfill \vdots \hfill & \hfill \vdots \hfill & \hfill \vdots \hfill & \hfill \vdots \hfill \\ {}\hfill g\left({\boldsymbol{\omega}}_1\cdot {\boldsymbol{X}}_p+{b}_1\right)\hfill & \hfill g\left({\boldsymbol{\omega}}_2\cdot {\boldsymbol{X}}_p+{b}_2\right)\hfill & \hfill \cdots \hfill & \hfill g\left({\boldsymbol{\omega}}_l\cdot {\boldsymbol{X}}_p+{b}_l\right)\hfill \end{array}\right]\end{array} $$
(10)
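Here \( {\boldsymbol{H}}^{-1} \) denotes the Moore–Penrose pseudoinverse of H, since H is generally not a square matrix. The ELM can be trained with the following algorithm:
Algorithm ELM: Given a training set \( \left\{\left(\boldsymbol{X},\boldsymbol{T}\right)\ |\ \boldsymbol{X}\in {R}^{n\times p},\boldsymbol{T}\in {R}^{m\times p}\right\} \), an activation function g(x), a testing set \( \widehat{\boldsymbol{X}} \), and the hidden node number l.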
Step one: Randomly assign the input weights ω and the thresholds b.
Step two: Calculate the hidden layer output matrix H by Eq. (10).
Step three: Calculate the output weight matrix \( \widehat{\boldsymbol{\beta}} \) by Eq. (9).
Step four: Feed the testing matrix \( \widehat{\boldsymbol{X}} \) into the network and compute the testing outputs with the output weights obtained from Eq. (9).
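The four steps can be sketched in NumPy as follows. This is a minimal illustration rather than the implementation used in this work: the sigmoid activation, the uniform initialization range [-1, 1], the function names `elm_train` and `elm_predict`, and the synthetic cubic data are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation g(x); an assumed choice, the text leaves g unspecified."""
    return 1.0 / (1.0 + np.exp(-z))

def elm_train(X, T, n_hidden, rng):
    """Fit an ELM. X has shape (n, p) and T has shape (m, p), as in Eqs. (4)-(5)."""
    n_inputs = X.shape[0]
    # Step one: random input weights (l x n) and hidden thresholds (l,).
    omega = rng.uniform(-1.0, 1.0, size=(n_hidden, n_inputs))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    # Step two: hidden layer output matrix H of shape (p, l), as in Eq. (10).
    H = sigmoid(X.T @ omega.T + b)
    # Step three: output weights from the Moore-Penrose pseudoinverse, Eq. (9).
    beta = np.linalg.pinv(H) @ T.T  # shape (l, m)
    return omega, b, beta

def elm_predict(X_new, omega, b, beta):
    """Step four: propagate inputs of shape (n, q); returns outputs of shape (m, q)."""
    H_new = sigmoid(X_new.T @ omega.T + b)
    return (H_new @ beta).T

# Toy usage on synthetic data (not the wind farm data set).
rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 1.0, size=(1, 200))  # one input variable, 200 samples
T_train = X_train ** 3                          # smooth synthetic target
omega, b, beta = elm_train(X_train, T_train, n_hidden=20, rng=rng)
T_fit = elm_predict(X_train, omega, b, beta)
train_rmse = float(np.sqrt(np.mean((T_fit - T_train) ** 2)))
```

Only `beta` is learned from the data; `omega` and `b` stay fixed after the random draw, so training needs no iterations.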
Error correction
Based on the ELM forecasting results, an error correction model is applied to obtain the ultra-short-term forecast. The persistence method is used as a benchmark to examine whether an advanced model performs well. In the persistence model, the future wind power is assumed to equal the value observed at the present time step:
$$ {\widehat{P}}_{t+k|t}={P}_t $$
(11)
where \( {\widehat{P}}_{t+k|t} \) is the forecast made at time t for the look-ahead time k and \( {P}_t \) is the measurement at time t.
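As a small illustration of the persistence benchmark, the sketch below pairs each forecast with the later measurement it targets. The function name `persistence_forecast` and the sample values are hypothetical, and the mean absolute error is used only as an example metric.

```python
import numpy as np

def persistence_forecast(power, k):
    """Pair each forecast P_hat[t+k|t] = P[t] with its target P[t+k] (requires k >= 1)."""
    power = np.asarray(power, dtype=float)
    forecasts = power[:-k]  # value known at time t
    targets = power[k:]     # value observed at time t + k
    return forecasts, targets

measured = [10.0, 12.0, 15.0, 14.0, 9.0]  # made-up power values in MW
forecasts, targets = persistence_forecast(measured, k=1)
mae = float(np.mean(np.abs(targets - forecasts)))
```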
Compared with wind power itself, the temporal characteristics of wind power forecasting errors are less discussed in the literature. However, analysis of the state transition probabilities among different error levels shows that the forecasting error level at the next time point tends to stay the same as at the present time point. Thus, the error for the next time point can be written as
$$ {\widehat{e}}_{t+1|t}={e}_t $$
(12)
where \( {e}_t \) is the deviation between the measured and the forecasted wind power:
$$ {e}_t={p}_t-{\widehat{p}}_{t|t-1} $$
(13)
The computed error is then added to the forecasted wind power for the next time point to obtain the corrected forecast:
$$ {\tilde{p}}_{t+1|t}={\widehat{p}}_{t+1|t}+{\widehat{e}}_{t+1|t}={\widehat{p}}_{t+1|t}+{e}_t $$
(14)
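The correction in Eqs. (12)–(14) amounts to shifting the error series by one step and adding it to the forecast. The sketch below assumes one-step-ahead forecasts stored in an array aligned with the measurements; the function name `correct_forecasts` and the numbers are illustrative only.

```python
import numpy as np

def correct_forecasts(measured, forecast):
    """Apply Eqs. (12)-(14): add the latest observed error to the next forecast.

    measured[t] is p_t and forecast[t] is the one-step-ahead forecast of p_t
    made at time t-1. Both are 1-D sequences of equal length.
    """
    measured = np.asarray(measured, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    errors = measured - forecast  # e_t, Eq. (13)
    corrected = forecast.copy()
    # The first point has no previous error, so it is left uncorrected.
    corrected[1:] = forecast[1:] + errors[:-1]  # Eq. (14)
    return corrected

measured = [20.0, 22.0, 25.0, 24.0]  # made-up measurements in MW
forecast = [18.0, 21.0, 23.0, 25.0]  # hypothetical ELM forecasts in MW
corrected = correct_forecasts(measured, forecast)
```

With these numbers the errors are 2, 1, 2 and -1 MW, so the corrected series is 18, 23, 24 and 27 MW.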
The flow chart of the wind power forecasting procedure is shown in Fig. 2.
Data description and preprocessing
The proposed model is verified using data measured at a wind farm located in northern China over a period of about 15 months, from 24 February 2014 to 31 May 2015. The 41072 non-consecutive data points before 02 March 2015 are used for training the ELM models, whereas the consecutive time series of 66 days from 02 March 2015 to 31 May 2015 is used to verify the models' performance. The total installed capacity of the wind farm is 50 MW. The measured data are used for both training and verifying the ELM model. The data are sampled at 15-min intervals. A scatter plot of wind power versus wind speed for the wind farm is shown in Fig. 3.
The distribution of wind speed is shown as a frequency histogram in Fig. 4. It can be well fitted using a Weibull distribution.
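For reference, the two-parameter Weibull density commonly used for wind speed can be written as below. The symbols κ (shape parameter) and c (scale parameter) are notation introduced here for illustration; no fitted values for this wind farm are implied.

$$ f(v)=\frac{\kappa }{c}{\left(\frac{v}{c}\right)}^{\kappa -1}\exp \left[-{\left(\frac{v}{c}\right)}^{\kappa}\right],\kern1em v\ge 0 $$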
The mechanical power extracted from wind by a wind turbine is a function of the wind speed, blade pitch angle, and shaft speed. The algebraic equation shown below characterizes the power extracted [23].
$$ {P}_m=\frac{1}{2}\rho {v}_w^3\pi {r}^2{C}_p\left(\lambda \right) $$
(15)
where \( {P}_m \) is the power extracted from the wind, in watts; ρ is the air density, in kg/m^{3}; r is the radius swept by the rotor blades, in m; \( {v}_w \) is the wind speed, in m/s; \( {C}_p \) is the performance coefficient; and λ is the tip-speed ratio, i.e., the ratio of the turbine blade tip speed to the wind speed:
$$ \lambda =\frac{\omega_tr}{v_w} $$
(16)
where \( {\omega}_t \) is the mechanical rotor speed, in rad/s.
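To give a feel for the magnitudes in Eqs. (15) and (16), the sketch below evaluates both for a single turbine. All numerical values are illustrative assumptions, not parameters of the wind farm studied here, and the performance coefficient is passed in as a plain number rather than computed from λ.

```python
import math

def mechanical_power(rho, v_w, r, c_p):
    """Eq. (15): power extracted from the wind, in watts."""
    swept_area = math.pi * r ** 2
    return 0.5 * rho * v_w ** 3 * swept_area * c_p

def tip_speed_ratio(omega_t, r, v_w):
    """Eq. (16): blade tip speed divided by wind speed."""
    return omega_t * r / v_w

# Illustrative values only, not taken from the wind farm in this paper.
p_m = mechanical_power(rho=1.225, v_w=10.0, r=40.0, c_p=0.4)  # about 1.23 MW
lam = tip_speed_ratio(omega_t=1.5, r=40.0, v_w=10.0)          # 6.0
```

Because the wind speed enters cubed, halving it to 5 m/s cuts the extracted power by a factor of eight when the other quantities are unchanged.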
Eq. (15) shows that the extracted power depends on the air density and the wind speed, neither of which can be controlled. As a result, a wind turbine may yield different power output even at the same wind speed. A wind farm comprises tens or even hundreds of turbines, which makes the relationship between farm output and wind speed much weaker than that of a single turbine. Even so, the wind power output clearly depends on wind speed, as shown in Fig. 3. The objective of the ELM model is to characterize this implicit dependence. However, some anomalous data exist in the original data sets and would degrade the wind power forecasting accuracy. Two kinds of anomalies are eliminated before building the ELM model.

1)
The wind speed is high (e.g., above 5 m/s) but the corresponding wind power is close to zero.

2)
The wind speed is close to zero but the wind farm output is very large (e.g., more than half of the rated capacity of the wind farm).
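The two anomaly rules can be expressed as boolean masks over the paired speed and power series. In the sketch below, the 5 m/s and half-capacity thresholds follow the examples in the text, whereas the cut-offs for "close to zero" (1% of rated capacity and 0.5 m/s) and the function name `remove_anomalies` are assumptions chosen for illustration.

```python
import numpy as np

def remove_anomalies(speed, power, rated_capacity,
                     high_speed=5.0, zero_power_frac=0.01,
                     low_speed=0.5, high_power_frac=0.5):
    """Drop samples matching either anomaly rule; thresholds are illustrative."""
    speed = np.asarray(speed, dtype=float)
    power = np.asarray(power, dtype=float)
    # Anomaly 1: high wind speed but almost no power.
    stalled = (speed > high_speed) & (power < zero_power_frac * rated_capacity)
    # Anomaly 2: almost no wind but very large power.
    spurious = (speed < low_speed) & (power > high_power_frac * rated_capacity)
    keep = ~(stalled | spurious)
    return speed[keep], power[keep]

speed = [3.0, 8.0, 0.2, 12.0, 0.1]   # m/s, made-up samples
power = [4.0, 0.0, 30.0, 45.0, 0.0]  # MW, made-up samples
clean_speed, clean_power = remove_anomalies(speed, power, rated_capacity=50.0)
```

Here the second sample (8 m/s, 0 MW) and the third sample (0.2 m/s, 30 MW) are removed, and the other three are kept.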
Moreover, the wind speed and power data are normalized using the following formula:
$$ {x}_{normal}=\frac{x- \min (x)}{ \max (x)- \min (x)} $$
(17)
where x is the original data, \( {x}_{normal} \) is the normalized data, max(x) is the maximum of the original data set, and min(x) is the minimum of the original data set.
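A minimal sketch of this normalization is given below, assuming Eq. (17) is the usual min-max scaling to the interval [0, 1]. The helper names and the inverse transform `denormalize`, which maps model outputs back to physical units, are additions for illustration.

```python
import numpy as np

def min_max_normalize(x):
    """Scale x to [0, 1]; assumes x is not constant (max(x) > min(x))."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min), x_min, x_max

def denormalize(x_normal, x_min, x_max):
    """Invert the scaling, e.g. to express forecasts in MW again."""
    return np.asarray(x_normal, dtype=float) * (x_max - x_min) + x_min

power = [0.0, 12.5, 25.0, 50.0]  # made-up power values in MW
power_normal, p_min, p_max = min_max_normalize(power)
restored = denormalize(power_normal, p_min, p_max)
```

In practice the minimum and maximum would be taken from the training data and reused for the test data, so that both are scaled consistently.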