Introduction
The diversity of patterns and components within the behavior of climate time series data in general and evaporation data in particular suffers from the problem of non-linearity as is most climate and environmental data because of its complexity in terms of seasonal and periodic fluctuations that affect the homogeneity of the data, which will lead to additional complications and obstacles in analyzing as time series and forecasting it, and this may negatively affect the accuracy of the forecast results, as the results may be unsound. The accuracy of forecast results for evaporation may depend on the selection of the appropriate method used for analysis and forecasting mainly.
The problem of data heterogeneity due to the effects of seasonal and periodic patterns, the evaporation time series data will be divided into two seasons; the first is the hot season, which includes data for the months (May-September), and the cold season, which includes data for the months (November-March). The autoregressive and integrated moving average (ARIMA) model is one of the most famous time series models that is commonly used for univariate time series analysis as a traditional statistical model for forecasting. The ARIMA model is a linear model that may not be suitable for dealing with data that suffer from non-linearity problems. However, the ARIMA model is often used with climate data as with climate variables, including variable evaporation time series (1, 2). Two forecast methods, represented by the regression tree (RT) and the ARIMA model will be proposed and then hybridized by reconciling them together within the hybrid ARIMA-RT model as a developed method used to improve forecasting results by dealing well with the problem of non-linearity data. Wavelet transformations (WT) will be used also with both the ARIMA model and the hybrid ARIMA-RT model to test the role of their effect in improving forecasting results in two developed models, namely ARIMA-Wavelet and RT-Wavelet, based on the ARIMA model.
In previous studies, many researchers used the ARIMA model in the presence of the WT effect with many time series data, whether climate or other, as in (3, 4). In light of the presence of the WT effect, the RT model was used, and it was also used after reconciling it in the hybrid ARIMA-RT model to forecast many series data. The results of this study reflected the superiority of the hybrid method compared to the traditional ARIMA model. The results included that forecasts were unaffected by the use of WT clearly. There was a clear improvement in the forecasting accuracy of the ARIMA-Wavelet model compared to the traditional ARIMA model. In the case of comparing the two hybrid models, ARIMA-RT and RT-Wavelet, there was no clear effect and role for using WT. This is because the basic development and improvement was accomplished after proposing RT after reconciling it with ARIMA to deal well with the problem of nonlinearity. From this study, it can be concluded that the ARIMA-RT hybrid model has a clear role in improving the accuracy of forecast results. As for using WT, it leads to an improvement in the accuracy of forecasts in the absence of introducing the effect of RT and reconciling it with traditional models, and this effect may vary depending on the data, its nature, and its homogeneity.
Materials and methods
The framework for this study will include:
- Data initialization.
- Modeling using ARIMA models.
- Using the WT method with the residuals series resulting from the ARIMA model and forecasting using the ARIMA-Wavelet hybrid method.
- Modeling using the RT method based on the ARIMA model structure, referred to as the hybrid ARIMA-RT model.
- Modeling RT method based on hybrid ARIMA-Wavelet method structure and referred to as the hybrid RT-Wavelet model.
- Comparison of forecast results using ARIMA, ARIMA-Wavelet, ARIMA-RT, and RT-Wavelet methods. This framework can be illustrated as in Figure 1 below.
Figure 1: The general framework of the study
Autoregressive integrated moving average (ARIMA) model
The reason of unstable the time series is the variance or the series average, or both, change with time (unstable). Then some manipulations can be performed on the time series for the purpose of achieving stability in it. If the variance varies by time, then we can perform some transformation formulas on the data to achieve stability in it. This type of transformations is called power transformations. The condition of stability in the time series can also be achieved through the method of differences for unstable series, which is called (first-order differences), as it represents the first difference between two consecutive observations values , and when stability is not achieved, one resorts to differences from Higher rank or seasonal rank. The ARIMA model is one of the traditional models for univariate time series forecasting. The Box-Jenkins method will be adopted in its four stages: identification, parameter estimation, diagnostic examination, and forecasting. As an optimal method for analyzing time series, the multiplicative seasonal ARIMA model [ARIMA(p,d,q)(P,D,Q)s] was used, which takes the following general mathematical formula.
and are the non-seasonal parameters of the auto regression and the moving averages at step k respectively, and reflect the effect of changing the time series variable ( ) and the random variable ( ) at step k respectively, , and are the seasonal parameters of the auto regression and the moving averages respectively, S represents the seasonal period, p indicates the rank of the autoregressive model, d represents the number of differences necessary to achieve stability, q symbolizes the rank of the moving averages, P indicates the rank of the seasonal autoregressive model, Q symbolizes the rank of the seasonal moving averages, D represents the number of seasonal differences necessary to achieve stability, B is the backshift operator, is the white noise with zero arithmetic mean and variance , and we can write it as follows (5, 6).
It is necessary, when analyzing a time series, to draw the time series graphically in order to identify many of its features, especially determining whether the time series is stable or unstable, in addition to other features. The autocorrelation and partial autocorrelation functions are useful tools for demonstrating the stability of a time series. The stability condition of an ARIMA model require that that the mean and variance are constants (independent of time t) . In many cases, time series are unstable and the reason for this is due to a change in the mean over time, for example it has a general trend, or due to a change in the variance of the series over time. If the time series is unstable, we can achieve weak stability in it, or sometimes we call it second-order stability (7, 8).
Beginning to identify the features of the series and determine the ranks of the polynomials in the (ARIMA) model and the number of parameters (p, q, P, Q) after achieving the stability of the series. The table below shows a simplified methodology for determining the ranks of the polynomials in (ARIMA) and the number of parameters in the model through the (ACF) (PACF) functions. This is a summary of the recognition phase of the Box and Jenkins methodology. After completing the first stage or step, which is identifying the hypothetical (ARIMA) model using the Box-Jenkins method, the second step is estimating the model parameters through the model’s potential function method. Such estimates are referred to as maximum potential estimates.
From the statistical aspect, the non-significance of the estimated parameters is one of the reasons that impair the accuracy of the model. Therefore, the null hypothesis will be tested, which states that the parameter estimates are not significantly different from zero, that is, equal to zero, as the critical value of the (t) test is the tabular value multiplied by the estimated standard error of the parameter. If the absolute value of the value calculated for the t-test for each estimator is at least equal to the critical value, then the null hypothesis, meaning that the estimator is significant, will be rejected. The significance of all model parameters is one of the most important conditions that must be achieved as one of the most important diagnostic examination procedures. Also, the autocorrelation state of the residual’s series can be used to test whether the residuals series is identical to the white noise process . Therefore, the autocorrelation function of the residuals must not contain significant correlation coefficients as an enhancement measure to pass the diagnostic tests. There are other tests to diagnose the quality of the model, but the significance of the parameters and the ACF status of the residuals are among the most prominent and most important for the model to be qualified to be used to forecast future values with good performance (9, 10).
From the statistical aspect, the non-significance of the estimated parameters is one of the reasons that impair the accuracy of the model. Therefore, the null hypothesis will be tested, which states that the parameter estimates are not significantly different from zero, that is, equal to zero, as the critical value of the (t) test is the tabular value multiplied by the estimated standard error of the parameter. If the absolute value of the value calculated for the t-test for each estimator is at least equal to the critical value, then the null hypothesis, meaning that the estimator is significant, will be rejected. The significance of all model parameters is one of the most important conditions that must be achieved as one of the most important diagnostic examination procedures. Also, the SACF of residuals series can be used to test whether the residuals series is identical to the white noise process . Therefore, the autocorrelation function of the residuals must not contain significant correlation coefficients as an enhancement measure to pass the diagnostic examination. There are other tests to diagnose the quality of the model, but the significance of the parameters and the ACF status of the residuals are the most prominent and most important for the model to be qualified to be used to forecast future values with good performance.
Regression tree (RT) model
Regression trees are a case of decision trees. Decision trees are defined as a graph that shows the actions that can be taken from natural states and their probabilities, and they are one of the forecasting models used in statistics, data mining, and machine learning. Decision trees are also important in analyzing decision issues that contain a series of decisions or a series of sequentially occurring states of nature.
Decision trees are created by dividing the original data set to form the root node of the tree and then dividing it into subgroups until the leaves represent the final decisions. Regression trees differ from classification trees which are the second type of trees, whose are forecasting results are categorical, just as the dependent variable in them is originally of the categorical type, while regression trees differ in terms of their outputs and forecasts, which are real values, because the dependent variable in them is a continuous variable. This method is based on the consideration that all elements of the studied society form one group and are then divided into two or more groups, subdivided, and so on. The process of branching and decision-making takes place through breaks represented as logical sentences and mathematical conditions placed on the joints of the tree in the form of nodes that branch into two or more branches, leading to the leaves as purposeful endings that represent the final decisions when the stopping order determined by the researcher is achieved (11).
The design of the classification and regression tree goes through several stages: (12, 13)
- The construction or building stage. This stage consists of several steps, which are:
- Determine the dependent variable .
- Choosing explanatory variables .
- Choosing the origin node so that it is suitable for the research goal.
- Determining the rules of branching (fission).
- The process of division or fragmentation.
- Stopping process.
- Pruning process.
- Assembly process.
- Draw the tree.
On each node , there are two subjects. The first one that expresses the left node resulting from the branching at the point , and the subject expresses the right node resulting from the branching at the point . The following figure 2 shows the regression tree.
Figure 2: General structure of the regression tree
The regression tree is designed using the characteristics of the data sets taken from the elements of sample size n, which include the observations values of the variables and the observation values of the corresponding groups in the dependent variable .
is the number of sample elements that belong to the region corresponding to the node , which is distributed among a number of groups that make up the dependent according to their properties. represents the number of sample elements that belong to the region and to the group , as the following relationship is satisfied.
The probabilities of belonging and distribution to groups can be calculated as follows (14).
The probability that any element of the samples belongs to the region corresponding to the node can be calculated as follows.
The probability that any element of the samples belonging to the region corresponding to the node belongs to the group can be calculated as follows.
Also, the probability that the sample elements corresponding to the node are distributed as a node to the left and a node to the right can be calculated, respectively, as follows.
The probability that any element of the sample corresponding to the node will go to the left node or the right node can be calculated, respectively, as follows.
Thus, the features of each group corresponding to the node can be determined according to what is proportional to it in terms of the number of sample elements to which it belongs from the region , according to their probabilities. The probabilities corresponding to the groups are compared and the group that corresponds to the greatest probabilities is chosen (15).
Wavelet Transform (WT) (16, 17)
Wavelet Transform or Wavelet Analysis is a mathematical analytical method used to process signals for many practical applications. The basis of this theory is based on Fourier's theory, which is a method used to represent periodic signals in the form of a sine and cosine series. This conversion transfers the signal from the time domain to the frequency domain and vice versa. The wavelet transform is considered a complement to spectral analysis, which depends on the conversion from the time domain to the frequency domain based on the rules of complex trigonometric or exponential functions that lead to the detection of periodicity in the wavelet movement by relying on the conversion to the field of measurement and displacement using the rules of orthogonal functions that lead to detection. About the regularity or irregularity of wave movement.
This short-time Fourier transform represents the signal using a specific window depending on its temporal and frequency resolution. The main problem in this conversion is the loss in time and frequency. High accuracy is obtained for signals that change quickly when using a small window, but this accuracy is not high for signals that change slowly and when using a large window, the exact opposite happens. For this reason, the theory of wavelet transformation is developed, so the wavelet transform is divided into three types: continuous wavelet transform, discrete wavelet transform, and wavelet packet transform. Perhaps the most famous types of wavelet functions can be represented graphically as follows:
|
|
|
|
|
Haar function and Mexican hat function
|
Daubchies functions db2 and db9
|
Bi-orthogonal functions bior 2.6
|
Figure 3: Some types of wavelet transform functions
Wavelets emerged an interest statistical in 1989 when Mallat introduced a multiple analysis method for processing discrete data (18). And (19) also showed that the contracted wavelet has ideal statistical properties that are desirable in problems associated with noise reduction. There are several useful applications for wavelets, such as data compression, image feature detection, and noise reduction from time series.
The measurements of forecasting Error (20)
One of the most important criteria used to express the accuracy of forecast data is as follows:
The mean absolute percentage errors is:
n represents the number of observations, is the variable of the original time series at time t, while represents the forecast variable, and represents the residuals series at time t.
Results and conclusions
Data used in this study
The ARIMA model will be studied using the Box-Jenkins method methodology and the RT method with the wavelet transform effect to analyze evaporation time series data in the city of Mosul/Iraq after separating it into hot and cold seasons. Data were taken from the Agricultural Meteorological Observatory at the location specified by longitude E = 43.16 and latitude N = 36.33. Because of the data it contains, it is due to the heterogeneity between the different seasons and the diversity of the seasonal seasons and their fluctuations. To achieve greater consistency, the data was separated into two seasons, the first being the cold months (November-December-January-February-March). The second season is hot for months (June - July - August - September). The study data included 372 hot season observations and 303 cold season observations as daily data. About 20% of the end-of-season time series data will be allocated as test period data and used as hypothetical future data used to test models built on the training period data. Therefore, the training data for the hot season will include (292 observations), while it will be (80 observations) at the end of the hot season as a testing period. As for the training period for the cold season, it will contain (243 observations) and (60 observations) for the cold time series as data for the test period.
Steps for the general framework of the practical aspect
- Alignment of evaporation time series data for homogeneity with data set for the training and testing periods for each season.
- Building an ARIMA model based on the Box-Jenkins methodology.
- Taking the right side of the ARIMA models as input for the RT method and setting it as an input matrix whose number of columns is the number of terms of the right side and the number of rows represents the number of observations of the training period. The target variable is the one on the left side of the ARIMA model.
- Using the WT method to analyze the components of the residuals series resulting from the ARIMA model and obtain forecasts from the ARIMA-Wavelet hybrid method for the training and testing periods for the two seasons.
- Modeling the training data for each season using the RT method, relying on the structure of the ARIMA models for each season and obtaining forecasts from the hybrid ARIMA-RT model for the training and testing periods.
- Modeling the training data for each season using the RT method, relying on the structure and results of the hybrid ARIMA-Wavelet method for each season, and obtaining forecasts from the hybrid RT-Wavelet model for the training and testing periods.
- Comparing the forecast results for the training and testing periods and for all seasons for the evaporation time series using the ARIMA, ARIMA-Wavelet, ARIMA-RT, and RT-Wavelet methods through the forecast error criterion.
ARIMA Model
The Box-Jenkins methodology is applied to analyze time series and obtain the best models that express the behavior of the data. The first step is to identify and diagnose the state of stability and the necessity of achieving it, while determining the ranks of the model. To reveal the stability of the data, the series will be plotted with the ACF and PACF functions. Figures 4 and 5 below show the time series plotting of evaporation and the ACF and PACF functions for the hot and cold seasons.
Figure 4. Time series plotting of evaporation, ACF and PACF for the hot season
Figure 5. Time series plotting of evaporation, ACF, and PACF for the cold season
Figures 4 and 5 above show the unitability (since more than 5% of correlation coefficients out of confidence interval)of the evaporation time series for the hot and cold seasons. Therefore, stability will be achieved by taking successive and seasonal differences. Through experimentation and testing, two decisions were made in parallel to achieve stability. In the first, one regular team and one seasonal team were taken, respectively. In the second, only one regular team was taken, for both seasons. It turned out that both decisions were correct, and each of them gave a different advantage than the other, while it turned out that the data was completely stable after making both decisions. To determine the ranks of the models, the ACF and PACF functions are drawn. Figures 6 and 7 show the ACF and PACF after achieving stability with both of the above decisions for the hot and cold evaporation seasons, respectively.
Figure 6. Time series plotting of evaporation, ACF and PACF for the hot season
Figure 7. Time series plotting of evaporation, ACF, and PACF for the cold season
From Figures 6 and 7 above, and according to the similar behavior of the ACF and PACF functions in both the hot and cold seasons, the best ARIMA models will be as follows:
- The first model: ARIMA(0,1,2) for both seasons and the results of significance are as in Table 1 below.
- The second model: ARIMA(1,1,1)(0,1,1)5 for both seasons and the results of significance are as in Table 2 below.
Table 1. Parameter values and their significance for the ARIMA(0,1,2) model for both hot and cold seasons.
|
Hot Season
|
|
|
Cold Season
|
|
|
Type
|
Parameter
|
Calc. t
|
p-value
|
|
|
Type
|
Parameter
|
Calc. t
|
p-value
|
|
|
|
0.4128
|
7.14
|
0.000
|
|
|
|
0.4216
|
7.12
|
0.000
|
|
|
|
0.1837
|
3.18
|
0.002
|
|
|
|
0.3881
|
6.55
|
0.000
|
|
Table 2. Parameter values and their significance for the ARIMA(1,1,1)(0,1,1)5 model for both hot and cold seasons.
|
Hot Season
|
|
|
Cold Season
|
|
|
Type
|
Parameter
|
Calc. t
|
p-value
|
|
|
Type
|
Parameter
|
Calc. t
|
p-value
|
|
|
|
0.2886
|
2.75
|
0.006
|
|
|
|
0.5228
|
7.89
|
0.000
|
|
|
|
0.7182
|
9.32
|
0.000
|
|
|
|
0.9354
|
34.06
|
0.000
|
|
|
|
0.9726
|
49.33
|
0.000
|
|
|
|
0.9446
|
30.79
|
0.000
|
|
From Tables 1 and 2 above, for both seasons, it is clear that all estimated parameter values for the ARIMA(0,1,2) and ARIMA(1,1,1)(0,1,1)5 models are significant. The ACF function was also tested for the residuals of the above models and it turned out that it conforms to the conditions of good models. Thus, the above two models have successfully passed the diagnostic tests. Table 3 below shows the MAPE error criterion values for the training and testing periods for both hot and cold seasons for the above two models.
Table 3. MAPE error criterion values for ARIMA models for the training and testing periods for hot and cold seasons.
|
Models
|
Hot Season
|
|
|
Cold Season
|
|
|
Training
|
Testing
|
|
|
Training
|
Testing
|
|
|
ARIMA(0,1,2)
|
7.48748
|
59.96663
|
|
|
17.49708
|
48.69273
|
|
|
ARIMA(1,1,1)(0,1,1)5
|
7.56499
|
60.77533
|
|
|
16.68917
|
56.71475
|
|
From Table 3 above, it is noticeable that preference varies from one model to another and it is not possible to judge the better predictive performance of one model over the other.
RT model
The structure of the models that were concluded as the best ARIMA models and referred to above was relied upon to be used in building the structure of the input regression tree models, whose foundations, equations, and methods of use in the theoretical aspect are indicated, and it is conventionally referred to as the hybrid ARIMA-RT model. Table 4 below shows the MAPE error criterion values for the training and testing periods for both hot and cold seasons for the hybrid ARIMA-RT model based on the above ARIMA models.
Table 4 the MAPE error criterion values for the hybrid ARIMA-RT model for the training and testing periods for both hot and cold seasons
|
Models
|
Hot Season
|
|
|
Cold Season
|
|
|
Training
|
Testing
|
|
|
Training
|
Testing
|
|
|
ARIMA(0,1,2)
|
4.16565
|
32.97756
|
|
|
8.174444
|
63.02950
|
|
|
ARIMA(1,1,1)(0,1,1)5
|
5.04347
|
32.41332
|
|
|
10.77235
|
66.52910
|
|
Table 4 above, it is noticeable that the preference varies from one RT model to another, and it is not possible to judge the better forecasting performance of one model over the other. Comparing the results to table 3 and 4 above, the hybrid ARIMA-RT model most likely outperformed the ARIMA model, unlike the results of the testing period for the cold season only.
Wavelet transform
The wavelet transform WT was used in this study to purify the variable of the residuals series of the ARIMA model, which is referred to as the hybrid ARIMA-Wavelet model, then returned it to the original ARIMA model from which it was taken. The residuals series variate of the hybrid ARIMA-RT model was also purified based on WT in the same previous manner and is referred to as the hybrid RT-Wavelet model. This purification is carried out through the analysis of the components and periodic influences that control the non- linear and heterogeneous behavior of the original series and the resulting uncertainty in the modeling, which negatively affects the forecast performance and the forecasting residuals affected by all these influences and obstacles, using rules and foundations that lead to detecting regularity or irregularity of wave movement. Using the ready-made tool in the MATLAB system regarding stochastic modeling with one-dimensional regression estimators, it therefore requires determining the type and level of the wave function, in addition to the importance of structuring the inputs and organizing their numbers and dimensions in a scientifically correct manner. The Haar wave functions are used, as well as db1, db2, and db3 at the fifth level, after obtaining the results, the priority forecasting performance was concluded after using the db2 function for the hot season and the db3 function for the cold season. The results were as in table 5 below, which shows the values of the MAPE the error criterion for the training and testing periods for both the hot and cold seasons for the ARIMA-Wavelet and the hybrid RT-Wavelet model, according on the previously mentioned ARIMA model.
Table 5. values of the MAPE for the hybrid ARIMA-Wavelet and RT-Wavelet models, based on ARIMA for the training and testing periods for both the hot and cold seasons.
|
Models
|
Hot Season
|
|
|
Cold Season
|
|
|
Training
|
Testing
|
|
|
Training
|
Testing
|
|
|
ARIMA-Wavelet
|
ARIMA(0,1,2)
|
1.67931
|
61.99826
|
|
|
2.833533
|
49.71198
|
|
|
ARIMA(1,1,1)(0,1,1)5
|
1.61756
|
62.81019
|
|
|
3.405706
|
57.24044
|
|
|
RT-Wavelet
|
ARIMA(0,1,2)
|
4.07420
|
32.95010
|
|
|
8.176733
|
63.02666
|
|
|
ARIMA(1,1,1)(0,1,1)5
|
5.05101
|
32.40999
|
|
|
10.79895
|
66.52363
|
|
By comparison with results of the previous tables and from table 5 above, the forecasting performance of the hybrid ARIMA-Wavelet model was superior in the training periods and its failure to perform well in the testing periods, which may be explained by the lack of observations in it when compared to the testing period. As for the hybrid RT-Wavelet model, it didn’t add anything new to the accuracy of forecast performance expect for a slight improvement compered to the original hybrid ARIMA-RT model, which lead us to the increase in the number of techniques and methods used to reach good forecasts may be negatively reflected in the complexity of the methodology of the proposed method, with the absence of noticeable improvement in forecast accuracy. figures 8 and 9 below illustrate and compare forecast errors (residuals) graphically for the hot and cold seasons and for the training and testing periods.
|
ARIMA(0,1,2)
|
ARIMA(1,1,1)(0,1,1)5
|
|
|
|
Training
|
|
|
Testing
|
Figure 8. comparison of evaporation time series forecast errors for models used for the hot seasons
|
ARIMA(0,1,2)
|
ARIMA(1,1,1)(0,1,1)5
|
|
|
|
Training
|
|
|
Testing
|
Figure 9. comparison of evaporation time series forecast errors for models used for the cold seasons
Figures 8 and 9 above give a clearer picture than the results of the forecast performance of the ARIMA, ARIMA-Wavelet, ARIMA-RT, and RT-Wavelet methods reported in the previous tables. From the above figures, the superiority of the forecasting performance of the hybrid ARIMA-Wavelet and RT-Wavelet models in the training and testing periods is evident through the smoothness that characterizes the residuals of the two models, which indicates a high behavioral match of the original evaporation series variable with the corresponding forecast series variable for the hot and cold seasons periods and for the two seasons, which reflects the high quality of forecasting performance of the hybrid ARIMA-Wavelet and RT-Wavelet methods.