1. Introduction
Linear regression is one of the most frequently used statistical methods, with applications in almost every field. From a statistical perspective, regression analysis is used to study the dependence of a dependent (response) variable on a set of independent (predictor) variables (Rawlings et al., 1998). In general, the most popular regression method is ordinary least squares (OLS), because of its ease and simplicity. The OLS estimator is unbiased, efficient, and consistent compared with other estimators when the assumptions of the linear regression model are satisfied. If an assumption is violated, OLS no longer yields the minimum variance, which leads to inefficient estimation of the model. One of these assumptions is that there is no exact linear relationship between the explanatory variables (Zahari et al., 2014).
Multicollinearity refers to a situation in which two or more predictor variables in a multiple regression model are highly correlated. If multicollinearity is perfect, the regression coefficients are indeterminate and their standard errors are infinite; if it is less than perfect, the coefficients are determinate but have large standard errors (Dereny et al., 2011). There are several techniques for reducing the multicollinearity problem. These include obtaining more data, removing one or more independent variables from the model, clustering the independent variables, and biased estimation techniques (Tunah and Siklar, 2015).
Ridge regression is the most widely used method for dealing with the multicollinearity problem, and it is an alternative to OLS. Its main advantage is that it reduces the variance of the slope parameters (Alibuhatto, 2016). The aims of this study are to examine the ridge regression method, which resolves multicollinearity without removing independent variables from the model at the cost of a biased estimator, and to use it to study the effect of some meteorological factors on rainfall.
2. Theoretical Part
2.1. Regression Model
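A linear regression model describes the relationship between a dependent variable and a set of independent variables (Olandrewaju et al., 2017):
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i, \qquad i = 1, \dots, n \tag{1}$$
where $y_i$ is the response variable, $x_{i1}, \dots, x_{ip}$ are the explanatory variables, $\varepsilon_i$ is the error term, and $\beta_0, \beta_1, \dots, \beta_p$ are the regression coefficients.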
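In matrix form, the model can be written as:
$$Y = X\beta + \varepsilon \tag{2}$$
where $Y$ is an $n \times 1$ vector of observations on the dependent variable, $X$ is an $n \times (p+1)$ matrix of explanatory variables, $\varepsilon$ is an $n \times 1$ vector of error terms, and $\beta$ is a $(p+1) \times 1$ vector of regression coefficients.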
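The OLS estimate of $\beta$ is obtained by minimizing the residual sum of squares (Salh, 2014):
$$S(\beta) = \varepsilon'\varepsilon = (Y - X\beta)'(Y - X\beta) \tag{3}$$
The best linear unbiased estimator of $\beta$ is then
$$\hat{\beta} = (X'X)^{-1}X'Y \tag{4}$$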
With,
(5)
(6)
(7)
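As an illustration of the least squares estimator, the following is a minimal sketch in Python with NumPy. The data, the seed, and the coefficient values in `true_beta` are assumptions made for the example and are not taken from the study. The sketch solves the normal equations $(X'X)b = X'Y$ and checks that the residuals are orthogonal to the columns of $X$.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: n = 50 observations and 2 explanatory variables.
n = 50
x = rng.normal(size=(n, 2))
true_beta = np.array([1.0, 2.0, -0.5])   # assumed intercept and slopes
X = np.column_stack([np.ones(n), x])     # design matrix with intercept column
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# OLS estimate: solve (X'X) b = X'y instead of inverting X'X explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# At the least squares solution the residuals are orthogonal to X.
residuals = y - X @ beta_hat
print("estimated coefficients:", beta_hat)
```

Solving the linear system is used here in place of an explicit inverse only for numerical convenience; both express the same estimator.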
Assumptions made about the error and the variables:
- $\varepsilon$ is a random vector.
-
-
-
- $X$ is a non-stochastic matrix.
- There is no correlation between the non-stochastic $X$ and the stochastic $\varepsilon$.
- The columns of $X$ are linearly independent, so $X'X$ is nonsingular.
Thus, the matrix $X$ has full column rank.
2.2. Multicollinearity
Multicollinearity is a statistical phenomenon in which there exists a perfect or nearly perfect linear relationship between the explanatory variables. When such a relationship exists, it is difficult to obtain reliable estimates of the individual coefficients, which can lead to incorrect conclusions about the relationship between the dependent variable and the explanatory variables (Alibuhatto, 2016).
There are two types of multicollinearity (El-Sibakhi, 2016):
- Perfect Multicollinearity
If a perfect linear relationship exists among the explanatory variables, it is called exact (perfect) multicollinearity. In this case the design (data) matrix is not of full rank, and consequently $(X'X)^{-1}$ does not exist, since $|X'X| = 0$.
- Semi-Perfect Multicollinearity
If the explanatory variables are strongly (highly) correlated but not perfectly, it is called semi-perfect multicollinearity. In this case $(X'X)^{-1}$ exists, but with relatively large diagonal elements.
Multicollinearity has several effects, described as follows (Dereny et al., 2011), (El-Sibakhi, 2016):
- High variances of the coefficients may reduce the precision of estimation.
- Multicollinearity can result in coefficients appearing to have the wrong sign.
- Estimates of coefficients may be sensitive to particular sets of sample data.
- Some variables may be dropped from the model although they are important in the population.
- The coefficients are sensitive to the presence of a small number of inaccurate data values.
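The two types of multicollinearity described above can be illustrated with a small numerical sketch in Python with NumPy. The matrix below is hypothetical: its third column is exactly the sum of the first two, so $X'X$ loses rank. Perturbing a single entry by an assumed amount of 0.001 then gives the semi-perfect case, in which $X'X$ is invertible but badly conditioned.

```python
import numpy as np

# Hypothetical design matrix: column 3 = column 1 + column 2 (perfect case).
X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 1.0, 3.0],
    [3.0, 5.0, 8.0],
    [4.0, 0.0, 4.0],
])

XtX = X.T @ X
# Rank is 2 although XtX is 3 x 3, so the inverse does not exist.
rank = np.linalg.matrix_rank(XtX, tol=1e-8)

# Semi-perfect case: a tiny perturbation breaks the exact dependence.
X_near = X.copy()
X_near[0, 2] += 1e-3
cond_near = np.linalg.cond(X_near.T @ X_near)  # finite but very large

print("rank of X'X (perfect case):", rank)
print("condition number of X'X (semi-perfect case):", cond_near)
```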
2.3. Detection of Multicollinearity
- Correlation Matrix
Compute the correlation coefficients between each pair of explanatory variables. A high, significant correlation between two variables may indicate that the variables are collinear. This method is easy, but it cannot produce a clear estimate of the degree of multicollinearity (Alibuhatto, 2016).
- The Variance Inflation Factor (VIF)
The VIF is computed from the correlation matrix of the independent variables (Rawlings et al., 1998), (Montgomery and Runger, 2002), (Raheem et al., 2019):
$$VIF_j = \frac{1}{1 - R_j^2}, \qquad j = 1, \dots, p \tag{8}$$
where $R_j^2$ is the coefficient of determination in the regression of the explanatory variable $x_j$ on the remaining explanatory variables of the model.
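A minimal sketch of the VIF computation in Python with NumPy follows. It relies on the standard identity that $1/(1 - R_j^2)$ equals the $j$-th diagonal element of the inverse correlation matrix, and it cross-checks that identity for one variable by running the auxiliary regression directly. The three predictors are simulated under assumed settings (the second is nearly a copy of the first) and are not the meteorological data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical predictors: x2 is almost a copy of x1, x3 is unrelated.
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Correlation matrix of the predictors (columns are variables).
R = np.corrcoef(X, rowvar=False)

# VIF_j is the j-th diagonal element of the inverse correlation matrix.
vif = np.diag(np.linalg.inv(R))

# Cross-check for x1: regress it on the other predictors and use 1/(1 - R^2).
others = np.column_stack([np.ones(n), x2, x3])
fitted = others @ np.linalg.lstsq(others, x1, rcond=None)[0]
r2_1 = 1.0 - np.sum((x1 - fitted) ** 2) / np.sum((x1 - x1.mean()) ** 2)
vif_1_direct = 1.0 / (1.0 - r2_1)

print("VIF values:", vif)
```

In this simulated setting the two nearly identical predictors receive large VIFs, while the unrelated one stays close to 1.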
- Condition Number
The eigenvalues of the correlation matrix can also be used to measure the presence of multicollinearity. If multicollinearity is present in the predictor variables, one or more of the eigenvalues will be small. Let $\lambda_1, \lambda_2, \dots, \lambda_p$ be the eigenvalues of the correlation matrix. The condition number of the correlation matrix is defined as:
$$\kappa = \frac{\lambda_{\max}}{\lambda_{\min}} \tag{9}$$
If the condition number is less than 100, there is no serious problem with multicollinearity; a condition number between 100 and 1000 implies moderate to strong multicollinearity; and a condition number exceeding 1000 indicates severe multicollinearity (Alibuhatto, 2016).
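The rule of thumb above can be sketched in Python with NumPy as follows. The function names `condition_number` and `severity` are chosen for this example only, and the 2 x 2 correlation matrix is hypothetical. It is used because its eigenvalues, $1 + r$ and $1 - r$, are known in closed form, so that $r = 0.99$ gives a condition number of $1.99 / 0.01 = 199$.

```python
import numpy as np

def condition_number(R):
    """Ratio of the largest to the smallest eigenvalue of a symmetric matrix."""
    eigenvalues = np.linalg.eigvalsh(R)  # returned in ascending order
    return eigenvalues[-1] / eigenvalues[0]

def severity(cn):
    """Map a condition number to the rule-of-thumb categories quoted above."""
    if cn < 100:
        return "no serious multicollinearity"
    if cn <= 1000:
        return "moderate to strong multicollinearity"
    return "severe multicollinearity"

# Hypothetical correlation matrix with r = 0.99.
R = np.array([[1.0, 0.99],
              [0.99, 1.0]])
cn = condition_number(R)
label = severity(cn)
print(cn, label)
```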
- Eigenstructure of $X'X$: let $\lambda_1, \lambda_2, \dots, \lambda_p$ be the eigenvalues of $X'X$. When at least one eigenvalue is close to zero, multicollinearity exists (Dereny et al., 2011).
- Checking the relationship between the F and t tests may provide some indication of the presence of multicollinearity. If the overall significance of the model is good according to the F-test but the individual coefficients are not significant according to the t-test, the model might suffer from multicollinearity (El-Sibakhi, 2016), (Raheem et al., 2019).
2.4. Ridge Regression
Ridge regression is one of the methods for dealing with the multicollinearity problem (Kamel and Aboud, 2013). A possible remedy to this problem is the ridge estimator, proposed by Hoerl and Kennard in 1970 (Gullkey and Murrhy, 1975), (Kamel and Aboud, 2013). It reduces the variance of the estimates at the expense of introducing some degree of bias. This is accomplished by adding a small positive number, $k$, to each of the diagonal elements of the correlation matrix. The ridge estimator is as follows (Fitrianto and Yik, 2014):
$$\hat{\beta}_R = (X'X + kI)^{-1}X'Y \tag{10}$$
where $I$ denotes the identity matrix and $k$ is the ridge parameter.
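The following is a minimal sketch of the ridge estimator in Python with NumPy, under the assumption that the predictors and the response are centered and scaled to unit length, so that the cross-product matrix of the predictors is their correlation matrix. The helper names `standardize` and `ridge`, the simulated data, and the grid of $k$ values are assumptions made for the illustration.

```python
import numpy as np

def standardize(X, y):
    """Center and scale to unit length so that Z'Z is the correlation matrix."""
    Xc = X - X.mean(axis=0)
    Z = Xc / np.sqrt((Xc ** 2).sum(axis=0))
    yc = y - y.mean()
    return Z, yc / np.sqrt((yc ** 2).sum())

def ridge(Z, ys, k):
    """Ridge estimate (Z'Z + kI)^{-1} Z'y for a ridge parameter k >= 0."""
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + k * np.eye(p), Z.T @ ys)

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # nearly collinear with x1 (assumed data)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(size=n)

Z, ys = standardize(X, y)
ks = [0.0, 0.01, 0.1, 1.0]
coefs = {k: ridge(Z, ys, k) for k in ks}
# Length of the coefficient vector for each k; it shrinks as k grows.
lengths = [float(np.linalg.norm(coefs[k])) for k in ks]
print(lengths)
```

With $k = 0$ the function returns the OLS solution on the standardized data, and the length of the coefficient vector decreases as $k$ increases.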
The ridge regression estimator has several properties, which can be summarized as follows:
(11)
Where
(12)
(13)
The ridge estimator is biased, but it reduces the variance of the estimates and is the coefficient vector with minimum length. The MSE of the ridge estimator is given by:
(14)
3. Application Part
The data were obtained from the Meteorological Directorate of Sulaimani for the period January 2012 to August 2017. NCSS19 and SPSS22 were used to arrive at an appropriate model.
The data include one response variable ($y$) and seven explanatory variables ($x_1, x_2, \dots, x_7$):
$y$: Rainfall
$x_1$: Average Temperature
$x_2$: Relative Humidity
$x_3$: Wind Speed
$x_4$: Average Vapors
$x_5$: Sunshine
$x_6$: Station Pressure
$x_7$: Soil Temperature
Some of the variables are significantly related, as shown in Table (1). The correlation matrix shows highly significant relationships between several pairs of variables, which indicates the presence of multicollinearity among the independent variables.
Table 1: Correlation matrix of the variables
| Variables | $x_1$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ | $x_6$ | $x_7$ | $y$ |
|---|---|---|---|---|---|---|---|---|
| $x_1$ | 1 | | | | | | | |
| $x_2$ | -.893** | 1 | | | | | | |
| $x_3$ | .174 | -.171 | 1 | | | | | |
| $x_4$ | .854** | -.624** | .201 | 1 | | | | |
| $x_5$ | .846** | -.777** | .321** | .678** | 1 | | | |
| $x_6$ | -.564** | .566** | .332** | -.347** | -.343** | 1 | | |
| $x_7$ | .932** | -.827** | .057 | .804** | .748** | -.522** | 1 | |
| $y$ | -.665** | .635** | -.159 | -.526** | -.636** | .348** | -.596** | 1 |

** Correlation is significant at the 0.01 level.
The existence of multicollinearity was investigated using the variance inflation factor (VIF) and the condition number. The VIFs for all independent variables are as follows:
, , , , , ,
The VIF results reveal the presence of multicollinearity, since the largest VIF (36.854) is greater than 10. This result confirms a high level of multicollinearity among the independent variables.
The eigenvalues of the correlation matrix are as follows:
, , , , , ,
The condition number is 215.44, which is greater than 100.
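As a rough check of these diagnostics, the VIFs, the eigenvalues, and the condition number can be recomputed from the predictor correlations printed in Table 1. The sketch below, in Python with NumPy, assumes that the first seven rows of Table 1 correspond to the seven explanatory variables in the order listed earlier. Because the table is rounded to three decimals, the output need not match the reported values exactly.

```python
import numpy as np

# Lower triangle of the predictor correlation matrix as printed in Table 1
# (first seven rows, rounded to three decimals in the source).
lower = [
    [1.0],
    [-0.893, 1.0],
    [0.174, -0.171, 1.0],
    [0.854, -0.624, 0.201, 1.0],
    [0.846, -0.777, 0.321, 0.678, 1.0],
    [-0.564, 0.566, 0.332, -0.347, -0.343, 1.0],
    [0.932, -0.827, 0.057, 0.804, 0.748, -0.522, 1.0],
]

# Fill a symmetric 7 x 7 matrix from the lower triangle.
p = len(lower)
R = np.eye(p)
for i, row in enumerate(lower):
    for j, r in enumerate(row):
        R[i, j] = R[j, i] = r

vif = np.diag(np.linalg.inv(R))            # one VIF per predictor
eigenvalues = np.linalg.eigvalsh(R)        # ascending order
condition_number = eigenvalues[-1] / eigenvalues[0]

print("VIF:", np.round(vif, 3))
print("eigenvalues:", np.round(eigenvalues, 4))
print("condition number:", round(float(condition_number), 2))
```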
The results also indicate the presence of strong multicollinearity between the variables. To estimate the coefficients with minimum variance, this multicollinearity needs to be resolved. Table (2) gives the parameter estimates calculated with k in the range [0, 1], which show the effect on the coefficients of resolving the multicollinearity with the ridge regression technique.
Table 2: Standardized ridge regression coefficients and max VIF.
| K | $x_1$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ | $x_6$ | $x_7$ | Max VIF |
|---|---|---|---|---|---|---|---|---|
| 0.000 | -0.539 | 0.151 | 0.024 | 0.052 | -0.214 | -0.038 | 0.129 | 36.854 |
| 0.001 | -0.521 | 0.155 | 0.023 | 0.047 | -0.217 | -0.035 | 0.123 | 33.601 |
| 0.002 | -0.505 | 0.159 | 0.022 | 0.042 | -0.219 | -0.033 | 0.117 | 30.761 |
| 0.003 | -0.489 | 0.162 | 0.021 | 0.038 | -0.221 | -0.031 | 0.112 | 28.267 |
| 0.004 | -0.475 | 0.165 | 0.021 | 0.034 | -0.223 | -0.029 | 0.107 | 26.066 |
| 0.005 | -0.463 | 0.168 | 0.020 | 0.030 | -0.225 | -0.028 | 0.102 | 24.112 |
| 0.006 | -0.451 | 0.171 | 0.019 | 0.027 | -0.227 | -0.026 | 0.098 | 22.371 |
| 0.007 | -0.439 | 0.173 | 0.018 | 0.024 | -0.228 | -0.025 | 0.094 | 20.812 |
| 0.008 | -0.429 | 0.175 | 0.018 | 0.021 | -0.229 | -0.023 | 0.090 | 19.412 |
| 0.009 | -0.419 | 0.177 | 0.017 | 0.018 | -0.231 | -0.022 | 0.086 | 18.148 |
| 0.010 | -0.410 | 0.178 | 0.016 | 0.016 | -0.232 | -0.020 | 0.083 | 17.004 |
| 0.020 | -0.344 | 0.190 | 0.011 | -0.001 | -0.239 | -0.011 | 0.055 | 9.767 |
| 0.030 | -0.304 | 0.194 | 0.006 | -0.012 | -0.241 | -0.004 | 0.035 | 6.343 |
| 0.040 | -0.276 | 0.196 | 0.003 | -0.019 | -0.241 | 0.000 | 0.020 | 4.456 |
| 0.050 | -0.256 | 0.197 | 0.000 | -0.024 | -0.242 | 0.004 | 0.009 | 3.307 |
| 0.060 | -0.241 | 0.196 | -0.002 | -0.028 | -0.241 | 0.008 | -0.000 | 2.860 |
| 0.070 | -0.230 | 0.195 | -0.005 | -0.031 | -0.240 | 0.010 | -0.008 | 2.546 |
| 0.080 | -0.220 | 0.194 | -0.007 | -0.034 | -0.238 | 0.013 | -0.015 | 2.285 |
| 0.090 | -0.212 | 0.193 | -0.009 | -0.036 | -0.236 | 0.015 | -0.021 | 2.065 |
| 0.100 | -0.206 | 0.192 | -0.010 | -0.038 | -0.234 | 0.017 | -0.027 | 1.878 |
| 0.200 | -0.172 | 0.178 | -0.021 | -0.051 | -0.232 | 0.029 | -0.058 | 1.007 |
| 0.300 | -0.158 | 0.168 | -0.027 | -0.058 | -0.212 | 0.036 | -0.072 | 0.680 |
| 0.400 | -0.149 | 0.159 | -0.030 | -0.063 | -0.196 | 0.040 | -0.080 | 0.515 |
| 0.500 | -0.143 | 0.152 | -0.031 | -0.067 | -0.174 | 0.042 | -0.084 | 0.421 |
| 0.600 | -0.138 | 0.147 | -0.032 | -0.069 | -0.166 | 0.044 | -0.087 | 0.362 |
| 0.700 | -0.134 | 0.142 | -0.032 | -0.071 | -0.159 | 0.045 | -0.088 | 0.316 |
| 0.800 | -0.131 | 0.137 | -0.032 | -0.072 | -0.153 | 0.046 | -0.089 | 0.279 |
| 0.900 | -0.128 | 0.133 | -0.032 | -0.072 | -0.147 | 0.046 | -0.090 | 0.248 |
| 1.000 | -0.125 | 0.129 | -0.032 | -0.073 | -0.142 | 0.046 | -0.089 | 0.223 |
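One common heuristic for reading a table such as Table 2 is to take the smallest $k$ on the grid for which the maximum VIF falls below 10. The study does not state its selection rule, so the sketch below, in plain Python, is only an assumed illustration of that heuristic applied to the (k, Max VIF) pairs copied from Table 2.

```python
# (k, max VIF) pairs copied from Table 2.
ridge_trace = [
    (0.000, 36.854), (0.001, 33.601), (0.002, 30.761), (0.003, 28.267),
    (0.004, 26.066), (0.005, 24.112), (0.006, 22.371), (0.007, 20.812),
    (0.008, 19.412), (0.009, 18.148), (0.010, 17.004), (0.020, 9.767),
    (0.030, 6.343), (0.040, 4.456), (0.050, 3.307), (0.060, 2.860),
    (0.070, 2.546), (0.080, 2.285), (0.090, 2.065), (0.100, 1.878),
    (0.200, 1.007), (0.300, 0.680), (0.400, 0.515), (0.500, 0.421),
    (0.600, 0.362), (0.700, 0.316), (0.800, 0.279), (0.900, 0.248),
    (1.000, 0.223),
]

threshold = 10.0  # conventional VIF cut-off (an assumption of this sketch)

# Smallest k on the grid whose maximum VIF is below the threshold.
chosen_k = next(k for k, max_vif in ridge_trace if max_vif < threshold)
print("smallest k with max VIF below", threshold, "is", chosen_k)
```

Under this heuristic the grid point selected is k = 0.020, the first row of Table 2 whose maximum VIF (9.767) is below 10.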
Table (3) summarizes the regression coefficients and their standard errors obtained by analyzing the data with both the OLS and RR methods.
Table 3: Regression coefficients and standard errors
| Independent variable | Ridge Coefficient | Least Square Coefficient | Ridge Standard Error | Least Square Standard Error |
|---|---|---|---|---|
| intercept | 204.995 | 428.476 | | |
| $x_1$ | -2.544 | -3.986 | 1.977 | 3.821 |
| $x_2$ | 0.779 | 0.621 | 0.735 | 0.975 |
| $x_3$ | 1.261 | 2.804 | 11.906 | 12.688 |
| $x_4$ | -0.053 | 1.604 | 4.843 | 6.701 |
| $x_5$ | -6.902 | -6.195 | 4.843 | 5.239 |
| $x_6$ | -0.101 | -0.346 | 1.091 | 1.234 |
| $x_7$ | 0.538 | 1.266 | 1.909 | 2.489 |
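For the period January 2012 to August 2017, the ridge parameter was $k = 0.02$, and the ridge regression model, which indicates the effects of the independent variables on rainfall in Sulaimani, is estimated as (coefficients from Table 3, with $x_1, \dots, x_7$ in the order of the variable list above)
$$\hat{y} = 204.995 - 2.544x_1 + 0.779x_2 + 1.261x_3 - 0.053x_4 - 6.902x_5 - 0.101x_6 + 0.538x_7$$
and the ordinary least squares model is estimated as
$$\hat{y} = 428.476 - 3.986x_1 + 0.621x_2 + 2.804x_3 + 1.604x_4 - 6.195x_5 - 0.346x_6 + 1.266x_7$$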
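Table 4: Analysis of variance for k = 0.02
| Source | DF | Sum of Squares | Mean Square | F | Prob. |
|---|---|---|---|---|---|
| Intercept | 1 | 250937.5 | 250937.5 | | |
| Model | 7 | 190049.7 | 27149.96 | 9.0330 | 0.00* |
| Error | 73 | 219412.1 | 3005.645 | | |
| Total (adjusted) | 80 | 409461.8 | 5118.272 | | |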
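The entries of Table 4 can be checked for internal consistency with a few lines of plain Python. The row labels used in the variable names (model, error, total) are inferred from the degrees of freedom, since the labels did not survive in the table, and should be read as an assumption. The sketch recomputes the mean squares and the F ratio from the sums of squares.

```python
# Values copied from Table 4; row labels are inferred from the degrees of freedom.
df_model, ss_model, ms_model = 7, 190049.7, 27149.96
df_error, ss_error, ms_error = 73, 219412.1, 3005.645
df_total, ss_total = 80, 409461.8

# Mean square = sum of squares / degrees of freedom.
ms_model_check = ss_model / df_model
ms_error_check = ss_error / df_error

# F ratio = model mean square / error mean square.
f_ratio = ms_model_check / ms_error_check

# Share of the adjusted total sum of squares attributed to the model.
ss_share = ss_model / ss_total

print(round(ms_model_check, 2), round(ms_error_check, 3), round(f_ratio, 4))
```

The recomputed F ratio agrees with the tabulated value of 9.0330 to the printed precision.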
The root mean square errors of the regression coefficients for the RR and OLS methods are as follows:
,
The coefficients of determination ($R^2$) for the two models are as follows:
,
Comparing ridge regression with ordinary least squares, we note that the ridge regression model is better than the ordinary least squares model when multicollinearity exists, because it has smaller mean square errors of the estimators, smaller standard errors for all estimators, and a larger coefficient of determination.
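The statement about the standard errors can be checked directly against Table 3. The short plain-Python sketch below copies the two standard error columns from that table (rows in table order, intercept excluded because no standard error is reported for it) and compares them element by element.

```python
# Standard errors copied from Table 3, in table order (intercept excluded).
ridge_se = [1.977, 0.735, 11.906, 4.843, 4.843, 1.091, 1.909]
ols_se = [3.821, 0.975, 12.688, 6.701, 5.239, 1.234, 2.489]

# Ratio below 1 means the ridge standard error is smaller than the OLS one.
ratios = [r / o for r, o in zip(ridge_se, ols_se)]
all_smaller = all(r < o for r, o in zip(ridge_se, ols_se))

print("ridge/OLS standard error ratios:", [round(v, 3) for v in ratios])
print("ridge standard error smaller for every coefficient:", all_smaller)
```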
4. Conclusions
According to the results of this study, multicollinearity was detected: the largest variance inflation factor (36.854) is greater than 10 and the condition number (215.44) is greater than 100, which confirms that the multicollinearity problem exists. The variables most directly related to the amount of rainfall are average temperature (correlation -0.665), followed by sunshine (-0.636), relative humidity (0.635), soil temperature (-0.596), and then the other meteorological variables. The value k = 0.02 is the optimal value that resolves the multicollinearity problem. The ridge regression model is better than the ordinary least squares model when multicollinearity exists, because it has smaller mean square errors of the estimators, smaller standard errors for all estimators, and a larger coefficient of determination.