Introduction
Many real data problems in areas such as automobile insurance claims, healthcare economics, and medical science can be studied using the Gamma Regression Model (GRM) (1, 2, 3). A GRM is used particularly when a study's response variable is positively skewed or not normally distributed. Accordingly, gamma regression assumes that the response variable follows a gamma distribution (4, 5, 6).
The GRM presumes that there is no correlation between the regressors. In practice, however, this assumption frequently fails, which gives rise to multicollinearity. In the presence of multicollinearity, gamma regression coefficients estimated by the maximum likelihood (ML) approach are typically unstable, with large variances and poor statistical significance (7, 8). Many remedies for multicollinearity have been proposed. Ridge regression (9) has frequently been shown to be a desirable alternative to ML estimation.
The following relationship is typically used in classical linear regression models:
$$y = X\beta + \varepsilon, \tag{1}$$
where $y$ is an $n \times 1$ vector of response variable observations, $X$ is a known $n \times p$ design matrix of explanatory variables, $\beta$ is a $p \times 1$ vector of unknown regression coefficients, and $\varepsilon$ is an $n \times 1$ vector of random errors with mean 0 and variance $\sigma^2$.
To reduce the high variance, the ridge regression shrinkage approach shrinks all regression coefficients toward zero (7, 10). This is achieved by adding a positive quantity to the diagonal of $X^\top X$. Although this introduces bias, the ridge estimator can attain a lower mean squared error than the ML estimator.
The ridge estimator in linear regression is defined as:
$$\hat{\beta}_{Ridge} = (X^\top X + kI)^{-1} X^\top y, \tag{2}$$
where $I$ is the $p \times p$ identity matrix and $k \ge 0$ is the ridge parameter (shrinkage parameter), which controls the shrinkage of $\hat{\beta}$ toward zero. A larger value of $k$ yields greater shrinkage of the estimator (9).
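The effect of the ridge parameter can be illustrated with a minimal sketch. The data, the seed, and the function name `ridge_estimator` are illustrative assumptions and do not come from the study; the code only evaluates the closed-form ridge expression for linear regression.

```python
import numpy as np

def ridge_estimator(X, y, k):
    """Return (X'X + kI)^{-1} X'y for a ridge parameter k >= 0."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

# Illustrative data with two nearly collinear columns (arbitrary values).
rng = np.random.default_rng(0)
z = rng.standard_normal(50)
X = np.column_stack([z, z + 0.01 * rng.standard_normal(50)])
y = X @ np.array([1.0, 1.0]) + rng.standard_normal(50)

beta_ols = ridge_estimator(X, y, k=0.0)    # k = 0 reproduces least squares
beta_ridge = ridge_estimator(X, y, k=1.0)  # k > 0 shrinks the coefficients

print("OLS  :", beta_ols)
print("Ridge:", beta_ridge)
```

With collinear columns the least-squares coefficients tend to be large and unstable, while the ridge coefficients have a smaller norm.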
Statistical Methodology
Gamma Ridge Regression Model (GRRM)
Positively skewed data are common in studies in sociology, economics, and epidemiology. Such data contain no negative values, which makes the gamma distribution a natural choice for modeling them (5). Let $y_i$ be the response variable, following a gamma distribution with nonnegative shape parameter $\nu$ and nonnegative scale parameter $\tau$, i.e. $y_i \sim \mathrm{Gamma}(\nu, \tau)$. Then the probability density function is defined as (6, 11):
$$f(y_i) = \frac{1}{\Gamma(\nu)\,\tau^{\nu}}\, y_i^{\nu - 1} e^{-y_i/\tau}, \qquad y_i \ge 0, \tag{3}$$
with $E(y_i) = \nu\tau = \theta_i$ and $\mathrm{var}(y_i) = \nu\tau^2 = \theta_i^2/\nu$. When the parameter $\nu$ is known, the variance of the response variable is therefore proportional to the square of its mean.
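As a quick numerical check of the mean-variance relationship, the following sketch evaluates the gamma density and its first two moments under the shape/scale parameterization. The function names and the parameter values are illustrative assumptions.

```python
import math

def gamma_pdf(y, shape, scale):
    """Gamma density with the given shape and scale, for y > 0."""
    log_f = ((shape - 1.0) * math.log(y) - y / scale
             - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_f)

def gamma_mean_var(shape, scale):
    """Mean and variance implied by the shape/scale parameterization."""
    mean = shape * scale
    var = shape * scale ** 2
    return mean, var

shape, scale = 4.0, 0.5  # illustrative values only
mean, var = gamma_mean_var(shape, scale)

# The variance equals mean**2 / shape, i.e. it is proportional to the
# square of the mean when the shape parameter is held fixed.
print(mean, var, mean ** 2 / shape)
```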
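In a GRM, the mean $\theta_i = \exp(x_i^\top \beta)$ is expressed through a linear combination of the regressors $x_i = (x_{i1}, \dots, x_{ip})^\top$. The relationship $\log(\theta_i) = x_i^\top \beta$ is called the log link function; it makes the relationship between the predictors and the transformed mean of the response variable linear. The log link function is used instead of the canonical link function (the reciprocal link function, $\theta_i = 1/(x_i^\top \beta)$) because it ensures that $\theta_i > 0$.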
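Maximum likelihood is the most common way to estimate the GRM coefficients. Assuming that the observations are independent, the log-likelihood function is given by:
$$\ell(\beta) = \sum_{i=1}^{n}\left[(\nu - 1)\log y_i - \frac{\nu y_i}{\theta_i} - \nu \log \theta_i + \nu \log \nu - \log \Gamma(\nu)\right], \tag{4}$$
where $\theta_i = \exp(x_i^\top \beta)$ and $\nu$ is the shape parameter. The first derivative of Eq. (4) with respect to $\beta$ is then calculated and set to zero to obtain the ML estimator:
$$\frac{\partial \ell(\beta)}{\partial \beta} = \nu \sum_{i=1}^{n} \frac{y_i - \theta_i}{\theta_i}\, x_i = 0. \tag{5}$$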
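Unfortunately, Eq. (5) cannot be solved analytically because it is nonlinear in $\beta$. The ML estimators of the gamma regression parameters may be obtained using either the iteratively weighted least squares (IWLS) technique or the Newton-Raphson approach. In each iteration, the parameters are updated by:
$$\beta^{(r+1)} = \beta^{(r)} + \mathcal{I}^{-1}(\beta^{(r)})\, S(\beta^{(r)}), \tag{6}$$
where $S(\beta) = \partial \ell(\beta)/\partial \beta$ and $\mathcal{I}(\beta) = -E\left(\partial^2 \ell(\beta)/\partial \beta\, \partial \beta^\top\right)$. At convergence, the estimated coefficients are given by
$$\hat{\beta}_{ML} = (X^\top \hat{W} X)^{-1} X^\top \hat{W} \hat{u}, \tag{7}$$
where $\hat{W}$ is the diagonal matrix of IWLS weights and $\hat{u}$ is the adjusted response vector whose $i$th element, under the log link, equals $\log \hat{\theta}_i + (y_i - \hat{\theta}_i)/\hat{\theta}_i$. The ML estimator is asymptotically normally distributed, with a covariance matrix equal to the inverse of the negative Hessian matrix.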
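The iterative scheme can be sketched as follows. This is a minimal illustration that assumes the standard GLM working weights and adjusted response for a gamma model with log link (under that convention the working weights are all equal to one); the function name, the simulated data, and the stopping rule are assumptions and not the study's implementation.

```python
import numpy as np

def fit_gamma_log_link(X, y, n_iter=50, tol=1e-10):
    """IWLS for a gamma GLM with log link; returns the coefficient vector."""
    # Starting values: least-squares fit to log(y).
    beta = np.linalg.lstsq(X, np.log(y), rcond=None)[0]
    for _ in range(n_iter):
        eta = X @ beta
        theta = np.exp(eta)
        # Log link with gamma variance function gives unit working weights.
        w = np.ones_like(y)
        u = eta + (y - theta) / theta  # adjusted (working) response
        XtW = X.T * w
        beta_new = np.linalg.solve(XtW @ X, XtW @ u)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Illustrative simulated data (arbitrary coefficients and shape parameter).
rng = np.random.default_rng(0)
n, nu = 500, 5.0
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta_true = np.array([0.5, -0.3])
y = rng.gamma(shape=nu, scale=np.exp(X @ beta_true) / nu)

beta_hat = fit_gamma_log_link(X, y)
print(beta_hat)
```

At convergence the score equations are satisfied up to numerical tolerance, which is what the update rule is designed to achieve.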
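The mean squared error (MSE) of the estimator in Eq. (7) can be calculated as follows:
$$\mathrm{MSE}(\hat{\beta}_{ML}) = E\left[(\hat{\beta}_{ML} - \beta)^\top (\hat{\beta}_{ML} - \beta)\right] = \phi \sum_{j=1}^{p} \frac{1}{\lambda_j},$$
where $\phi$ is the dispersion parameter and $\lambda_j$ is the $j$th eigenvalue of the $X^\top \hat{W} X$ matrix. In the presence of multicollinearity, the matrix $X^\top \hat{W} X$ becomes ill-conditioned, so the ML estimator of the gamma regression parameters becomes unstable and has an excessively large variance. As a remedy, the gamma ridge regression model (GRRM) can be described as:
$$\hat{\beta}_{GRRM} = (X^\top \hat{W} X + kI)^{-1} X^\top \hat{W} X\, \hat{\beta}_{ML}, \tag{10}$$
where $k \ge 0$. The ML estimator can be regarded as the special case of Eq. (10) with $k = 0$.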
Generalized ridge estimator
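The generalized ridge estimator (GRE) differs from the ordinary ridge estimator in that it allows a separate shrinkage parameter $k_j$ for each regression coefficient instead of a single value $k$ (9):
$$\hat{\beta}_{GRE} = (X^\top X + K)^{-1} X^\top y,$$
where $K = \mathrm{diag}(k_1, k_2, \dots, k_p)$ with $k_j \ge 0$. With well-chosen values of $k_j$, the GRE can achieve a smaller MSE than both the ridge estimator and OLS.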
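The GRE for the gamma regression model (GRM), referred to here as the gamma generalized ridge regression model (GGRRM), is defined as:
$$\hat{\beta}_{GGRRM} = (X^\top \hat{W} X + K)^{-1} X^\top \hat{W} X\, \hat{\beta}_{ML}.$$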
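A minimal sketch of a generalized ridge step for a fitted gamma regression is given below. It assumes the estimator takes the form $(X^\top W X + K)^{-1} X^\top W X \hat{\beta}_{ML}$ with a diagonal $K$; the function name and the inputs `w` (IWLS weights) and `beta_ml` (ML estimate) are hypothetical placeholders. Passing a scalar reproduces an ordinary gamma ridge step with a single shrinkage parameter.

```python
import numpy as np

def gamma_generalized_ridge(X, w, beta_ml, k):
    """Shrink beta_ml using K = diag(k_1, ..., k_p); k may be a scalar."""
    A = (X.T * w) @ X  # X'WX with W = diag(w)
    p = A.shape[0]
    K = np.diag(np.broadcast_to(np.asarray(k, dtype=float), (p,)))
    return np.linalg.solve(A + K, A @ beta_ml)

# Illustrative inputs (arbitrary values, unit weights).
rng = np.random.default_rng(3)
X = rng.standard_normal((30, 3))
w = np.ones(30)
beta_ml = np.array([1.0, -2.0, 0.5])

print(gamma_generalized_ridge(X, w, beta_ml, 0.0))              # equals beta_ml
print(gamma_generalized_ridge(X, w, beta_ml, 2.0))              # single k
print(gamma_generalized_ridge(X, w, beta_ml, [0.5, 1.0, 5.0]))  # one k_j each
```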
The matrix $K$ must be selected carefully. In this study, several approaches are adapted to estimate $K$, including (9, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22). These approaches are listed below, in order.
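In these expressions, $\hat{\alpha}_j$ is defined as the $j$th element of $\hat{\alpha} = Q^\top \hat{\beta}_{ML}$, $Q$ is the matrix whose columns are the eigenvectors of $X^\top \hat{W} X$, and the dispersion parameter, $\phi$, is estimated by
$$\hat{\phi} = \frac{1}{n - p} \sum_{i=1}^{n} \frac{(y_i - \hat{\theta}_i)^2}{\hat{\theta}_i^2}.$$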
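The dispersion estimate can be computed directly from the fitted means. The sketch below assumes a Pearson-type estimator, the sum of squared relative residuals divided by $n - p$; this form is an assumption, and the function name and input values are illustrative.

```python
def pearson_dispersion(y, theta_hat, p):
    """Pearson-type dispersion: sum(((y - theta) / theta)^2) / (n - p)."""
    n = len(y)
    total = sum(((yi - ti) / ti) ** 2 for yi, ti in zip(y, theta_hat))
    return total / (n - p)

# Illustrative observed values and fitted means (arbitrary numbers).
y_obs = [1.0, 2.0, 3.0, 4.0]
fitted = [1.0, 2.0, 2.0, 5.0]
phi_hat = pearson_dispersion(y_obs, fitted, p=2)
print(phi_hat)
```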
Modeling and simulation
Monte Carlo simulations are used to examine the effectiveness of these approaches under the GGRRM at different levels of multicollinearity.
The design of simulations
Following (8, 11, 23, 24, 25, 26, 27, 28), the response variable of the GRM for $n$ observations is generated from a gamma distribution with mean $\theta_i = \exp(x_i^\top \beta)$, where $\beta$ is chosen such that $\sum_{j=1}^{p} \beta_j^2 = 1$ and $\beta_1 = \beta_2 = \dots = \beta_p$ (29), for the considered values of the shape parameter $\nu$. The explanatory variables are generated from the following formula:
$$x_{ij} = (1 - \rho^2)^{1/2} w_{ij} + \rho\, w_{i,p+1}, \qquad i = 1, \dots, n, \quad j = 1, \dots, p,$$
where $\rho$ controls the correlation between the explanatory variables and the $w_{ij}$ are independent standard normal pseudo-random numbers. Three representative sample sizes, $n = 50$, 100, and 200, are considered because the sample size directly influences prediction accuracy. Two values of the number of explanatory variables, $p$, are considered because increasing $p$ may increase the MSE. Further, because the influence of multicollinearity is of primary interest, three values of the pairwise correlation $\rho$ are considered. The data generation is repeated 1,000 times for each combination of $n$, $p$, and $\rho$, and the averaged mean squared error (MSE) is determined as follows:
$$\mathrm{MSE}(\hat{\beta}) = \frac{1}{1000} \sum_{r=1}^{1000} (\hat{\beta}_r - \beta)^\top (\hat{\beta}_r - \beta),$$
where $\hat{\beta}_r$ is the estimate obtained in the $r$th replication.
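The data-generating step and the averaged MSE can be sketched as follows. The sketch assumes the common construction in which each regressor mixes its own standard normal draw with one shared draw, so that any two regressors have correlation $\rho^2$; the function names and the settings ($n = 200$, $p = 4$, $\rho = 0.95$) are illustrative assumptions rather than the study's configuration.

```python
import numpy as np

def make_regressors(n, p, rho, rng):
    """x_ij = sqrt(1 - rho^2) * w_ij + rho * w_{i,p+1}, w standard normal."""
    w = rng.standard_normal((n, p + 1))
    return np.sqrt(1.0 - rho ** 2) * w[:, :p] + rho * w[:, [p]]

def average_mse(estimates, beta):
    """Mean over replications of (beta_hat - beta)'(beta_hat - beta)."""
    diff = np.asarray(estimates, dtype=float) - np.asarray(beta, dtype=float)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

rng = np.random.default_rng(1)
X = make_regressors(n=200, p=4, rho=0.95, rng=rng)

# Sample correlation between the first two regressors (close to rho**2).
print(np.corrcoef(X[:, 0], X[:, 1])[0, 1])
```

In a full simulation, `average_mse` would receive one row of estimated coefficients per replication for each estimator being compared.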
The results of simulations
Six tables show the averaged MSE for the combinations of $n$, $p$, and $\rho$. In each table, the best MSE value is highlighted. The following observations can be made:
- GRRM frequently has a lower MSE than MLE.
- GGRRM achieved a lower MSE than GRRM, regardless of the method used to estimate the matrix $K$.
- A comparison of the F method with the other approaches revealed that the gamma generalized ridge estimator was substantially improved by the method of Firinguetti (15), given in Eq. (16). The HK and SB procedures consistently performed poorly compared with the other approaches tested.
- MSE values increase as the degree of correlation $\rho$ increases, regardless of the values of $n$ and $p$.
- The number of explanatory variables has a negative impact on the MSE: its values rise as $p$ increases.
- The MSE values decrease with increasing $n$, regardless of the values of $p$, $\rho$, or $\nu$.
- As $\nu$ increases, the MSE of all methods decreases for fixed $n$, $p$, and degree of multicollinearity.
Table 1: Average MSE values when and
| Methods | | | | | | |
|---|---|---|---|---|---|---|
| MLE | 3.2411 | 3.4042 | 3.6188 | 3.1475 | 3.3091 | 3.5793 |
| GRRM | 1.7671 | 1.7884 | 1.7661 | 1.6512 | 1.6981 | 1.7438 |
| HK | 1.3218 | 1.3761 | 1.3907 | 1.3147 | 1.3286 | 1.3408 |
| N | 0.9852 | 0.9934 | 0.9958 | 0.9780 | 0.9762 | 0.9714 |
| TC | 1.0638 | 1.1181 | 1.1327 | 1.0567 | 1.0706 | 1.0828 |
| F | 0.5384 | 0.5927 | 0.6073 | 0.5313 | 0.5452 | 0.5574 |
| HSL | 0.9508 | 1.0051 | 1.0197 | 0.9437 | 0.9576 | 0.9698 |
| AH | 0.876 | 0.9303 | 0.9449 | 0.8689 | 0.8828 | 0.895 |
| D | 0.7713 | 0.7821 | 0.7877 | 0.7629 | 0.7711 | 0.7836 |
| SB | 1.0685 | 1.1228 | 1.1374 | 1.0614 | 1.0753 | 1.0875 |
| SV1 | 0.9094 | 0.9637 | 0.9783 | 0.9023 | 0.9162 | 0.9284 |
| SV2 | 0.8547 | 0.909 | 0.9236 | 0.8476 | 0.8615 | 0.8737 |
| M | 0.8735 | 0.9273 | 0.9419 | 0.8659 | 0.8798 | 0.8921 |
| AS | 0.9344 | 0.9887 | 1.0033 | 0.9273 | 0.9412 | 0.9534 |
Table 2: Average MSE values when and
| Methods | | | | | | |
|---|---|---|---|---|---|---|
| MLE | 3.5658 | 3.7289 | 3.9435 | 3.4722 | 3.6338 | 3.904 |
| GRRM | 2.0918 | 2.1131 | 2.0908 | 1.9759 | 2.0228 | 2.0685 |
| HK | 1.6465 | 1.7008 | 1.7154 | 1.6394 | 1.6533 | 1.6655 |
| N | 1.3099 | 1.3181 | 1.3205 | 1.3027 | 1.3009 | 1.2961 |
| TC | 1.3885 | 1.4428 | 1.4574 | 1.3814 | 1.3953 | 1.4075 |
| F | 0.8631 | 0.9174 | 0.932 | 0.856 | 0.8699 | 0.8821 |
| HSL | 1.2755 | 1.3298 | 1.3444 | 1.2684 | 1.2823 | 1.2945 |
| AH | 1.2007 | 1.255 | 1.2696 | 1.1936 | 1.2075 | 1.2197 |
| D | 1.096 | 1.1068 | 1.1124 | 1.0876 | 1.0958 | 1.1083 |
| SB | 1.3932 | 1.4475 | 1.4621 | 1.3861 | 1.4 | 1.4122 |
| SV1 | 1.2341 | 1.2884 | 1.303 | 1.227 | 1.2409 | 1.2531 |
| SV2 | 1.1794 | 1.2337 | 1.2483 | 1.1723 | 1.1862 | 1.1984 |
| M | 1.1982 | 1.252 | 1.2666 | 1.1906 | 1.2045 | 1.2168 |
| AS | 1.2591 | 1.3134 | 1.328 | 1.252 | 1.2659 | 1.2781 |
Table 3: Average MSE values when and
| Methods | | | | | | |
|---|---|---|---|---|---|---|
| MLE | 3.1355 | 3.2986 | 3.5132 | 3.0419 | 3.2035 | 3.4737 |
| GRRM | 1.6615 | 1.6828 | 1.6605 | 1.5456 | 1.5925 | 1.6382 |
| HK | 1.2162 | 1.2705 | 1.2851 | 1.2091 | 1.223 | 1.2352 |
| N | 0.8796 | 0.8878 | 0.8902 | 0.8724 | 0.8706 | 0.8658 |
| TC | 0.9582 | 1.0125 | 1.0271 | 0.9511 | 0.965 | 0.9772 |
| F | 0.4328 | 0.4871 | 0.5017 | 0.4257 | 0.4396 | 0.4518 |
| HSL | 0.8452 | 0.8995 | 0.9141 | 0.8381 | 0.852 | 0.8642 |
| AH | 0.7704 | 0.8247 | 0.8393 | 0.7633 | 0.7772 | 0.7894 |
| D | 0.6657 | 0.6765 | 0.6821 | 0.6573 | 0.6655 | 0.678 |
| SB | 0.9629 | 1.0172 | 1.0318 | 0.9558 | 0.9697 | 0.9819 |
| SV1 | 0.8038 | 0.8581 | 0.8727 | 0.7967 | 0.8106 | 0.8228 |
| SV2 | 0.7491 | 0.8034 | 0.818 | 0.742 | 0.7559 | 0.7681 |
| M | 0.7679 | 0.8217 | 0.8363 | 0.7603 | 0.7742 | 0.7865 |
| AS | 0.8288 | 0.8831 | 0.8977 | 0.8217 | 0.8356 | 0.8478 |
Table 4: Average MSE values when and
| Methods | | | | | | |
|---|---|---|---|---|---|---|
| MLE | 3.2564 | 3.4195 | 3.6341 | 3.1628 | 3.3244 | 3.5946 |
| GRRM | 1.7824 | 1.8037 | 1.7814 | 1.6665 | 1.7134 | 1.7591 |
| HK | 1.3371 | 1.3914 | 1.406 | 1.33 | 1.3439 | 1.3561 |
| N | 1.0005 | 1.0087 | 1.0111 | 0.9933 | 0.9915 | 0.9867 |
| TC | 1.0791 | 1.1334 | 1.148 | 1.072 | 1.0859 | 1.0981 |
| F | 0.5537 | 0.608 | 0.6226 | 0.5466 | 0.5605 | 0.5727 |
| HSL | 0.9661 | 1.0204 | 1.035 | 0.959 | 0.9729 | 0.9851 |
| AH | 0.8913 | 0.9456 | 0.9602 | 0.8842 | 0.8981 | 0.9103 |
| D | 0.7866 | 0.7974 | 0.803 | 0.7782 | 0.7864 | 0.7989 |
| SB | 1.0838 | 1.1381 | 1.1527 | 1.0767 | 1.0906 | 1.1028 |
| SV1 | 0.9247 | 0.979 | 0.9936 | 0.9176 | 0.9315 | 0.9437 |
| SV2 | 0.87 | 0.9243 | 0.9389 | 0.8629 | 0.8768 | 0.889 |
| M | 0.8888 | 0.9426 | 0.9572 | 0.8812 | 0.8951 | 0.9074 |
| AS | 0.9497 | 1.004 | 1.0186 | 0.9426 | 0.9565 | 0.9687 |
Table 5: Average MSE values when and
| Methods | | | | | | |
|---|---|---|---|---|---|---|
| MLE | 3.00376 | 3.16686 | 3.38146 | 2.91016 | 3.07176 | 3.34196 |
| GRRM | 1.52976 | 1.55106 | 1.52876 | 1.41386 | 1.46076 | 1.50646 |
| HK | 1.08446 | 1.13876 | 1.15336 | 1.07736 | 1.09126 | 1.10346 |
| N | 0.74786 | 0.75606 | 0.75846 | 0.74066 | 0.73886 | 0.73406 |
| TC | 0.82646 | 0.88076 | 0.89536 | 0.81936 | 0.83326 | 0.84546 |
| F | 0.30106 | 0.35536 | 0.36996 | 0.29396 | 0.30786 | 0.32006 |
| HSL | 0.71346 | 0.76776 | 0.78236 | 0.70636 | 0.72026 | 0.73246 |
| AH | 0.63866 | 0.69296 | 0.70756 | 0.63156 | 0.64546 | 0.65766 |
| D | 0.53396 | 0.54476 | 0.55036 | 0.52556 | 0.53376 | 0.54626 |
| SB | 0.83116 | 0.88546 | 0.90006 | 0.82406 | 0.83796 | 0.85016 |
| SV1 | 0.67206 | 0.72636 | 0.74096 | 0.66496 | 0.67886 | 0.69106 |
| SV2 | 0.61736 | 0.67166 | 0.68626 | 0.61026 | 0.62416 | 0.63636 |
| M | 0.63616 | 0.68996 | 0.70456 | 0.62856 | 0.64246 | 0.65476 |
| AS | 0.69706 | 0.75136 | 0.76596 | 0.68996 | 0.70386 | 0.71606 |
Table 6: Average MSE values when and
| Methods | | | | | | |
|---|---|---|---|---|---|---|
| MLE | 3.10106 | 3.26416 | 3.47876 | 3.00746 | 3.16906 | 3.43926 |
| GRRM | 1.62706 | 1.64836 | 1.62606 | 1.51116 | 1.55806 | 1.60376 |
| HK | 1.18176 | 1.23606 | 1.25066 | 1.17466 | 1.18856 | 1.20076 |
| N | 0.84516 | 0.85336 | 0.85576 | 0.83796 | 0.83616 | 0.83136 |
| TC | 0.92376 | 0.97806 | 0.99266 | 0.91666 | 0.93056 | 0.94276 |
| F | 0.39836 | 0.45266 | 0.46726 | 0.39126 | 0.40516 | 0.41736 |
| HSL | 0.81076 | 0.86506 | 0.87966 | 0.80366 | 0.81756 | 0.82976 |
| AH | 0.73596 | 0.79026 | 0.80486 | 0.72886 | 0.74276 | 0.75496 |
| D | 0.63126 | 0.64206 | 0.64766 | 0.62286 | 0.63106 | 0.64356 |
| SB | 0.92846 | 0.98276 | 0.99736 | 0.92136 | 0.93526 | 0.94746 |
| SV1 | 0.76936 | 0.82366 | 0.83826 | 0.76226 | 0.77616 | 0.78836 |
| SV2 | 0.71466 | 0.76896 | 0.78356 | 0.70756 | 0.72146 | 0.73366 |
| M | 0.73346 | 0.78726 | 0.80186 | 0.72586 | 0.73976 | 0.75206 |
| AS | 0.79436 | 0.84866 | 0.86326 | 0.78726 | 0.80116 | 0.81336 |
Application of real data
Here, to illustrate the applicability of the GGRRM estimator in practice, we use a chemical dataset in which $n$ denotes the number of antifungal compounds and $p$ the number of molecular descriptors used as explanatory variables (29, 30). The antifungal activity is quantified by pMIC (the logarithm of the reciprocal of MIC, where MIC is the minimum inhibitory concentration against C. albicans in mM/L). In chemometrics, the quantitative structure-activity relationship (QSAR) investigation has gained significant attention. The fundamental idea behind QSAR is to model the biological activity of a group of chemical compounds in terms of their structural characteristics. Regression modeling is therefore one of the most crucial techniques for building a QSAR model. Table 7 lists the explanatory variables that were employed. All variables are numeric.
A chi-square goodness-of-fit test is first performed to determine whether the response variable follows a gamma distribution. The test yielded a statistic of 10.0286 with a p-value of 0.9117, so the gamma distribution fits this response variable well, with an estimated dispersion parameter of 0.0153. To test for multicollinearity, the gamma regression model is fitted using the estimated dispersion parameter of 0.0153 and the log link function, and the eigenvalues of the $X^\top \hat{W} X$ matrix are obtained. The resulting condition number of the data is 35422.83, which demonstrates a serious multicollinearity problem.
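An eigenvalue-based multicollinearity check of this kind can be sketched as below. The sketch assumes the condition number is the ratio of the largest to the smallest eigenvalue of the weighted cross-product matrix (some authors take the square root of this ratio); the function name and the small example matrix are illustrative and unrelated to the chemical dataset.

```python
import numpy as np

def condition_number(X, w=None):
    """Largest over smallest eigenvalue of X'WX, with W = diag(w)."""
    if w is None:
        w = np.ones(X.shape[0])  # unit weights when none are supplied
    lam = np.linalg.eigvalsh((X.T * w) @ X)
    return float(lam.max() / lam.min())

# Tiny example: X'X = diag(4, 1), so the ratio of eigenvalues is 4.
X_demo = np.array([[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
print(condition_number(X_demo))
```

Large values of this ratio indicate that at least one eigenvalue is close to zero relative to the largest, which is the situation in which the ML estimator becomes unstable.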
Table 8 lists the estimated MSE values for the MLE, GRRM, and GGRRM estimators using the various methods of estimating the matrix $K$. Table 8 shows that the F approach effectively shrinks the estimated coefficients and yields the lowest MSE of all methods. Based on the values in Table 8, the MSE of the F method was about 64.97%, 60.63%, 54.07%, 47.98%, 48.34%, 46.79%, 45.37%, 42.43%, 48.26%, 45.01%, 44.34%, 44.47%, and 45.49% lower than that of the MLE, GRRM, HK, N, TC, HSL, AH, D, SB, SV1, SV2, M, and AS estimators, respectively.
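The relative reductions can be recomputed from the MSE values reported in Table 8. The following sketch is only an arithmetic check; the dictionary simply transcribes Table 8 and the helper name `percent_reduction` is an illustrative assumption.

```python
# MSE values transcribed from Table 8.
mse = {
    "MLE": 4.3291, "GRRM": 3.8507, "HK": 3.3008, "N": 2.9147, "TC": 2.9348,
    "F": 1.5161, "HSL": 2.8492, "AH": 2.7751, "D": 2.6335, "SB": 2.9301,
    "SV1": 2.7571, "SV2": 2.7238, "M": 2.7304, "AS": 2.7816,
}

def percent_reduction(reference, improved):
    """Relative decrease of `improved` with respect to `reference`, in %."""
    return 100.0 * (reference - improved) / reference

# Reduction achieved by the F method relative to every other method.
reductions = {name: percent_reduction(value, mse["F"])
              for name, value in mse.items() if name != "F"}

for name, value in reductions.items():
    print(f"{name}: {value:.2f}%")
```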
Table 7: Description of the used explanatory variables
| Variable name | Description |
|---|---|
| SpMax3_Bh(s) | largest eigenvalue n. 3 of Burden matrix weighted by I-state |
| P_VSA_e_3 | P_VSA-like on Sanderson electronegativity, bin 3 |
| IC3 | Information Content index (neighborhood symmetry of 3-order) |
| Mor21e | signal 21 / weighted by Sanderson electronegativity |
| MATS2s | Moran autocorrelation of lag 2 weighted by I-state |
| GATS4p | Geary autocorrelation of lag 4 weighted by polarizability |
| SpMax8_Bh(p) | largest eigenvalue n. 8 of Burden matrix weighted by polarizability |
| ATS8v | Broto-Moreau autocorrelation of lag 8 (log function) weighted by van der Waals volume |
| MATS7v | Moran autocorrelation of lag 7 weighted by van der Waals volume |
| TDB08m | 3D Topological distance based descriptors - lag 8 weighted by mass |
Table 8: The estimated MSE values for the real data application
| Methods | MSE |
|---|---|
| MLE | 4.3291 |
| GRRM | 3.8507 |
| HK | 3.3008 |
| N | 2.9147 |
| TC | 2.9348 |
| F | 1.5161 |
| HSL | 2.8492 |
| AH | 2.7751 |
| D | 2.6335 |
| SB | 2.9301 |
| SV1 | 2.7571 |
| SV2 | 2.7238 |
| M | 2.7304 |
| AS | 2.7816 |