The Generalized Estimating Equations Biology Essay

Published:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

where is the intercept, is the regression coefficient for each corresponding predictor variable, and is the error of the prediction. We deviate from ordinary linear regression to logistic regression because the response variable is binary using a suitable transformation. In this case the most suitable transformation is the logistic or logit function of pleading to the model.

The logit of a probability is simply the log of the odds of the response taking the value one. The above Equation (3.12) can be rewritten as

The logistic regression model indirectly models the response variable based on probabilities associated with the values of . The logit function can take any real value, but the associated probability always lies in the required interval. In a logistic regression model, the parameter associated with explanatory variable is such that is the odds that the response variable takes the value one when increases by one, conditional on the other explanatory variables remaining constant. The parameters of the logistic regression model (the vector of regression coefficients are estimated by maximum likelihood.

Logistic regression which is GLM was used but since it cannot account for the correlation we extend the GLM by using GEE to estimate the parameters but specify the correlation structure. GEE is just an implementing model (marginal model). The parameters are estimated using different correlation structures and QIC is used to choose which correlation structure is giving the least QIC using GEE and then fit it to the logistic regression model. LT50 was estimated using repeated measures logistic regression which uses GEE

Consider equation (3.12) having time as an explanatory variable and the response variable as the proportion of mortality

(3.14)

The LT50 is, by definition, the time at which equals 0.5. by substituting with 0.5 in equation (3.14) we get

(3.15)

3.4 Generalized Estimating Equations

To use GEE in estimating, we have three-part specification; the conditional expectation of each response, the conditional variance of each given the covariates and the covariance (correlation) matrix. Let's define the marginal regression model to be:

(3.16)

Where is a vector of covariates, consists of the p regression parameters of interest, is the link function, and denotes the outcome for the subject . In GEE, are not necessarily independent, are not necessarily normal, is the correlation matrix. Common choices for the link function are;

[for count data] ,

The mean model (conditional expectation) can also be defined by:

(3.17)

We then define the model for the correlation.

Assuming no missing data, the covariance matrix for Y is modeled as:

(3.18)

The over-dispersion parameter is estimated using the formula

(3.19)

Where is the number of measurements and is the number of regression parameters. The square root of the over dispersion parameter is called the scale parameter.

The regression model (score) is given by

(3.20)

Where is a glm dispersion parameter - allows for over dispersion, is a diagonal matrix of variance functions , is the matrix of derivatives , is the "working" covariance matrix of , and is the working correlation matrix of .

Given a mean model, , and variance structure, , ("working" covariance matrix of ), the parameter estimates will be given by solving and are typically obtained via the Newton-Raphson algorithm, otherwise we need iterations to solve .

3.4.1 Newton-iteration

Fitting algorithm becomes an instance of the Newton algorithm to solving a system of equations

Compute an initial estimate of Î² from a GLM (i.e.by assuming independence)

Compute an estimate of of the working correlation on the basis of the current Pearson residuals (standardized residuals), the current estimate of and the assumed structure of

Compute an estimate of covariance as

(3.21)

Compute an updated estimate of based on the Newton-step

(3.22)

Repeat/Iterate 2-4 until convergence

GEE works best if the numbers of observations per subject is small and the number of the subjects is large, and also if in the longitudinal studies the measurements are taken at the same time for all subjects.

3.4.2 Estimation of the covariances of

There are two classical ways of estimating the covariance

3.4.2.1 The model based estimate

The model-based estimator of the covariance matrix of Î² is given by

(3.23)

(3.24)

can also be written as

(3.25)

Here consistently estimates if the mean model and the working correlation are correct.

3.4.2.2 Empirical-sandwich estimate

The empirical or robust estimator of the covariance matrix of is given by

(3.26)

(3.27)

can also be written as

(3.28)

Here is a consistent estimator of even if the working correlation is misspecified, i.e, . In computing, and are replaced by estimates, and is replaced by the estimate

3.4.3 The working correlation matrix

Let be an "working" correlation matrix that is fully specified by the vectors of parameters . The covariance matrix of can is modeled as

(3.28)

where is an diagonal matrix with as the diagonal element. If is the true correlation matrix of , then will be the true variance matrix of . The working correlation is usually unknown and must be estimated. It is estimated in the iterative fitting process by using the current value of the parameter vector to compute appropriate functions of the pearson's residuals

(3.29)

Choices for the correlation structure within GEE:

Independent: . For , identity matrix, the GEE reduces to independence estimating equation.

Exchangeable: It's when all measurements on the same units are equally correlated.

Autoregressive, AR(1): At times the correlation depends on time or distance between measurements and . It's possible for repeated measures when the correlation is known to decline over time.

Unstructured correlation: This is when there are no assumptions about the correlations Ï1n.

M-dependent: Pairs of data elements separated by consecutive repeated measurements have a common correlation coefficient.

There are a number of issues guiding the choice of correlation structures. If the number of observations per cluster is small in a balanced and complete design, then an unstructured matrix is recommended. For data sets with miss-timed measurements, then you consider a model where the correlation is a function of the time between observations (i.e., Mdependent or auto-regressive). For data sets with clustered observations, there may be no logical ordering for observations within a cluster and an exchangeable structure may be most appropriate. For both the independence working structure and the fixed working structure, no estimation of Î± is performed (for the fixed structure, the user must specify a matrix mat). The use of the exchangeable (also referred to as compound symmetry) working correlation matrix with measured data and identity link function is equivalent to a random effects model with a random intercept per cluster. The most popular form of inference on GEE regression parameters is the Wald test.

3.4.4 Choosing the Best Model in GEE

There are three ways to choose the best model in GEE, they include the likelihood ratio test which is a chi-square value, the deviance ratio and the information criteria test (Quasi-likelihood information criterion). Quasi-likelihood Information Criterion (QIC) is an extension of the Akaike Information Criterion (AIC) to apply to models fit by GEE. QIC was used to choose the best correlation structure. which is a function of , was used to choose the best correlation structure and was a measure that was used to determine the best subsets of covariates for a particular model. In both case, the best model was the one with the smallest value.

3.4.5 Delta Method

Delta Method was used to estimate the variance so as to get the confidence intervals of the LT50 estimates. This is a method of finding approximations based on Taylor series expansions to the variance of functions of random variables (whose variances are known). For estimators, the delta method is important because while the theoretical variance of a mean of dose is known, there is no general theoretical expression for the variance of most functions of a mean, such as the inverse of a mean, or the ratio of two means. There is no exact expression for the variance of a ratio of two random variables in terms of the variances and covariance of the random variables.

The Taylor series expansion of a function about a value is given as:

(3.30)

where we can often drop the higher order terms to give the approximation:

(3.34)

If we let , the mean of , a Taylor series expansion of about gives the approximation:

(3.31)

Taking the variance of both sides yields:

(3.32)

Therefore, if is any function of a random variable , we need only calculate the variance of and the first derivative of the function to approximate the variance of .

Two-Variable Taylor Series Expansion for random variables of about the values is given by:

(3.33)

(3.34)

where is the correlation between X and Y

(3.35)

(3.36)

(3.37)

therefore,

(3.38)

If we let

then

(3.39)

and hence an approximate 95% confidence interval for the LT50 is given by

(3.40)

3.6 Comparing the two models (GEE and Survival Analysis)

The estimates and the confidence intervals from the two approaches were compared. The method that had the lesser confidence interval was the best method.

4.1 Sample description

The data used in this study was from a lab experiment using 50 subjects of insects on five different concentration levels with five extracts. It was replicated three times at six-twelve hour time intervals. The study used three extracts (B, C, E) and four dose levels (12.5, 50, 250, 500)

4.2 Comparing extracts

Box plots were used to compare the three different extracts as shown in figure 4.1 below

Figure 4.1 Box plot for extracts B, C and E

From the graphical representation in figure 4.1 above, extracts B, C and E are different from each other in terms of insect mortality

Cochrane Q test for three or more matched groups used to test for the differences between the three extracts (B, C and E). From the results, the test statistics Q was 8.9385 and the p-value was 0.03647. Since the p-value was less than the alpha, the null hypothesis for difference between the extracts was rejected and we concluded that the three extracts are significantly different from each other.

4.3 Effectiveness of extract and dose

Logistic regression was used to evaluate the overall effect of the extracts (E, C, B) and dose levels (12.5, 50, 250, 500) at each time point.

72hrs

Intercept

-2.13

(<0.001)

-1.86

(<0.001)

-1.46

(<0.001)

-1.08

(<0.001)

-0.82

(<0.001)

-0.54

(<0.001)

Extract C

-0.25

(<0.001)

-0.07

(<0.001)

-0.38

(<0.001)

-0.37

(<0.001)

-0.34

(<0.001)

-0.37

(<0.001)

Extract E

0.06

(<0.001)

0.15

(<0.001)

-0.02

(<0.001)

-0.01

(<0.001)

0.001

(<0.001)

-0.05

(<0.001)

log(dose)

0.28

(<0.001)

0.23

(<0.001)

0.24

(<0.001)

0.18

(<0.001)

0.14

(<0.001)

0.10

(<0.001)

Table 4.1 Estimates of the regression parameters and corresponding p-values

From Table 4.1 above, the extracts and the doses had significant effect on mortality at all time points (12 hrs, 24hrs, 36hrs, 48hrs, 60hrs, and 72hrs). Extract B was taken as the reference level while making inferences on the levels of extracts and dose levels

The chi-squared test statistics of 86.2 with two degrees of freedom is associated with p-value <0.001 indicated that the overall effect of the extracts was statistically significant. Null deviance was 675.7 on 215 degrees of freedom, Residual deviance was 324.8 on 212 degrees of freedom and the AIC was 1356.4.

When biological responses are plotted against their causal stimuli (or logarithms of them) they often form a sigmoid curve. The predicted probabilities values for extracts B, C, and E showed that the effect on the logit of increasing log(dose) by one unit is the same no matter where the starting point is. The effect on the probability depends on the initial level of log(dose).

Figure 4.2 Predicted probabilities for extracts B, C and E

4.2 LD50 at each time point

The LD50 of the extract at each time point was estimated using logistic regression analysis. Probit analysis and logistic regression are used to model the relationship between a stimulus, like a drug, and a response that may be either success or failure. Logistic regression is one of the standard method of analyzing dose response data where by probit and logit sigmoid dose/stimulus response curves are fitted and for standard errors and confidence intervals for dose-response such as LD50 are calculated.

The data set was sorted by different time intervals for the extract and for the dose levels. We used time at 12hrs as an example to show how to conduct the analysis. The model was fitted and the estimates for the extracts C (-0.4381961), E (0.1267465), log-dose (0.4723418) and the intercept (-2.3754586) are given with extract B taken as the reference level. The log-LD50 and the 95% confidence interval were then calculated from where the results were exponentiated to get the LD50 estimate for extracts C, B and E. the same procedure was then repeated for the other time intervals and the results were summarized in table 4.2 below.

72hrs

Extracts

LD50

(95% CI)

LD50

(95% CI)

LD50

(95% CI)

LD50

(95% CI)

LD50

(95% CI)

LD50

(95% CI)

B

152.80

(106.29-219.66)

133.62

(90.13-198.10)

17.87

(13.20-24.20)

8.85

(5.99-13.06)

4.54

(2.91-7.09)

1.69

(0.91-3.14)

C

386.38

(256.96-580.99)

185.14

(123.73-277.03)

101.31

(78.70-130.41)

68.41

(51.81-90.31)

44.48

(39.93-58.29)

32.02

(23.93-42.85)

E

116.84

(81.64-167.21)

64.16

(43.18-95.35)

19.45

(14.43-26.22)

9.16

(6.22-13.48)

4.44

(2.84-6.96)

2.30

(1.30-4.07)

Table 4.2 LD50 estimates at each time point together with their CI

Table 4.2 above shows the LD50 estimates for each extract together with their 95% CI at each time point. These estimates are not of importance because the correlation aspect of time has not been accounted for. But since we are interested in the speed of kill, these LD50 estimates for the different extracts will not give meaningful results to us. From the above table extract B at 72hrs is stronger since it has the lowest LD50 value. LD50 value for extract C at 12 hours has the highest value hence it's the less effective dose. Generally the LD50 values ranged between 1.69 and 386.38. These results have been estimated while ignoring the correlation aspects of the time.

Since we are interested in the speed of kill, estimating LD50 at each time point is not important, we therefore go ahead and estimate LT50 but the analysis should not be at each time point.

4.3 Estimating LT50 using repeated measures logistic regression (GEE)

Logistic regression model for repeated measures using GEE was used because there was correlation of observations due to time. GEE is an implementing method for non-normal data where observations are correlated and is used to estimate the parameters and . Unstructured correlation matrix was used to extract the variances and the covariances. These parameter estimates are then used to construct the confidence interval using delta method

The data set was sorted by the extract and for each dose level and extract C dose level 12.5 (appendix 2) was used as our example to illustrate the repeated measure logistic regression procedure. The correlation structure for the variance covariance matrix used was the unstructured and since the data was binary it belongs to the binomial family and the link function used was the logit. From the fitted model, the estimates of the intercept and time are given. The estimate of Intercept was alpha while the estimate of time was the beta. In this example -2.0581 was alpha while 0.0280 was beta.

The variance covariance (empirical) matrix was extracted from the regression with the parameters labeled as variance of alpha, covariance of alpha and beta, and variance of beta. In this case the variance covariance matrix was

Where, is 0.013143, is -0.000233 and variance of beta is 0.00000444

We use these parameters to estimate LT50, standard deviation of LT50 and the lower and the upper limits for LT50

LT50 estimate was 74.2

Where

The standard deviation of LT50 was found to be equal to 1.98. The lower and upper limits for LT50 was then calculated using LT50 value and SD(LT50) value and was found to be 70.3 for lower limit and 78.1 for upper limit. The procedure was then repeated for the remaining doses for extract C and other doses for extracts B and E. the table below gives the summary of the analysis

95% CI for LT50

B

12.5

52.1

50.5 - 53.7

B

50

23.0

16.8 - 29.3

B

250

42.9

13.2 - 72.6

B

500

10.3

1.51 - 19.1

C

12.5

74.2

70.3 - 78.1

C

50

43.4

42.0 - 44.7

C

250

21.5

19.1 - 23.9

C

500

7.2

4.3 - 10.1

E

12.5

55..0

52.6 - 57.3

E

50

16.6

11.57 - 21.7

E

250

12.2

6.16 - 18.2

E

500

10.3

9.5 - 11.3

Table 4.3 LT50 Estimates from repeated measures logistic using GEE

The lethal time (LT50) ranged between 10.3 hrs to 52.1 hrs for extract B; 7.2 hrs to 74.2 hrs for extract C and between 10.3 hrs to 55 hrs for extract E. The LT50 values for the different concentration levels ranged between 52.1 hrs to 74.2 hrs for concentration 12.5; 16.6 hrs to 43.4 hrs for concentration 50; 12.2 hrs to 42.9 hrs for concentration 250; and between 7.2 hrs to 10.3 hrs for concentration 500. GEE model fits better estimates because the standard errors are smaller and the confidence levels are not wider.

From table 4.3 above dose 500 was the best dose since it has the lower LT50 value across all the extracts.

4.4 Estimating Median Survival time using Kaplan-Meier

Kaplan-Meier estimators, one of the Survival Analysis techniques was used to determine the median survival time for each dose in each extract since the event of interest was time to death.

The data set was sorted by extract and the analysis stated with creating survival object (appendix 1). The interest was time after insects' exposure to different dose levels and the status of the insects whether dead or alive. In our case when an event occurred then censoring was equal to zero. The groups were the different dose levels for the different extracts and we compared how their median survival times differed. This was shown graphically by figures 4.2, 4.3 and 4.4. Formal test for the differences in the survival curves using logrank test was done. The table summary for the analysis and figures are presented below

The median survival time was taken as the time taken to kill half of the population as presented graphically in the figures below

Figure 4.3 Kaplan-Meier survival plots for extracts B

Figure 4.4 Kaplan-Meier survival plots for extracts C

Figure 4.5 Kaplan-Meier survival plots for extract E

The above figures (4.1 to 4.3) show the Kaplan-Meier survival plots for the different concentrations in extracts B, C and E. In all the cases, concentration 500 is the best dose since it falls faster in the graph meaning that it's taking shorter time to kill than other doses. The log-rank test was used to find out how the different survival curves are different from each other. In the three different plots for the three different extracts, the log-rank test for the survival curves showed that the survival curves were significantly different from each other since p-value < 0.0001 for each of the extracts. Table 4.4 below gives the summary table for the median survival time estimates

95% CI for median

B

12.5

60

(48-NA)

B

50

36

(12-36)

B

250

12

(12-36)

B

500

12

(12-36)

C

12.5

72

(60-NA)

C

50

54

(24-NA)

C

250

30

(12-NA)

C

500

24

(12-36)

E

12.5

60

(48-NA)

E

50

24

(12-36)

E

250

12

(12-36)

E

500

12

(12-24)

Table 4.4 Median Survival Time (LT50) from Kaplan-Meier Estimator

From table 4.4 above, dose 500 was the best since it takes the shortest time to kill 50% of the population in all the extracts.

4.5 Comparing GEE and Survival Estimates

The estimates and the confidence intervals of the estimated parameters from the two techniques were compared as shown in table 4.5 below.

Survival (Kaplan-Mier)

GEE

Extract

Dose

Estimate

95% CI

Estimate

95% CI

B

12.5

60

(48-NA)

52.1

50.5 - 53.7

B

50

36

(12-36)

23.0

16.8 - 29.3

B

250

12

(12-36)

42.9

13.2 - 72.6

B

500

12

(12-36)

10.3

1.51 - 19.1

C

12.5

72

(60-NA)

74.2

70.3 - 78.1

C

50

54

(24-NA)

43.4

42.0 - 44.7

C

250

30

(12-NA)

21.5

19.1 - 23.9

C

500

24

(12-36)

7.2

4.3 - 10.1

E

12.5

60

(48-NA)

55..0

52.6 - 57.3

E

50

24

(12-36)

16.6

11.57 - 21.7

E

250

12

(12-36)

12.2

6.16 - 18.2

E

500

12

(12-24)

10.3

9.5 - 11.3

Table 4.5 Comparing GEE and survival estimates

Survival analysis gave unbounded and larger confidence intervals compared to those of GEE. The standard errors for GEE were better than those of the Survival as shown by the ranges of their confidence intervals. In summary, bigger standard errors showed that the estimates were not better.

5.1 Discussion

This paper discusses survival analysis and repeated measures logistic regression using GEE approaches for estimating LT50 in repeated measures dose response studies. The study found out that probit analysis at each time point give results which do not take into account the correlation aspects of the data. The study also found out that time is a correlation factor since several measurements are taken at different time intervals. When time is a correlated and if we are interested in the speed of kill, then estimating LD50 at each time point doesn't give correct estimates.

The LT50 estimated corresponds to specific extracts and the different concentration levels for the doses. In both approaches dose 500 was the best since it took shorter time to kill half of the insects' population. There is a strong body of knowledge regarding the lethal effects of doses in that the stronger the concentration level the more effective the dose is and this might have reflected the estimated LT50 for the different doses. Further research should be done to ascertain the claim of the estimated LT50 to rule out if the estimates may have been affected by some other factor.

The results of the study were compared with results from the same methods but applied in a different setting to show that the methods were versatile for analyzing repeated measures dose response data. The results of the study were compared with those from the standard dose measurement (logistic and time-dose-mortality model) and found that the two methods (GEE and survival) estimated LT50 just as the time-dose-mortality model and logistic regression in terms of using parameters such as time, dose and proportion of dead insects. However, the issue of time as a correlated factor was not taken into account. This showed that the study methods estimated LT50 by taking into account the correlation aspect while the other methods estimated LT50 at each time points. The same results also showed that higher concentration doses are more effective than lower concentration doses.

The LT50 results from the two study methods are similar to those from LT50 estimation method which were used by Bugeme et al. (2009) and Borsuk et al (2002) since they used repeated measure logistic regression using GEE and Kaplan-Meier survival function. They were used to estimate the LT50 for virulence and survival time for the honey bees respectively. The LT50, the median survival time and confidence levels of the estimates were given. The graphical representation of the estimates of the median survival time was also the same as those shown by Borsuk et al (2002) and Hansen and Lambert, (2003). In all the extracts the log-rank test was used to test for the differences in the different survival curves as used by Hansen and Lambert, (2003). Form the study results all the different survival curves were significantly different from each other.

Just as other approaches of estimating doses, this research approach started by first evaluating if the extracts were significantly different from each other and if the doses and the extracts had significant effect on insect mortality (Finney, 1971; Preisler and Robertson, 1989). Only significant extracts and doses can be used to estimate the doses. If the extracts and the doses had effect on insect mortality then the LD50 estimates are estimated at different time points. Logistic regression which used logit or probit as a link function produced different estimates of LD50 for the three extract B, C and E at each time points as used by Sahaf and Moharrmipour (2008). But since the interest is in the speed of kill the LD50 estimates do not end up giving meaningful results since it ignores the correlation aspect which is brought about by the time, hence need to estimate LT50 using GEE and survival analysis.

The data sets were binary in nature since we are only interested if the insect was dead or not. This therefore enabled us to use the logistic regression but also since time is correlation factor we used the GEE to get the estimates of alpha and beta which are then used to estimate the LT50. Delta method also uses the estimates to extract the variance covariance matrix which are then used to calculate variances. The estimated variances were then used to construct the confidence intervals of the estimates. Survival analysis was used because the interest was the time to event (insect mortality) and since the data was skewed we used the non-parametric approach for estimating the median survival time. In both cases the datasets were prepared differently for the different analysis.

The two study methods were effective since they took into account correlation aspect in account while estimating the LT50 and estimate the time taken by each dose to kill half of the insects' population. In other words they seem effective if the interest is in the speed of kill.

This research study is uniquely different from other approaches of estimating LT50. This is due to the procedure of analysis has been shown using R free statistical software. There is a step-wise procedure on how to analysis repeated measures

The estimates of Kaplan-Meier and repeated measures GEE were compared in terms of confidence intervals and the estimates themselves. Repeated measures logistic regression gave the smaller/ lesser confidence interval compared to those form the Kaplan-Meier estimate. The confidence intervals from the Kaplan-Meier estimate gave unbounded intervals. This therefore means that GEE has better estimates but further research should be done to ascertain our claim.

The implications of these findings are that there is need to improve on the way repeated measures data is being analyzed by adopting or using the methods (GEE or Kaplan-Meier) in analysis. In addition, the kind of the information will be important in guiding how to analyze and interpret results from repeated measures data

Since LT50 is the problem of estimating the duration of time of kill by different doses, then survival analysis and GEE can be used effectively to estimates the effect of doses in dose response studies. The only methods that are usually used estimates LD50 at each time point which ignores the correlation aspect brought about by time when different time points are used.

One criticism of these approaches is that the exact times of kill is not known since time is used cumulatively to estimate if the insect has been killed at a particular time point. To address this concern effective data collection methods and use of existing methods of estimating LT50 should be used in a complementary fashion.

In this study one survival analysis method was used (Kaplan-Meier) to estimate LT50. In addition, unstructured correlation matrix was the only one used in repeated measures logistic regression using GEE. Wider comparisons should be considered to make the research more representative.

5.2 Conclusions

The study has shown how recent methodological advances in estimating LT50 can be applied to some repeated measures data sets. The standards methods for estimating doses only give estimates at each time point thus ignoring the correlation aspect brought about by time.

The study results demonstrate that repeated measures logistic regression using GEE and Kaplan-Meier survival function provide a highly useful alternative for analysis of repeated measures dose response data.

It's good in the era of modern statistics to explore different methods that give better estimates and use better terms in explaining the analysis in details. The analysis indicated that repeated measures logistic regression is a better method of estimating LT50 since it gave estimates with a smaller gap in their confidence interval.

5.3 Recommendation

Kaplan-Meier survival function and repeated measures logistic regression using GEE are effective tools for analyzing repeated measures dose response data.

Since repeated measures logistic regression using GEE in estimating LT50 has given estimates with smaller confidence level while taking into account the correlation aspect of time; it should therefore be used in complementary fashion together with other existing methods in estimating LT50 so as to get meaningful results.

For further research one should try and explore other regression methods which incorporate GEE in estimating LT50. Future studies should also consider other correlation matrices in GEE so as to have a wider range of comparison. Comparison of estimates from two survival analysis methods (Kaplan-Meier estimates and the Aalen-Nelson estimates) should also be considered in future studies.