Statistical Analysis of Atlantic Cod Catches
OBJECTIVE- The question I'm trying to answer here is: how do the mean depth, the square of the mean depth, the standard deviation of depth, and latitude affect the catch of Atlantic cod in a tow?
DATA- The data consist of information on 4863 tows in the Gulf of Maine and Georges Bank region during fall, 1970-2008, collected by the Northeast Fisheries Science Center (NEFSC).
If we look at the summary of the quantitative variables:
The mean is slightly greater than the median for all four variables, which means that all four are slightly skewed to the right. Next, consider the frequency plot of the variable ycount (which stores the count of cod caught in a tow).
The frequency plot shows that the count data contain more than 2000 zeros among the 4863 tows, so over 40% of the observations are zeros.
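The two checks described above (mean versus median, and the share of zeros) can be sketched in a few lines. This is a minimal Python illustration, not the original analysis, which was done in R; the function name `zero_and_skew_summary` and the toy counts are hypothetical and are not the NEFSC data.

```python
from statistics import mean, median

def zero_and_skew_summary(counts):
    """Summarise a list of counts: number and share of zeros, mean, median."""
    n = len(counts)
    zeros = sum(1 for c in counts if c == 0)
    m = mean(counts)
    med = median(counts)
    return {
        "n": n,
        "zeros": zeros,
        "zero_share": zeros / n,
        "mean": m,
        "median": med,
        # mean > median is only a rough hint of right skew, not a formal test
        "right_skew_hint": m > med,
    }

# Hypothetical toy counts (assumption for illustration only).
toy_ycount = [0, 0, 0, 0, 1, 2, 0, 5, 0, 12]
summary = zero_and_skew_summary(toy_ycount)
print(summary)
```

Applied to the real ycount column, the same function would report the zero share that the frequency plot shows visually.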
MODEL-
Model1- The first model checks the effect of mean depth, the square of mean depth, the standard deviation of depth, and latitude on the variable ybin (which is 1 if Atlantic cod were caught in a given tow and 0 otherwise). The model is:
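$$\log\frac{\pi_i}{1-\pi_i} = \beta_0 + \beta_1(\text{mean-depth}) + \beta_2(\text{mean-depth})^2 + \beta_3(\text{std-dev}) + \beta_4(\text{latitude})$$
where $\pi_i = E(\text{ybin}_i)$ and $\text{ybin}_i \sim \text{Bin}(1, \pi_i)$.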
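To make the logit link concrete, the sketch below maps a linear predictor back to the probability of catching cod. It is a minimal Python illustration under stated assumptions: the function names and the coefficient values are hypothetical and are not the estimates from the R fit.

```python
import math

def linear_predictor(beta, mean_depth, std_dev, latitude):
    """eta = b0 + b1*depth + b2*depth^2 + b3*std_dev + b4*latitude."""
    b0, b1, b2, b3, b4 = beta
    return b0 + b1 * mean_depth + b2 * mean_depth ** 2 + b3 * std_dev + b4 * latitude

def inverse_logit(eta):
    """Invert eta = log(pi / (1 - pi)); written to avoid overflow for large |eta|."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

# Hypothetical coefficients (assumption, NOT the fitted values).
beta_example = (0.5, -0.01, 0.0001, 0.02, -0.1)
eta_example = linear_predictor(beta_example, mean_depth=100.0, std_dev=10.0, latitude=42.0)
pi_example = inverse_logit(eta_example)
print(eta_example, pi_example)
```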
The results from R are as follows:
All the covariates in the model are significant. Before interpreting the model, let's check its fit. Although the residual deviance is greater than the df, we can't use it to judge the fit: for ungrouped binary data the residual deviance does not follow an approximate chi-square distribution.
Diagnostic
The plot of residuals against fitted values shows a pattern, which suggests that the model is misspecified. The plot of residuals against index shows no clear pattern, but the residuals still do not look random, which points the same way. Possible reasons are missing covariates, missing interactions, or missing non-linear terms, so models with these additions could be tried. I'm not going further with this model.
Model2- The next model I tried is Poisson regression, to check the effects of mean depth, the square of mean depth, the standard deviation of depth, and latitude on the count of Atlantic cod in a tow. The model is:
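$$\log\mu_i = \beta_0 + \beta_1(\text{mean-depth}) + \beta_2(\text{mean-depth})^2 + \beta_3(\text{std-dev}) + \beta_4(\text{latitude})$$
where $\mu_i = E(\text{ycount}_i)$ and $\text{ycount}_i \sim \text{Poisson}(\mu_i)$.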
The results from R are as follows:
Here all the covariates are significant. However, comparing the residual deviance with the chi-square critical value at significance level 0.05 shows that this model is not a good fit. The residual deviance is also much greater than the df, which suggests overdispersion.
Diagnostic- If we look at the plot of mean vs variance:
Many observations lie above the straight line (where the variance equals the mean), so for them the variance is greater than the mean. This indicates overdispersion.
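A cruder numerical companion to the mean-vs-variance plot is the ratio of sample variance to sample mean, which should be close to 1 if the counts were Poisson. The sketch below is a minimal Python illustration; the function name `dispersion_ratio` and the toy counts are hypothetical, and the ratio ignores covariates, so it is only a rough screen and not a replacement for the model-based check.

```python
from statistics import mean, variance

def dispersion_ratio(counts):
    """Sample variance divided by sample mean; about 1 under a Poisson model."""
    m = mean(counts)
    if m == 0:
        raise ValueError("mean is zero; the ratio is undefined")
    return variance(counts) / m

# Hypothetical toy counts (assumption, NOT the NEFSC data).
toy_counts = [0, 0, 0, 0, 1, 2, 0, 5, 0, 12]
ratio = dispersion_ratio(toy_counts)
print(ratio)  # values well above 1 hint at overdispersion
```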
Model3- Earlier we saw that there are many zeros in the count data, so this is zero-inflated count data. The next model I tried is a zero-inflated negative binomial regression model, to handle both the overdispersion and the zero inflation.
Here, if ycount=0, we don't know whether Atlantic cod are present at that location or not. The zero-inflated negative binomial model handles this as a mixture of a Bernoulli and a negative binomial distribution:
$$Y = V(1 - B), \quad V \sim \text{negative binomial}(\mu, \alpha), \quad B \sim \text{Bin}(1, p),$$
where V and B are independent.
Here, if ybin = 0 then Prob(ycount = 0 | ybin = 0) = 1, and if ybin = 1 then ycount | ybin = 1 ~ Negative-binomial($\mu_i$, $\alpha$), where $\alpha$ is the overdispersion parameter.
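The mixture Y = V(1 − B) implies a simple probability mass function: a zero can come either from B = 1 or from the negative binomial itself, while a positive count can only come from the negative binomial. The sketch below writes this out in Python. It assumes the common mean-dispersion form of the negative binomial with variance µ + αµ², which the text does not state explicitly; the function names are hypothetical.

```python
import math

def nb_pmf(y, mu, alpha):
    """Negative binomial pmf with mean mu and variance mu + alpha * mu**2 (assumed form)."""
    r = 1.0 / alpha  # "size" parameter
    log_p = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
             + r * math.log(r / (r + mu))
             + y * math.log(mu / (r + mu)))
    return math.exp(log_p)

def zinb_pmf(y, mu, alpha, p):
    """pmf of Y = V * (1 - B), with B ~ Bin(1, p) independent of V ~ NB(mu, alpha)."""
    from_count_part = (1.0 - p) * nb_pmf(y, mu, alpha)
    # A zero arises either from B = 1 (probability p) or from V = 0.
    return p + from_count_part if y == 0 else from_count_part

# Illustrative parameter values (assumption, not fitted estimates).
print(zinb_pmf(0, mu=3.0, alpha=0.5, p=0.4))
```

With these illustrative values the probability of a zero is much larger than under the plain negative binomial, which is the feature that lets the model absorb the excess zeros in ycount.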
The summary of the model is:
Here all the covariates are significant at significance level 0.05. To find out whether the zero-inflated negative binomial model fits better than the Poisson model, I ran the Vuong test for non-nested models.
Here model 1 is the zero-inflated negative binomial model and model 2 is the Poisson model. The test result shows that the zero-inflated negative binomial model fits better than the Poisson model.
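For readers unfamiliar with the Vuong test, its basic statistic can be computed from the per-observation log-likelihoods of the two models. The sketch below is a minimal Python illustration of the uncorrected statistic (no AIC or BIC adjustment); the function names and the toy log-likelihood values are hypothetical, and the R routine used for the analysis may apply additional corrections.

```python
import math
from statistics import mean, stdev

def vuong_statistic(loglik_model1, loglik_model2):
    """sqrt(n) * mean(m) / sd(m), where m_i is the log-likelihood difference for tow i."""
    m = [a - b for a, b in zip(loglik_model1, loglik_model2)]
    n = len(m)
    return math.sqrt(n) * mean(m) / stdev(m)

def one_sided_p_value(v):
    """P(Z > v) for a standard normal Z; small values favour model 1."""
    return 0.5 * math.erfc(v / math.sqrt(2.0))

# Hypothetical per-observation log-likelihoods (assumption for illustration).
ll_model1 = [-1.0, -2.0, -1.5, -0.5]
ll_model2 = [-2.0, -2.5, -3.0, -1.0]
v = vuong_statistic(ll_model1, ll_model2)
print(v, one_sided_p_value(v))
```

A large positive statistic favours model 1 and a large negative one favours model 2; values near zero mean the test cannot separate the two models.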
To test whether the overdispersion is significant, I ran a likelihood ratio test:
The p-value is very low, hence I conclude that there is a significant amount of overdispersion in the data.
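The arithmetic behind such a likelihood ratio test is short enough to sketch. The Python illustration below compares a model without the dispersion parameter against one with it, using one degree of freedom. The text does not say which pair of models or which p-value convention was used, so both are assumptions here; the function name and the log-likelihood values are hypothetical.

```python
import math

def lr_test_overdispersion(loglik_restricted, loglik_full):
    """Return (statistic, chi-square(1) p-value, halved boundary-corrected p-value)."""
    stat = max(2.0 * (loglik_full - loglik_restricted), 0.0)
    # Survival function of chi-square with 1 df: P(Z^2 > stat) = erfc(sqrt(stat / 2)).
    p_chi2 = math.erfc(math.sqrt(stat / 2.0))
    # alpha = 0 lies on the boundary of the parameter space, so the p-value is often halved.
    p_boundary = 0.5 * p_chi2
    return stat, p_chi2, p_boundary

# Hypothetical log-likelihoods (assumption, NOT the values from the R output).
stat, p_chi2, p_boundary = lr_test_overdispersion(loglik_restricted=-10.0, loglik_full=-8.0)
print(stat, p_chi2, p_boundary)
```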
Diagnostic
In the plot of residuals against index there is no apparent pattern, but the residuals still do not look random. In the plot of residuals against fitted values there is no pattern either, except that the residuals are aggregated along a flat line. This suggests that something is still wrong with the model, which might be improved by adding covariates or transforming the existing ones.
Interpretation- In the count part, the negative coefficients for mean-depth, (mean-depth)^2, and latitude show that an increase in depth or latitude decreases the log of the mean catch, while the positive coefficient for the standard deviation of ocean depth shows that an increase in it raises the log of the mean catch. In the binary part, the negative coefficients for mean-depth and latitude show that the log odds of catching a fish decrease as mean-depth and latitude increase, while the positive coefficients for (mean-depth)^2 and the standard deviation of ocean depth show that the log odds increase with these terms.
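Because depth enters both parts of the model as a linear and a squared term, its net effect on the linear predictor depends on the depth itself. A short derivation, writing d for mean-depth and η for either linear predictor (symbols introduced here for illustration only):

```latex
\eta = \beta_0 + \beta_1 d + \beta_2 d^2 + \beta_3(\text{std-dev}) + \beta_4(\text{latitude})
\quad\Longrightarrow\quad
\frac{\partial \eta}{\partial d} = \beta_1 + 2\beta_2 d .
% Count part: \beta_1 < 0 and \beta_2 < 0, so the derivative is negative for every d > 0.
% Binary part: \beta_1 < 0 and \beta_2 > 0, so the derivative changes sign at
% d^{*} = -\beta_1 / (2\beta_2): negative for d < d^{*}, positive for d > d^{*}.
```

So in the count part deeper tows always lower the log of the mean catch, whereas in the binary part the direction of the depth effect depends on whether the depth lies below or above d*.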
CONCLUSION-
- From the exploratory data analysis, I saw that the data are zero inflated: the counts of fish caught per tow contain many zeros.
- First I fit a binary logit model to the binary data, and the model diagnostics showed that the logistic regression model is not a good fit. Further study can be done in this direction by adding interactions between covariates or transforming the covariates. Other link functions can be tested as well.
- The second model I tried is a Poisson regression model with log link. The results show that this model is not a good fit because the data are overdispersed.
- To counter the zero inflation and overdispersion, I applied a zero-inflated negative binomial model to the data and found that it fits better than the Poisson model and that there is significant overdispersion in the data. Although this model is better than the other models, the residual plots show that some covariates may be missing. For example, the data were collected in different regions and over different time periods, so spatial and temporal effects could be considered.
REFERENCES-
- Xia Wang, Ming-Hui Chen, Rita C. Kuo, and Dipak K. Dey: Bayesian Spatial-Temporal Modeling of Ecological Zero-Inflated Count Data.
- Wikipedia: Vuong's closeness test.