Zero-inflated poisson regression is used to model count data that has an excess of zero counts. Further, theory suggests that the excess zeros are generated by a separate process from the count values and that the excess zeros can be modeled independently. Thus, the zip model has two parts, a poisson count model and the logit model for predicting excess zeros.
require(pscl) library(boot) library(ggplot2)
Example 1. School administrators study the attendance behavior of high school juniors at two schools. Predictors of the number of days of absence include gender of the student and standardized test scores in math and language arts.
Example 2. The state wildlife biologists want to model how many fish are being caught by fishermen at a state park. Visitors are asked how long they stayed, how many people were in the group, were there children in the group and how many fish were caught. Some visitors do not fish, but there is no data on whether a person fished or not. Some visitors who did fish did not catch any fish so there are excess zeros in the data because of the people that did not fish.
The main motivation for zero-inflated count models is that real-life data frequently display overdispersion and excess zeros. Zero-inflated count models provide a way of modeling the excess zeros as well as allowing for overdispersion. In particular, for each observation, there are two possible data generation processes. The result of a Bernoulli trial is used to determine which of the two processes is used. For observation \(i\), Process 1 is chosen with probability \(\varphi_i\) and Process 2 with probability \(1-\varphi_i\). Process 1 generates only zero counts. Process 2 generates counts from either a Poisson or a negative binomial model. In general, \[ Y_i\sim\left\{\begin{array}{ll} 0 & \mbox{with probability }\varphi_i,\\ g(y_i) & \mbox{with probability }1-\varphi_i. \end{array}\right. \] Therefore, the probability of \(\{Y_i=y_i\}\) can be described as \[ \begin{align*} \mathsf{P}[Y_i=0|{\bf x}_i]&=\varphi_i+(1-\varphi_i)g(0),\\ \mathsf{P}[Y_i|{\bf x}_i]&=(1-\varphi_i)g(y_i),\quad y_i>0. \end{align*} \] where \(g(y_i)\) follows either the Poisson or the negative binomial distribution.
When the probability \(\varphi_i\) depends on the characteristics of observation \(i\), \(\varphi_i\) is written as a function of \({\bf z}_i^{\top}{\boldsymbol\gamma}\), where \({\bf z}_i^{\top}\) is the \(1\times (q+1)\) vector of zero-inflated covariates and \({\boldsymbol\gamma}\) is the \((q+1)\times 1\) vector of zero-inflated coefficients to be estimated. (The zero-inflated intercept is \(\gamma_0\); the coefficients for the \(q\) zero-inflated covariates are \(\gamma_1,\ldots,\gamma_q\).) The function \(F\) relating the product \({\bf z}_i^{\top}{\boldsymbol\gamma}\) (which is a scalar) to the probability \(\varphi_i\) is called the zero-inflated link function, \[ \varphi_i=F({\bf z}_i^{\top}{\boldsymbol\gamma}). \] Furthermore, the zero-inflated link function \(F\) can be specified as either the logistic function, \[ F({\bf z}_i^{\top}{\boldsymbol\gamma})=\frac{\exp\{{\bf z}_i^{\top}{\boldsymbol\gamma}\}}{1+\exp\{{\bf z}_i^{\top}{\boldsymbol\gamma}\}} \] or the standard normal cumulative distribution function (also called the probit function), \[ F({\bf z}_i^{\top}{\boldsymbol\gamma})=\Phi({\bf z}_i^{\top}{\boldsymbol\gamma})=\int_0^{{\bf z}_i^{\top}{\boldsymbol\gamma}}\frac{1}{\sqrt{2\pi}}e^{-u^2/2}du. \]
In the zero-inflated Poisson (ZIP) regression model, the data generation process referred to earlier as Process 2 is \[ g(y_i)=\frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!}, \] where \(\mu_i=e^{{\bf x}_i^{\top}{\boldsymbol\beta}}\). Thus the ZIP model is defined as \[ \begin{align*} \mathsf{P}[Y_i=0|{\bf x}_i,{\bf z}_i]&=F({\bf z}_i^{\top}{\boldsymbol\gamma})+(1-F({\bf z}_i^{\top}{\boldsymbol\gamma}))e^{-\mu_i},\\ \mathsf{P}[Y_i|{\bf x}_i,{\bf z}_i]&=(1-F({\bf z}_i^{\top}{\boldsymbol\gamma}))\frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!},\quad y_i>0. \end{align*} \]
The conditional expectation and conditional variance of \(Y_i\) are given by \[ \begin{align*} \mathsf{E}[Y_i|{\bf x}_i,{\bf z}_i]&=\mu_i(1-F({\bf z}_i^{\top}{\boldsymbol\gamma})),\\ \mathsf{Var}[Y_i|{\bf x}_i,{\bf z}_i]&=\mathsf{E}[Y_i|{\bf x}_i,{\bf z}_i](1+\mu_iF({\bf z}_i^{\top}{\boldsymbol\gamma})). \end{align*} \]
Note that the ZIP model (as well as the ZINB model) exhibits overdispersion since \(\mathsf{Var}[Y_i|{\bf x}_i,{\bf z}_i]>\mathsf{E}[Y_i|{\bf x}_i,{\bf z}_i]\).
We have data on 250 groups that went to a park. Each group was questioned about how many fish they caught (count
), how many children were in the group (child
), how many people were in the group (persons), and whether or not they brought a camper to the park (camper
).
In addition to predicting the number of fish caught, there is interest in predicting the existence of excess zeros, i.e., the probability that a group caught zero fish. We will use the variables child
, persons
, and camper
in our model. Let's look at the data.
zinb <- read.csv("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/fish.csv") zinb <- within(zinb, { nofish <- factor(nofish) livebait <- factor(livebait) camper <- factor(camper) }) summary(zinb)
## nofish livebait camper persons child ## 0:176 0: 34 0:103 Min. :1.000 Min. :0.000 ## 1: 74 1:216 1:147 1st Qu.:2.000 1st Qu.:0.000 ## Median :2.000 Median :0.000 ## Mean :2.528 Mean :0.684 ## 3rd Qu.:4.000 3rd Qu.:1.000 ## Max. :4.000 Max. :3.000 ## xb zg count ## Min. :-3.275050 Min. :-5.6259 Min. : 0.000 ## 1st Qu.: 0.008267 1st Qu.:-1.2527 1st Qu.: 0.000 ## Median : 0.954550 Median : 0.6051 Median : 0.000 ## Mean : 0.973796 Mean : 0.2523 Mean : 3.296 ## 3rd Qu.: 1.963855 3rd Qu.: 1.9932 3rd Qu.: 2.000 ## Max. : 5.352674 Max. : 4.2632 Max. :149.000
## histogram with x axis in log10 scale ggplot(zinb, aes(count)) + geom_histogram() + scale_x_log10()
Below is a list of some analysis methods you may have encountered. Some of the methods listed are quite reasonable while others have either fallen out of favor or have limitations.
Though we can run a Poisson regression in R using the glm
function in one of the core packages, we need another package to run the zero-inflated poisson model. We use the pscl
package.
summary(m1 <- zeroinfl(count ~ child + camper | persons, data = zinb))
## ## Call: ## zeroinfl(formula = count ~ child + camper | persons, data = zinb) ## ## Pearson residuals: ## Min 1Q Median 3Q Max ## -1.2369 -0.7540 -0.6080 -0.1921 24.0847 ## ## Count model coefficients (poisson with log link): ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 1.59789 0.08554 18.680 <2e-16 *** ## child -1.04284 0.09999 -10.430 <2e-16 *** ## camper1 0.83402 0.09363 8.908 <2e-16 *** ## ## Zero-inflation model coefficients (binomial with logit link): ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 1.2974 0.3739 3.470 0.000520 *** ## persons -0.5643 0.1630 -3.463 0.000534 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Number of iterations in BFGS optimization: 12 ## Log-likelihood: -1032 on 5 Df
Below the model call, you will find a block of output containing Poisson regression coefficients for each of the variables along with standard errors, z-scores, and p-values for the coefficients. A second block follows that corresponds to the inflation model. This includes logit coefficients for predicting excess zeros along with their standard errors, z-scores, and p-values.
All of the predictors in both the count and inflation portions of the model are statistically significant. This model fits the data significantly better than the null model, i.e., the intercept-only model. To show that this is the case, we can compare with the current model to a null model without predictors using chi-squared test on the difference of log likelihoods.
mnull <- update(m1, . ~ 1) pchisq(2 * (logLik(m1) - logLik(mnull)), df = 3, lower.tail = FALSE)
## 'log Lik.' 4.041242e-41 (df=5)
Note that the model output above does not indicate in any way if our zero-inflated model is an improvement over a standard Poisson regression. We can determine this by running the corresponding standard Poisson model and then performing a Vuong test of the two models.
summary(p1 <- glm(count ~ child + camper, family = poisson, data = zinb))
## ## Call: ## glm(formula = count ~ child + camper, family = poisson, data = zinb) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -3.7736 -2.2293 -1.2024 -0.3498 24.9492 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 0.91026 0.08119 11.21 <2e-16 *** ## child -1.23476 0.08029 -15.38 <2e-16 *** ## camper1 1.05267 0.08871 11.87 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for poisson family taken to be 1) ## ## Null deviance: 2958.4 on 249 degrees of freedom ## Residual deviance: 2380.1 on 247 degrees of freedom ## AIC: 2723.2 ## ## Number of Fisher Scoring iterations: 6
vuong(p1, m1)
## Vuong Non-Nested Hypothesis Test-Statistic: 1.947207 ## (test-statistic is asymptotically distributed N(0,1) under the ## null that the models are indistinguishible) ## in this case: ## model1 > model2, with p-value 0.025755
We can get confidence intervals for the parameters and the exponentiated parameters using bootstrapping. For the Poisson model, these would be incident risk ratios, for the zero inflation model, odds ratios. We use the boot
package. First, we get the coefficients from our original model to use as start values for the model to speed up the time it takes to estimate. Then we write a short function that takes data and indices as input and returns the parameters we are interested in. Finally, we pass that to the boot
function and do 1200 replicates, using snow to distribute across four cores. Note that you should adjust the number of cores to whatever your machine has. Also, for final results, one may wish to increase the number of replications to help ensure stable results.
dput(coef(m1, "count"))
## structure(c(1.59788828690411, -1.04283909332231, 0.834023618148891 ## ), .Names = c("(Intercept)", "child", "camper1"))
dput(coef(m1, "zero"))
## structure(c(1.29744027908309, -0.564347365357873), .Names = c("(Intercept)", ## "persons"))
f <- function(data, i) { require(pscl) m <- zeroinfl(count ~ child + camper | persons, data = data[i, ], start = list(count = c(1.598, -1.0428, 0.834), zero = c(1.297, -0.564))) as.vector(t(do.call(rbind, coef(summary(m)))[, 1:2])) } set.seed(10) res <- boot(zinb, f, R = 1200, parallel = "snow", ncpus = 4) ## print results res
## ## ORDINARY NONPARAMETRIC BOOTSTRAP ## ## ## Call: ## boot(data = zinb, statistic = f, R = 1200, parallel = "snow", ## ncpus = 4) ## ## ## Bootstrap Statistics : ## original bias std. error ## t1* 1.59788855 -0.056660729 0.30306625 ## t2* 0.08553816 0.004256622 0.01670207 ## t3* -1.04283849 -0.002509802 0.40556584 ## t4* 0.09998829 0.004395386 0.01539300 ## t5* 0.83402218 0.017178157 0.40465431 ## t6* 0.09362679 0.004581183 0.01535966 ## t7* 1.29743916 0.020810017 0.48058191 ## t8* 0.37385225 0.008224433 0.03661806 ## t9* -0.56434716 -0.030102634 0.26673442 ## t10* 0.16296381 0.005272244 0.02980715
zeroinfl
.
Now we can get the confidence intervals for all the parameters. We start on the original scale with percentile and bias adjusted CIs. We also compare these results with the regular confidence intervals based on the standard errors.
## basic parameter estimates with percentile and bias adjusted CIs parms <- t(sapply(c(1, 3, 5, 7, 9), function(i) { out <- boot.ci(res, index = c(i, i + 1), type = c("perc", "bca")) with(out, c(Est = t0, pLL = percent[4], pUL = percent[5], bcaLL = bca[4], bcaLL = bca[5])) })) ## add row names row.names(parms) <- names(coef(m1)) ## print results parms
## Est pLL pUL bcaLL ## count_(Intercept) 1.5978885 0.87925947 2.07809608 1.087354217 ## count_child -1.0428385 -1.75091143 -0.17531081 -1.618508766 ## count_camper1 0.8340222 0.05959868 1.62653099 0.001571012 ## zero_(Intercept) 1.2974392 0.35031809 2.21983762 0.293576629 ## zero_persons -0.5643472 -1.10869954 -0.07847424 -1.008525816 ## bcaLL ## count_(Intercept) 2.226142826 ## count_child -0.022034643 ## count_camper1 1.599950663 ## zero_(Intercept) 2.120703763 ## zero_persons 0.006329548
confint(m1)
## 2.5 % 97.5 % ## count_(Intercept) 1.4302366 1.7655400 ## count_child -1.2388125 -0.8468657 ## count_camper1 0.6505185 1.0175288 ## zero_(Intercept) 0.5647033 2.0301772 ## zero_persons -0.8837505 -0.2449442
Now we can estimate the incident risk ratio (IRR) for the Poisson model and odds ratio (OR) for the logistic (zero inflation) model. This is done using almost identical code as before, but passing a transformation function to the h
argument of boot.ci
, in this case, exp
to exponentiate.
## exponentiated parameter estimates with percentile and bias adjusted CIs expparms <- t(sapply(c(1, 3, 5, 7, 9), function(i) { out <- boot.ci(res, index = c(i, i + 1), type = c("perc", "bca"), h = exp) with(out, c(Est = t0, pLL = percent[4], pUL = percent[5], bcaLL = bca[4], bcaLL = bca[5])) })) ## add row names row.names(expparms) <- names(coef(m1)) ## print results expparms
## Est pLL pUL bcaLL bcaLL ## count_(Intercept) 4.9425854 2.4091162 7.9892442 2.9664152 9.2640640 ## count_child 0.3524528 0.1736157 0.8391964 0.1981940 0.9782063 ## count_camper1 2.3025614 1.0614130 5.0862000 1.0015722 4.9527881 ## zero_(Intercept) 3.6599122 1.4195204 9.2058363 1.3412160 8.3370027 ## zero_persons 0.5687313 0.3299878 0.9245259 0.3647563 1.0063496
expand.grid
function to create all combinations and then the predict
function to do it. We also remove any rows where the number of children exceeds the number of persons, which does not make sense logically, using the subset
function. Finally we create a graph.
newdata1 <- expand.grid(0:3, factor(0:1), 1:4) colnames(newdata1) <- c("child", "camper", "persons") newdata1 <- subset(newdata1, subset=(child<=persons)) newdata1$phat <- predict(m1, newdata1)
ggplot(newdata1, aes(x = child, y = phat, colour = factor(persons))) + geom_point() + geom_line() + facet_wrap(~camper) + labs(x = "Number of Children", y = "Predicted Fish Caught")
exposure()
option.