Zero-inflated Poisson regression

Zero-inflated Poisson regression is used to model count data that contain an excess of zero counts. Further, theory suggests that the excess zeros are generated by a separate process from the count values and that the excess zeros can be modeled independently. Thus, the ZIP model has two parts: a Poisson count model and a logit model for predicting excess zeros.

library(pscl)     ## zeroinfl() and vuong() for zero-inflated models
library(boot)     ## bootstrap confidence intervals
library(ggplot2)  ## graphics

Examples of Zero-Inflated Poisson regression

Theoretical background

Zero-Inflated Count Regression Overview

The main motivation for zero-inflated count models is that real-life data frequently display overdispersion and excess zeros. Zero-inflated count models provide a way of modeling the excess zeros in addition to allowing for overdispersion. In particular, for each observation there are two possible data generating processes, and the result of a Bernoulli trial determines which one is used. For observation \(i\), Process 1 is chosen with probability \(\varphi_i\) and Process 2 with probability \(1-\varphi_i\). Process 1 generates only zero counts, whereas Process 2 generates counts from either a Poisson or a negative binomial model. In general, \[ Y_i\sim\left\{\begin{array}{ll} 0 & \mbox{with probability }\varphi_i,\\ g(y_i) & \mbox{with probability }1-\varphi_i. \end{array}\right. \] Therefore, the distribution of \(Y_i\) can be described as \[ \begin{align*} \mathsf{P}[Y_i=0|{\bf x}_i]&=\varphi_i+(1-\varphi_i)g(0),\\ \mathsf{P}[Y_i=y_i|{\bf x}_i]&=(1-\varphi_i)g(y_i),\quad y_i>0, \end{align*} \] where \(g(y_i)\) follows either the Poisson or the negative binomial distribution.
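
To make the two-process mixture concrete, here is a minimal simulation sketch with hypothetical values \(\varphi=0.3\) and \(\mu=2\); the empirical zero fraction should agree with \(\varphi+(1-\varphi)g(0)\) from the formula above.

set.seed(1)
n   <- 10000
phi <- 0.3                        ## hypothetical probability of a structural zero (Process 1)
mu  <- 2                          ## hypothetical Poisson mean for Process 2
process1 <- rbinom(n, 1, phi)     ## Bernoulli trial selecting the process
y <- ifelse(process1 == 1, 0, rpois(n, mu))
mean(y == 0)                      ## empirical P(Y = 0)
phi + (1 - phi) * exp(-mu)        ## theoretical P(Y = 0) = phi + (1 - phi) * g(0)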

When the probability \(\varphi_i\) depends on the characteristics of observation \(i\), \(\varphi_i\) is written as a function of \({\bf z}_i^{\top}{\boldsymbol\gamma}\), where \({\bf z}_i^{\top}\) is the \(1\times (q+1)\) vector of zero-inflated covariates and \({\boldsymbol\gamma}\) is the \((q+1)\times 1\) vector of zero-inflated coefficients to be estimated. (The zero-inflated intercept is \(\gamma_0\); the coefficients for the \(q\) zero-inflated covariates are \(\gamma_1,\ldots,\gamma_q\).) The function \(F\) relating the scalar product \({\bf z}_i^{\top}{\boldsymbol\gamma}\) to the probability \(\varphi_i\) is called the zero-inflated link function, \[ \varphi_i=F({\bf z}_i^{\top}{\boldsymbol\gamma}). \] The zero-inflated link function \(F\) can be specified as either the logistic function, \[ F({\bf z}_i^{\top}{\boldsymbol\gamma})=\frac{\exp\{{\bf z}_i^{\top}{\boldsymbol\gamma}\}}{1+\exp\{{\bf z}_i^{\top}{\boldsymbol\gamma}\}}, \] or the standard normal cumulative distribution function (giving the probit link), \[ F({\bf z}_i^{\top}{\boldsymbol\gamma})=\Phi({\bf z}_i^{\top}{\boldsymbol\gamma})=\int_{-\infty}^{{\bf z}_i^{\top}{\boldsymbol\gamma}}\frac{1}{\sqrt{2\pi}}e^{-u^2/2}\,du. \]
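
Both link functions are available in base R as plogis and pnorm; this small sketch evaluates them at an arbitrary illustrative value of the linear predictor \({\bf z}_i^{\top}{\boldsymbol\gamma}\).

zg <- 0.5      ## illustrative value of the linear predictor
plogis(zg)     ## logistic link: exp(zg) / (1 + exp(zg))
pnorm(zg)      ## probit link: standard normal CDF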

ZIP Regression

In the zero-inflated Poisson (ZIP) regression model, the data generating process referred to earlier as Process 2 is Poisson, \[ g(y_i)=\frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!}, \] where \(\mu_i=e^{{\bf x}_i^{\top}{\boldsymbol\beta}}\). Thus the ZIP model is defined as \[ \begin{align*} \mathsf{P}[Y_i=0|{\bf x}_i,{\bf z}_i]&=F({\bf z}_i^{\top}{\boldsymbol\gamma})+(1-F({\bf z}_i^{\top}{\boldsymbol\gamma}))e^{-\mu_i},\\ \mathsf{P}[Y_i=y_i|{\bf x}_i,{\bf z}_i]&=(1-F({\bf z}_i^{\top}{\boldsymbol\gamma}))\frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!},\quad y_i>0. \end{align*} \]

The conditional expectation and conditional variance of \(Y_i\) are given by \[ \begin{align*} \mathsf{E}[Y_i|{\bf x}_i,{\bf z}_i]&=\mu_i(1-F({\bf z}_i^{\top}{\boldsymbol\gamma})),\\ \mathsf{Var}[Y_i|{\bf x}_i,{\bf z}_i]&=\mathsf{E}[Y_i|{\bf x}_i,{\bf z}_i](1+\mu_iF({\bf z}_i^{\top}{\boldsymbol\gamma})). \end{align*} \]

Note that the ZIP model (as well as the ZINB model) exhibits overdispersion, since \(\mathsf{Var}[Y_i|{\bf x}_i,{\bf z}_i]>\mathsf{E}[Y_i|{\bf x}_i,{\bf z}_i]\) whenever \(\varphi_i=F({\bf z}_i^{\top}{\boldsymbol\gamma})>0\) and \(\mu_i>0\).
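
As a quick sanity check on these moment formulas, the following sketch (again with hypothetical \(\varphi\) and \(\mu\)) compares the empirical moments of simulated ZIP draws with the expressions above.

set.seed(2)
phi <- 0.25; mu <- 3                       ## hypothetical parameter values
y <- ifelse(rbinom(1e5, 1, phi) == 1, 0, rpois(1e5, mu))
c(empirical = mean(y), formula = mu * (1 - phi))                   ## E[Y]
c(empirical = var(y),  formula = mu * (1 - phi) * (1 + mu * phi))  ## Var[Y]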

Wildlife fish data

We have data on 250 groups that went to a park. Each group was questioned about how many fish they caught (count), how many children were in the group (child), how many people were in the group (persons), and whether or not they brought a camper to the park (camper).

In addition to predicting the number of fish caught, there is interest in predicting the existence of excess zeros, i.e., the probability that a group caught zero fish. We will use the variables child, persons, and camper in our model. Let's look at the data.

zinb <- read.csv("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/fish.csv")
## convert the binary indicators to factors
zinb <- within(zinb, {
    nofish <- factor(nofish)
    livebait <- factor(livebait)
    camper <- factor(camper)
})

summary(zinb)
##  nofish  livebait camper     persons          child      
##  0:176   0: 34    0:103   Min.   :1.000   Min.   :0.000  
##  1: 74   1:216    1:147   1st Qu.:2.000   1st Qu.:0.000  
##                           Median :2.000   Median :0.000  
##                           Mean   :2.528   Mean   :0.684  
##                           3rd Qu.:4.000   3rd Qu.:1.000  
##                           Max.   :4.000   Max.   :3.000  
##        xb                  zg              count        
##  Min.   :-3.275050   Min.   :-5.6259   Min.   :  0.000  
##  1st Qu.: 0.008267   1st Qu.:-1.2527   1st Qu.:  0.000  
##  Median : 0.954550   Median : 0.6051   Median :  0.000  
##  Mean   : 0.973796   Mean   : 0.2523   Mean   :  3.296  
##  3rd Qu.: 1.963855   3rd Qu.: 1.9932   3rd Qu.:  2.000  
##  Max.   : 5.352674   Max.   : 4.2632   Max.   :149.000

## histogram with x axis in log10 scale
ggplot(zinb, aes(count)) + geom_histogram() + scale_x_log10()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

[Figure: histogram of count, x-axis on a log10 scale]
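
The histogram makes the spike at zero hard to quantify, so a direct tabulation is useful; a short sketch (output not shown):

## proportion of groups that caught zero fish
with(zinb, mean(count == 0))
## zero probability implied by a plain Poisson with the same overall mean
exp(-mean(zinb$count))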

Analysis methods you might consider

A number of analysis methods might be considered for data like these. Some are quite reasonable, while others have either fallen out of favor or have limitations. Below we fit the zero-inflated Poisson model and compare it with a standard Poisson regression.

Zero-inflated Poisson model

Though we can run a Poisson regression in R using the glm function from the core stats package, we need another package to fit the zero-inflated Poisson model. We use the zeroinfl function from the pscl package.

summary(m1 <- zeroinfl(count ~ child + camper | persons, data = zinb))
## 
## Call:
## zeroinfl(formula = count ~ child + camper | persons, data = zinb)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2369 -0.7540 -0.6080 -0.1921 24.0847 
## 
## Count model coefficients (poisson with log link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.59789    0.08554  18.680   <2e-16 ***
## child       -1.04284    0.09999 -10.430   <2e-16 ***
## camper1      0.83402    0.09363   8.908   <2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.2974     0.3739   3.470 0.000520 ***
## persons      -0.5643     0.1630  -3.463 0.000534 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 12 
## Log-likelihood: -1032 on 5 Df

The output looks very much like the output from fitting two separate regression models in R.

Below the model call, you will find a block of output containing Poisson regression coefficients for each of the variables along with standard errors, z-scores, and p-values for the coefficients. A second block follows that corresponds to the inflation model. This includes logit coefficients for predicting excess zeros along with their standard errors, z-scores, and p-values.

All of the predictors in both the count and the inflation portions of the model are statistically significant. This model fits the data significantly better than the null model, i.e., the intercept-only model. To show that this is the case, we can compare the current model to a null model without predictors using a chi-squared (likelihood-ratio) test on the difference of log-likelihoods.

mnull <- update(m1, . ~ 1)  ## intercept-only model for both the count and the zero part

pchisq(2 * (logLik(m1) - logLik(mnull)), df = 3, lower.tail = FALSE)
## 'log Lik.' 4.041242e-41 (df=5)

Since we have three predictor variables in the full model, the degrees of freedom for the chi-squared test are 3. This yields a highly significant (very small) p-value; thus, our overall model is statistically significant.

Note that the model output above does not in any way indicate whether our zero-inflated model is an improvement over a standard Poisson regression. We can determine this by fitting the corresponding standard Poisson model and then performing a Vuong test of the two models.

summary(p1 <- glm(count ~ child + camper, family = poisson, data = zinb))
## 
## Call:
## glm(formula = count ~ child + camper, family = poisson, data = zinb)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.7736  -2.2293  -1.2024  -0.3498  24.9492  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.91026    0.08119   11.21   <2e-16 ***
## child       -1.23476    0.08029  -15.38   <2e-16 ***
## camper1      1.05267    0.08871   11.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2958.4  on 249  degrees of freedom
## Residual deviance: 2380.1  on 247  degrees of freedom
## AIC: 2723.2
## 
## Number of Fisher Scoring iterations: 6
vuong(p1, m1)
## Vuong Non-Nested Hypothesis Test-Statistic: 1.947207 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## in this case:
## model1 > model2, with p-value 0.025755

The Vuong test compares the zero-inflated model with an ordinary Poisson regression model. In this example, the test statistic is significant at the 5% level, indicating that the zero-inflated model fits significantly better than the standard Poisson model.

We can get confidence intervals for the parameters and for the exponentiated parameters using bootstrapping. For the Poisson model these are incidence rate ratios; for the zero-inflation model, odds ratios. We use the boot package. First, we get the coefficients from our original model to use as starting values, which speeds up estimation. Then we write a short function that takes data and indices as input and returns the parameters we are interested in. Finally, we pass that to the boot function and run 1200 replicates, using snow to distribute the work across four cores. Note that you should adjust the number of cores to whatever your machine has. Also, for final results, one may wish to increase the number of replications to help ensure stable results.

dput(coef(m1, "count"))
## structure(c(1.59788828690411, -1.04283909332231, 0.834023618148891
## ), .Names = c("(Intercept)", "child", "camper1"))
dput(coef(m1, "zero"))
## structure(c(1.29744027908309, -0.564347365357873), .Names = c("(Intercept)", 
## "persons"))
f <- function(data, i) {
  require(pscl)
  ## refit the model on the bootstrap sample, starting from the original estimates
  m <- zeroinfl(count ~ child + camper | persons, data = data[i, ],
    start = list(count = c(1.598, -1.0428, 0.834), zero = c(1.297, -0.564)))
  ## return coefficients and standard errors, interleaved: est1, se1, est2, se2, ...
  as.vector(t(do.call(rbind, coef(summary(m)))[, 1:2]))
}

set.seed(10)
res <- boot(zinb, f, R = 1200, parallel = "snow", ncpus = 4)

## print results
res
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = zinb, statistic = f, R = 1200, parallel = "snow", 
##     ncpus = 4)
## 
## 
## Bootstrap Statistics :
##         original       bias    std. error
## t1*   1.59788855 -0.056660729  0.30306625
## t2*   0.08553816  0.004256622  0.01670207
## t3*  -1.04283849 -0.002509802  0.40556584
## t4*   0.09998829  0.004395386  0.01539300
## t5*   0.83402218  0.017178157  0.40465431
## t6*   0.09362679  0.004581183  0.01535966
## t7*   1.29743916  0.020810017  0.48058191
## t8*   0.37385225  0.008224433  0.03661806
## t9*  -0.56434716 -0.030102634  0.26673442
## t10*  0.16296381  0.005272244  0.02980715

The statistics alternate between parameter estimates and standard errors: the first row (t1*) is the first parameter estimate from our model, the second row (t2*) is its standard error, and so on. The third column ("std. error") contains the bootstrapped standard errors, which are considerably larger than those estimated by zeroinfl.
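
To keep track of which bootstrap statistic is which, labels can be attached to the interleaved vector; the naming scheme below is purely illustrative.

## interleave "<coef>_est" and "<coef>_se" labels in the order f() returns them
stat_names <- as.vector(rbind(paste0(names(coef(m1)), "_est"),
                              paste0(names(coef(m1)), "_se")))
setNames(res$t0, stat_names)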

Now we can get the confidence intervals for all the parameters. We start on the original scale with percentile and bias adjusted CIs. We also compare these results with the regular confidence intervals based on the standard errors.

## basic parameter estimates with percentile and bias-adjusted CIs
parms <- t(sapply(c(1, 3, 5, 7, 9), function(i) {
  out <- boot.ci(res, index = c(i, i + 1), type = c("perc", "bca"))
  with(out, c(Est = t0, pLL = percent[4], pUL = percent[5],
     bcaLL = bca[4], bcaUL = bca[5]))
}))

## add row names
row.names(parms) <- names(coef(m1))
## print results
parms
##                          Est         pLL         pUL        bcaLL
## count_(Intercept)  1.5978885  0.87925947  2.07809608  1.087354217
## count_child       -1.0428385 -1.75091143 -0.17531081 -1.618508766
## count_camper1      0.8340222  0.05959868  1.62653099  0.001571012
## zero_(Intercept)   1.2974392  0.35031809  2.21983762  0.293576629
## zero_persons      -0.5643472 -1.10869954 -0.07847424 -1.008525816
##                          bcaUL
## count_(Intercept)  2.226142826
## count_child       -0.022034643
## count_camper1      1.599950663
## zero_(Intercept)   2.120703763
## zero_persons       0.006329548
confint(m1)
##                        2.5 %     97.5 %
## count_(Intercept)  1.4302366  1.7655400
## count_child       -1.2388125 -0.8468657
## count_camper1      0.6505185  1.0175288
## zero_(Intercept)   0.5647033  2.0301772
## zero_persons      -0.8837505 -0.2449442

The bootstrapped confidence intervals are considerably wider than the normal-based approximations. The bootstrapped CIs are more consistent with the CIs from Stata when using robust standard errors.

Now we can estimate the incidence rate ratios (IRR) for the Poisson model and the odds ratios (OR) for the logistic (zero-inflation) model. This is done using almost identical code as before, but passing a transformation function to the h argument of boot.ci, in this case exp to exponentiate.

## exponentiated parameter estimates with percentile and bias-adjusted CIs
expparms <- t(sapply(c(1, 3, 5, 7, 9), function(i) {
  out <- boot.ci(res, index = c(i, i + 1), type = c("perc", "bca"), h = exp)
  with(out, c(Est = t0, pLL = percent[4], pUL = percent[5],
    bcaLL = bca[4], bcaUL = bca[5]))
}))

## add row names
row.names(expparms) <- names(coef(m1))
## print results
expparms
##                         Est       pLL       pUL     bcaLL     bcaUL
## count_(Intercept) 4.9425854 2.4091162 7.9892442 2.9664152 9.2640640
## count_child       0.3524528 0.1736157 0.8391964 0.1981940 0.9782063
## count_camper1     2.3025614 1.0614130 5.0862000 1.0015722 4.9527881
## zero_(Intercept)  3.6599122 1.4195204 9.2058363 1.3412160 8.3370027
## zero_persons      0.5687313 0.3299878 0.9245259 0.3647563 1.0063496
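
For comparison, the point estimates (though not the bootstrap intervals) can also be obtained by exponentiating the fitted coefficients directly:

exp(coef(m1))  ## IRRs for the count part, ORs for the zero-inflation part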

To better understand our model, we can compute the expected number of fish caught for different combinations of our predictors. In fact, since we are working with essentially categorical predictors, we can compute the expected values for all combinations, using the expand.grid function to create all combinations and the predict function to evaluate them. We also use the subset function to remove any rows where the number of children exceeds the number of persons, which makes no sense logically. Finally, we create a graph.

newdata1 <- expand.grid(0:3, factor(0:1), 1:4)
colnames(newdata1) <- c("child", "camper", "persons")
newdata1 <- subset(newdata1, subset=(child<=persons))
newdata1$phat <- predict(m1, newdata1)  ## default type = "response": expected count

ggplot(newdata1, aes(x = child, y = phat, colour = factor(persons))) +
  geom_point() +
  geom_line() +
  facet_wrap(~camper) +
  labs(x = "Number of Children", y = "Predicted Fish Caught")

[Figure: predicted number of fish caught by number of children, colored by group size (persons), paneled by camper]
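
Since we were also interested in the probability that a group is an excess (structural) zero, the same predict method returns the estimated zero-inflation probabilities with type = "zero":

## predicted probability of a structural zero for each covariate combination
newdata1$pzero <- predict(m1, newdata1, type = "zero")
head(newdata1)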
