Zero-inflated negative binomial regression

Zero-inflated negative binomial regression is used to model count variables with excessive zeros, and it is usually appropriate for overdispersed count outcome variables. Furthermore, theory suggests that the excess zeros are generated by a separate process from the count values and that the excess zeros can be modeled independently.

require(pscl)
## Loading required package: pscl
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
library(MASS)
library(boot)
library(ggplot2)

Examples of zero-inflated negative binomial regression

Theoretical background

ZINB Regression

A zero-inflated model assumes that the zero outcomes are due to two different processes. For instance, in the example of fishing presented here, the two processes are that a subject has gone fishing vs. not gone fishing. If the subject has not gone fishing, the only possible outcome is zero. If the subject has gone fishing, the outcome follows a count process. The two parts of a zero-inflated model are a binary model, usually a logit model, to model which of the two processes the zero outcome is associated with, and a count model, in this case a negative binomial model, to model the count process.

The zero-inflated negative binomial (ZINB) model is based on the negative binomial model with quadratic variance function (p=2), i.e., NB2. The ZINB model is obtained by specifying a negative binomial distribution for the data generation process referred to earlier as Process 2:
\[ g(y_i)=\frac{\Gamma(y_i+\alpha^{-1})}{y_i!\,\Gamma(\alpha^{-1})}\left(\frac{\alpha^{-1}}{\alpha^{-1}+\mu_i}\right)^{\alpha^{-1}}\left(\frac{\mu_i}{\alpha^{-1}+\mu_i}\right)^{y_i},\quad y_i=0,1,2,\ldots \]
Thus the ZINB model is defined to be
\[ \begin{align*} \mathsf{P}[Y_i=0|{\bf x}_i,{\bf z}_i]&=F({\bf z}_i^{\top}{\boldsymbol\gamma})+(1-F({\bf z}_i^{\top}{\boldsymbol\gamma}))(1+\alpha\mu_i)^{-\alpha^{-1}},\\ \mathsf{P}[Y_i=y_i|{\bf x}_i,{\bf z}_i]&=(1-F({\bf z}_i^{\top}{\boldsymbol\gamma}))\frac{\Gamma(y_i+\alpha^{-1})}{y_i!\,\Gamma(\alpha^{-1})}\left(\frac{\alpha^{-1}}{\alpha^{-1}+\mu_i}\right)^{\alpha^{-1}}\left(\frac{\mu_i}{\alpha^{-1}+\mu_i}\right)^{y_i},\quad y_i>0. \end{align*} \]
In this case, the conditional expectation and conditional variance of \(Y_i\) are
\[ \begin{align*} \mathsf{E}[Y_i|{\bf x}_i,{\bf z}_i]&=\mu_i(1-F({\bf z}_i^{\top}{\boldsymbol\gamma})),\\ \mathsf{Var}[Y_i|{\bf x}_i,{\bf z}_i]&=\mathsf{E}[Y_i|{\bf x}_i,{\bf z}_i]\left(1+\mu_i\left(F({\bf z}_i^{\top}{\boldsymbol\gamma})+\alpha\right)\right). \end{align*} \]
As with the ZIP model, the ZINB model exhibits overdispersion because the conditional variance exceeds the conditional mean.

Finally, note that R does not estimate \(\alpha\) but \(\theta\), the inverse of \(\alpha\).
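To make the formulas concrete, here is a minimal illustrative sketch, not part of the analysis below, that evaluates the ZINB mass function, mean, and variance directly from the expressions above. It writes \(\pi\) for the zero-inflation probability \(F({\bf z}^{\top}{\boldsymbol\gamma})\) and uses dnbinom with size \(=\alpha^{-1}=\theta\):

## illustrative only: ZINB pmf, mean, and variance from the formulas above,
## with pi denoting the zero-inflation probability F(z'gamma)
dzinb <- function(y, mu, theta, pi) {
  nb <- dnbinom(y, size = theta, mu = mu)  # NB2 pmf with alpha = 1/theta
  ifelse(y == 0, pi + (1 - pi) * nb, (1 - pi) * nb)
}
zinb_mean <- function(mu, theta, pi) mu * (1 - pi)
zinb_var <- function(mu, theta, pi) mu * (1 - pi) * (1 + mu * (pi + 1 / theta))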

Wildlife fish data revised

Let's pursue Example 2 from above.

We have data on 250 groups that went to a park. Each group was questioned about how many fish they caught (count), how many children were in the group (child), how many people were in the group (persons), and whether or not they brought a camper to the park (camper).

In addition to predicting the number of fish caught, there is interest in predicting the existence of excess zeros, i.e., the probability that a group caught zero fish. We will use the variables child, persons, and camper in our model. Let's look at the data.

zinb <- read.csv("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/fish.csv")
zinb <- within(zinb, {
  nofish <- factor(nofish)
  livebait <- factor(livebait)
  camper <- factor(camper)
})

summary(zinb)
##  nofish  livebait camper     persons          child      
##  0:176   0: 34    0:103   Min.   :1.000   Min.   :0.000  
##  1: 74   1:216    1:147   1st Qu.:2.000   1st Qu.:0.000  
##                           Median :2.000   Median :0.000  
##                           Mean   :2.528   Mean   :0.684  
##                           3rd Qu.:4.000   3rd Qu.:1.000  
##                           Max.   :4.000   Max.   :3.000  
##        xb                  zg              count        
##  Min.   :-3.275050   Min.   :-5.6259   Min.   :  0.000  
##  1st Qu.: 0.008267   1st Qu.:-1.2527   1st Qu.:  0.000  
##  Median : 0.954550   Median : 0.6051   Median :  0.000  
##  Mean   : 0.973796   Mean   : 0.2523   Mean   :  3.296  
##  3rd Qu.: 1.963855   3rd Qu.: 1.9932   3rd Qu.:  2.000  
##  Max.   : 5.352674   Max.   : 4.2632   Max.   :149.000
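The summary already hints at the excess zeros: the median count is 0 while the mean is about 3.3. As a quick check of how heavy the zero spike is (output not shown), we can tabulate the number and share of groups that caught no fish:

## number and proportion of groups that caught zero fish
with(zinb, c(n.zero = sum(count == 0), prop.zero = mean(count == 0)))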

## histogram with x axis in log10 scale
ggplot(zinb, aes(count, fill = camper)) +
  geom_histogram() +
  scale_x_log10() +
  facet_grid(camper ~ ., margins=TRUE, scales="free_y")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

[plot of chunk zinb-hist: histograms of count (log10 x axis), faceted by camper]

Analysis methods you might consider

Before we show how you can analyze this with a zero-inflated negative binomial analysis, let's consider some other methods that you might use.

Zero-inflated NB model

Now let's build up our model. We are going to use the variables child and camper to model the count in the negative binomial part of the model and the variable persons in the logit part of the model. We use the pscl package to run a zero-inflated negative binomial regression. We begin by estimating the model with the variables of interest.

m1 <- zeroinfl(count ~ child + camper | persons, data = zinb, dist = "negbin", 
    EM = TRUE)
summary(m1)
## 
## Call:
## zeroinfl(formula = count ~ child + camper | persons, data = zinb, 
##     dist = "negbin", EM = TRUE)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5861 -0.4617 -0.3886 -0.1974 18.0129 
## 
## Count model coefficients (negbin with log link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.3711     0.2561   5.353 8.63e-08 ***
## child        -1.5152     0.1956  -7.747 9.42e-15 ***
## camper1       0.8790     0.2693   3.264   0.0011 ** 
## Log(theta)   -0.9854     0.1759  -5.600 2.14e-08 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   1.6028     0.8363   1.916   0.0553 .
## persons      -1.6663     0.6790  -2.454   0.0141 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 0.3733 
## Number of iterations in BFGS optimization: 2 
## Log-likelihood: -432.9 on 6 Df

The output looks very much like the output from two OLS regressions in R.

Below the model call, you will find a block of output containing negative binomial regression coefficients for each of the variables along with standard errors, z-scores, and p-values for the coefficients. A second block follows that corresponds to the inflation model. This includes logit coefficients for predicting excess zeros along with their standard errors, z-scores, and p-values.

All of the predictors in both the count and inflation portions of the model are statistically significant. This model fits the data significantly better than the null model, i.e., the intercept-only model. To show that this is the case, we can compare the current model to a null model without predictors using a chi-squared test on the difference of log likelihoods.

m0 <- update(m1, . ~ 1)

pchisq(2 * (logLik(m1) - logLik(m0)), df = 3, lower.tail = FALSE)
## 'log Lik.' 1.280471e-13 (df=6)

From the output above, we can see that our overall model is statistically significant.
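The df = 3 used above is simply the difference in the number of estimated parameters between the two fits: m1 estimates three count coefficients, two zero-inflation coefficients, and log(theta), while m0 estimates two intercepts and log(theta). As a small check (output not shown), the difference can be read off the logLik objects:

## difference in estimated parameters between the full and the null model
attr(logLik(m1), "df") - attr(logLik(m0), "df")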

Note that the model output above does not indicate in any way if our zero-inflated model is an improvement over a standard negative binomial regression. We can determine this by running the corresponding standard negative binomial model and then performing a Vuong test of the two models. We use the MASS package to run the standard negative binomial regression.

summary(m2 <- glm.nb(count ~ child + camper, data = zinb))
## 
## Call:
## glm.nb(formula = count ~ child + camper, data = zinb, init.theta = 0.2552931119, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3141  -1.0361  -0.7266  -0.1720   4.0163  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.0727     0.2425   4.424 9.69e-06 ***
## child        -1.3753     0.1958  -7.025 2.14e-12 ***
## camper1       0.9094     0.2836   3.206  0.00135 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.2553) family taken to be 1)
## 
##     Null deviance: 258.93  on 249  degrees of freedom
## Residual deviance: 201.89  on 247  degrees of freedom
## AIC: 887.42
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.2553 
##           Std. Err.:  0.0329 
## 
##  2 x log-likelihood:  -879.4210
vuong(m1, m2)
## Vuong Non-Nested Hypothesis Test-Statistic: -3.819662 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## in this case:
## model2 > model1, with p-value 6.6817e-05
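As a quick supplementary check (not part of the original analysis), the two fits can also be compared informally with AIC, since both model objects provide log-likelihoods and parameter counts:

## informal AIC comparison of the ZINB fit (m1) and the standard NB fit (m2);
## lower AIC is better (output not shown)
AIC(m1, m2)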

We can get confidence intervals for the parameters and the exponentiated parameters using bootstrapping. For the negative binomial model, these would be incident risk ratios; for the zero-inflation model, odds ratios. We use the boot package. First, we get the coefficients from our original model to use as starting values, which speeds up estimation. Then we write a short function that takes data and indices as input and returns the parameters we are interested in. Finally, we pass that to the boot function and draw 1200 replicates, using snow to distribute the work across four cores. Note that you should adjust the number of cores to whatever your machine has. Also, for final results, one may wish to increase the number of replications to help ensure stable results.

dput(round(coef(m1, "count"), 4))
## structure(c(1.3711, -1.5152, 0.879), .Names = c("(Intercept)", 
## "child", "camper1"))
dput(round(coef(m1, "zero"), 4))
## structure(c(1.6028, -1.6663), .Names = c("(Intercept)", "persons"
## ))
f <- function(data, i) {
  require(pscl)
  ## refit the ZINB model on the bootstrap sample given by the indices i,
  ## using the original estimates as starting values
  m <- zeroinfl(count ~ child + camper | persons,
    data = data[i, ], dist = "negbin",
    start = list(count = c(1.3711, -1.5152, 0.879), zero = c(1.6028, -1.6663)))
  ## return the estimates and standard errors, interleaved, as a single vector
  as.vector(t(do.call(rbind, coef(summary(m)))[, 1:2]))
}

set.seed(10)
(res <- boot(zinb, f, R = 1200, parallel = "snow", ncpus = 4))
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = zinb, statistic = f, R = 1200, parallel = "snow", 
##     ncpus = 4)
## 
## 
## Bootstrap Statistics :
##        original       bias    std. error
## t1*   1.3710504 -0.083293503  0.39439368
## t2*   0.2561136 -0.002616572  0.03190918
## t3*  -1.5152609 -0.061393126  0.26877799
## t4*   0.1955916  0.006027950  0.02026508
## t5*   0.8790522  0.091642830  0.47150032
## t6*   0.2692734  0.001873325  0.01998125
## t7*  -0.9853566  0.080190346  0.21907572
## t8*   0.1759504  0.002580684  0.01689353
## t9*   1.6031354  0.473483335  1.59330502
## t10*  0.8365225  3.767327930 15.65780310
## t11* -1.6665917 -0.462324216  1.56790036
## t12*  0.6793077  3.771998568 15.69674577

The results are alternating parameter estimates and standard errors: the first row has the first parameter estimate from our model, the second row has the standard error of that estimate, and so on. The third column ("std. error") contains the bootstrapped standard errors, which are considerably larger than those estimated by zeroinfl.
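Because estimates and standard errors alternate in res$t0, it can help to line them up explicitly. A small convenience sketch (output not shown) pairs each estimate, stored at the odd positions, with a bootstrap standard error computed from the replicates in res$t:

## pair each original estimate (odd positions of res$t0) with its bootstrap SE
est.idx <- seq(1, length(res$t0), by = 2)
cbind(estimate = res$t0[est.idx],
      boot.se = apply(res$t[, est.idx, drop = FALSE], 2, sd))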

Now we can get confidence intervals for the parameters. We start on the original scale with percentile and bias-adjusted (BCa) CIs. We also compare these results with the regular confidence intervals based on the standard errors.

## basic parameter estimates with percentile and bias adjusted CIs
parms <- t(sapply(c(1, 3, 5, 7, 9), function(i) {
  out <- boot.ci(res, index = c(i, i + 1), type = c("perc", "bca"))
  with(out, c(Est = t0, pLL = percent[4], pUL = percent[5], 
    bcaLL = bca[4], bcaUL = bca[5]))
}))

## add row names; the bootstrapped statistics include Log(theta) (indices 7-8),
## so indices 1-10 cover the count coefficients, Log(theta), and the zero-model
## intercept (zero_persons sits at indices 11-12 and is not extracted here)
row.names(parms) <- c(names(coef(m1))[1:3], "Log(theta)", names(coef(m1))[4])
## print results
parms
##                          Est        pLL        pUL      bcaLL      bcaUL
## count_(Intercept)  1.3710504  0.5132187  2.0539932  0.7563477  2.3268123
## count_child       -1.5152609 -2.1495185 -1.0807496 -1.9866681 -0.9827549
## count_camper1      0.8790522  0.1119459  1.8628376 -0.2267101  1.6676446
## Log(theta)        -0.9853566 -1.3142565 -0.4465283 -1.4471843 -0.6407219
## zero_(Intercept)   1.6031354  0.3519777  8.0795923 -0.1857838  3.6867011
## compare with normal based approximation
confint(m1)
##                         2.5 %     97.5 %
## count_(Intercept)  0.86910572  1.8730573
## count_child       -1.89860155 -1.1318879
## count_camper1      0.35126797  1.4068178
## zero_(Intercept)  -0.03635521  3.2418992
## zero_persons      -2.99701357 -0.3354987

The bootstrapped confidence intervals are considerably wider than the normal-based approximation. The bootstrapped CIs are more consistent with the CIs from Stata when using robust standard errors.

Now we can estimate the incident risk ratio (IRR) for the negative binomial model and odds ratio (OR) for the logistic (zero inflation) model. This is done using almost identical code as before, but passing a transformation function to the h argument of boot.ci, in this case, exp to exponentiate.

## exponentiated parameter estimates with percentile and bias adjusted CIs
expparms <- t(sapply(c(1, 3, 5, 7, 9), function(i) {
  out <- boot.ci(res, index = c(i, i + 1), type = c("perc", "bca"), h = exp)
  with(out, c(Est = t0, pLL = percent[4], pUL = percent[5], 
    bcaLL = bca[4], bcaUL = bca[5]))
}))

## add row names (same labelling note as for parms above)
row.names(expparms) <- c(names(coef(m1))[1:3], "Log(theta)", names(coef(m1))[4])
## print results
expparms
##                         Est       pLL          pUL     bcaLL      bcaUL
## count_(Intercept) 3.9394864 1.6706640    7.7989834 2.1304808 10.2452309
## count_child       0.2197509 0.1165403    0.3393411 0.1371516  0.3742786
## count_camper1     2.4086158 1.1184525    6.4420052 0.7971518  5.2996704
## Log(theta)        0.3733061 0.2686740    0.6398457 0.2352317  0.5269119
## zero_(Intercept)  4.9685866 1.4218768 3227.9195721 0.8304531 39.9129615

To better understand our model, we can compute the expected number of fish caught for different combinations of our predictors. In fact, since we are working with essentially categorical predictors, we can use the expand.grid function to create all combinations of predictor values and the predict function to compute the expected counts for each. Finally, we create a graph.

newdata1 <- expand.grid(0:3, factor(0:1), 1:4)
colnames(newdata1) <- c("child", "camper", "persons")
newdata1$phat <- predict(m1, newdata1)

ggplot(newdata1, aes(x = child, y = phat, colour = factor(persons))) +
  geom_point() +
  geom_line() +
  facet_wrap(~camper) +
  labs(x = "Number of Children", y = "Predicted Fish Caught")

[plot of chunk zinb-predict: predicted number of fish caught vs. number of children, by persons and camper]
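Since predicting excess zeros was part of our motivation, note that predict for zeroinfl objects also accepts type = "zero", the estimated probability of a structural (always-zero) observation, and type = "count", the mean of the negative binomial component alone. A brief sketch reusing newdata1 (output not shown):

## probability of a structural zero for each predictor combination
newdata1$pzero <- predict(m1, newdata1, type = "zero")
## mean of the count component alone
newdata1$mu <- predict(m1, newdata1, type = "count")
head(newdata1)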

Things to consider

Here are some issues that you may want to consider in the course of your research analysis.
