Negative binomial regression is for modeling count variables, usually for over-dispersed count outcome variables.
library(foreign) library(MASS) library(ggplot2)
Example 1. School administrators study the attendance behavior of high school juniors at two schools. Predictors of the number of days of absence include the type of program in which the student is enrolled and a standardized test in math.
Example 2. A health-related researcher is studying the number of hospital visits in past 12 months by senior citizens in a community based on the characteristics of the individuals and the types of health plans under which each one is covered.
The Poisson regression model can be generalized by introducing an unobserved heterogeneity term for observation \(i\). Thus, the individuals are assumed to differ randomly in a manner that is not fully accounted for by the observed covariates. This is formulated as
\[
\mathsf{E}[Y_i|{\bf x}_i,\tau_i]=\mu_i\tau_i=e^{{\bf x}_i^{\top}{\boldsymbol\beta}+\varepsilon_i},
\]
where the unobserved heterogeneity term \(\tau_i=e^{\varepsilon_i}\) is independent of the vector of regressors \({\bf x}_i\). Then the distribution of \(Y_i\) conditional on \({\bf x}_i\) and \(\tau_i\) is Poisson with conditional mean and conditional variance \(\mu_i\tau_i\):
\[
f(y_i|{\bf x}_i,\tau_i)=\frac{e^{-\mu_i\tau_i}(\mu_i\tau_i)^{y_i}}{y_i!},\quad y_i=0,1,2,\ldots
\]
Let \(g(\tau_i)\) be the probability density function of \(\tau_i\). Then, the distribution \(f(y_i|{\bf x}_i)\) (no longer conditional on \(\tau_i\)) is obtained by integrating \(f(y_i|{\bf x}_i,\tau_i)\) with respect to \(\tau_i\):
\[
f(y_i|{\bf x}_i)=\int_0^{\infty}f(y_i|{\bf x}_i,\tau_i)g(\tau_i)d\tau_i.
\]
An analytical solution to this integral exists when \(\tau_i\) is assumed to follow a gamma distribution. This solution is the negative binomial distribution. When the model contains a constant term, it is necessary to assume that \(\mathsf{E}e^{\varepsilon_i}=\mathsf{E}\tau_i=1\), in order to identify the mean of the distribution. Thus, it is assumed that \(\tau_i\) follows a gamma(\(\theta\),\(\theta\)) distribution with \(\mathsf{E}\tau_i=1\) and \(\mathsf{Var}\tau_i=1/\theta\):
\[
g(\tau_i)=\frac{\theta^{\theta}}{\Gamma(\theta)}\tau_i^{\theta-1}\exp\{-\theta\tau_i\},
\]
where \(\Gamma(x)=\int_0^{\infty}z^{x-1}\exp\{-z\}dz\) is the gamma function and \(\theta\) is a positive parameter. Then, the density of \(Y_i\) given \({\bf x}_i\) is derived as
\[
f(y_i|{\bf x}_i)=\frac{\Gamma(y_i+\theta)}{y_i!\Gamma(\theta)}\left(\frac{\theta}{\theta+\mu_i}\right)^{\theta}\left(\frac{\mu_i}{\theta+\mu_i}\right)^{y_i}.
\]
Making the substitution \(\alpha=1/\theta\) (\(\alpha>0\)), the negative binomial distribution can then be rewritten as
\[
f(y_i|{\bf x}_i)=\frac{\Gamma(y_i+\alpha^{-1})}{y_i!\Gamma(\alpha^{-1})}\left(\frac{\alpha^{-1}}{\alpha^{-1}+\mu_i}\right)^{\alpha^{-1}}\left(\frac{\mu_i}{\alpha^{-1}+\mu_i}\right)^{y_i},\quad y_i=0,1,2,\ldots
\]
Thus, the negative binomial distribution is derived as a gamma mixture of Poisson random variables. It has conditional mean
\[
\mathsf{E}[Y_i|{\bf x}_i]=e^{{\bf x}_i^{\top}{\boldsymbol\beta}}
\]
and conditional variance
\[
\mathsf{Var}[Y_i|{\bf x}_i]=\mu_i(1+\mu_i/\theta)=\mu_i(1+\alpha\mu_i)>\mathsf{E}[Y_i|{\bf x}_i].
\]
The conditional variance of the negative binomial distribution exceeds the conditional mean. Overdispersion results from neglected unobserved heterogeneity. The negative binomial model with variance function \(\mathsf{Var}[Y_i|{\bf x}_i]=\mu_i+\alpha\mu_i^2\), which is quadratic in the mean, is referred to as the NB2 model. The Poisson distribution is a special case of the negative binomial distribution where \(\alpha=0\). A test of the Poisson distribution can be carried out by testing the hypothesis that \(\alpha=0\). A Wald test of this hypothesis is used.
Let's pursue Example 1 from above.
We have attendance data on 314 high school juniors from two urban high schools in the file nb_data
. The response variable of interest is days absent, daysabs
. The variable math
gives the standardized math score for each student. The variable prog
is a three-level nominal variable indicating the type of instructional program in which the student is enrolled.
Let's look at the data. It is always a good idea to start with descriptive statistics and plots.
dat <- read.dta("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/nb_data.dta") dat <- within(dat, { prog <- factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational")) id <- factor(id) }) summary(dat)
## id gender math daysabs ## 1001 : 1 female:160 Min. : 1.00 Min. : 0.000 ## 1002 : 1 male :154 1st Qu.:28.00 1st Qu.: 1.000 ## 1003 : 1 Median :48.00 Median : 4.000 ## 1004 : 1 Mean :48.27 Mean : 5.955 ## 1005 : 1 3rd Qu.:70.00 3rd Qu.: 8.000 ## 1006 : 1 Max. :99.00 Max. :35.000 ## (Other):308 ## prog ## General : 40 ## Academic :167 ## Vocational:107 ## ## ## ##
ggplot(dat, aes(daysabs, fill = prog)) + geom_histogram(binwidth = 1) + facet_grid(prog ~ ., margins = TRUE, scales = "free")
Let's continue with our description of the variables in this dataset. The table below shows the average numbers of days absent by program type and seems to suggest that program type is a good candidate for predicting the number of days absent, our outcome variable, because the mean value of the outcome appears to vary by prog
. The variances within each level of prog
are higher than the means within each level. These are the conditional means and variances. These differences suggest that over-dispersion is present and that a Negative Binomial model would be appropriate.
with(dat, tapply(daysabs, prog, function(x) { sprintf("M (SD) = %1.2f (%1.2f)", mean(x), sd(x)) }))
## General Academic Vocational ## "M (SD) = 10.65 (8.20)" "M (SD) = 6.93 (7.45)" "M (SD) = 2.67 (3.73)"
Below is a list of some analysis methods you may have encountered. Some of the methods listed are quite reasonable, while others have either fallen out of favor or have limitations.
Below we use the glm.nb
function from the MASS
package to estimate a negative binomial regression.
summary(m1 <- glm.nb(daysabs ~ math + prog, data = dat))
## ## Call: ## glm.nb(formula = daysabs ~ math + prog, data = dat, init.theta = 1.032713156, ## link = log) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.1547 -1.0192 -0.3694 0.2285 2.5273 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 2.615265 0.197460 13.245 < 2e-16 *** ## math -0.005993 0.002505 -2.392 0.0167 * ## progAcademic -0.440760 0.182610 -2.414 0.0158 * ## progVocational -1.278651 0.200720 -6.370 1.89e-10 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for Negative Binomial(1.0327) family taken to be 1) ## ## Null deviance: 427.54 on 313 degrees of freedom ## Residual deviance: 358.52 on 310 degrees of freedom ## AIC: 1741.3 ## ## Number of Fisher Scoring iterations: 1 ## ## ## Theta: 1.033 ## Std. Err.: 0.106 ## ## 2 x log-likelihood: -1731.258
math
has a coefficient of -0.006, which is statistically significant. This means that for each one-unit increase in math
, the expected log count of the number of days absent decreases by 0.006. The indicator variable shown as progAcademic is the expected difference in log count between group 2 and the reference group (prog
=1). The expected log count for level 2 of prog is 0.44 lower than the expected log count for level 1. The indicator variable for progVocational
is the expected difference in log count between group 3 and the reference group.The expected log count for level 3 of prog
is 1.28 lower than the expected log count for level 1. To determine if prog
itself, overall, is statistically significant, we can compare a model with and without prog
. The reason it is important to fit separate models, is that unless we do, the overdispersion parameter is held constant.
m2 <- update(m1, . ~ . - prog) anova(m1, m2)
## Likelihood ratio tests of Negative Binomial Models ## ## Response: daysabs ## Model theta Resid. df 2 x log-lik. Test df LR stat. ## 1 math 0.8558565 312 -1776.306 ## 2 math + prog 1.0327132 310 -1731.258 1 vs 2 2 45.04798 ## Pr(Chi) ## 1 ## 2 1.65179e-10
daysabs
.As we mentioned earlier, negative binomial models assume the conditional means are not equal to the conditional variances. This inequality is captured by estimating a dispersion parameter (not shown in the output) that is held constant in a Poisson model. Thus, the Poisson model is actually nested in the negative binomial model. We can then use a likelihood ratio test to compare these two and test this model assumption. To do this, we will run our model as a Poisson.
m3 <- glm(daysabs ~ math + prog, family = "poisson", data = dat) pchisq(2 * (logLik(m1) - logLik(m3)), df = 1, lower.tail = FALSE)
## 'log Lik.' 2.157298e-203 (df=5)
We can get the confidence intervals for the coefficients by profiling the likelihood function.
(est <- cbind(Estimate = coef(m1), confint(m1)))
## Estimate 2.5 % 97.5 % ## (Intercept) 2.615265446 2.24205576 3.012935926 ## math -0.005992988 -0.01090086 -0.001066615 ## progAcademic -0.440760012 -0.81006586 -0.092643481 ## progVocational -1.278650721 -1.68348970 -0.890077623
exp(est)
## Estimate 2.5 % 97.5 % ## (Intercept) 13.6708448 9.4126616 20.3470498 ## math 0.9940249 0.9891583 0.9989340 ## progAcademic 0.6435471 0.4448288 0.9115184 ## progVocational 0.2784127 0.1857247 0.4106239
prog
= 2 is 0.64 times the incident rate for the reference group (prog
= 1). Likewise, the incident rate for prog
= 3 is 0.28 times the incident rate for the reference group holding the other variables constant. The percent change in the incident rate of daysabs
is a 1% decrease for every unit increase in math
.
The form of the model equation for negative binomial regression is the same as that for Poisson regression. The log of the outcome is predicted with a linear combination of the predictors: \[ \log(\widehat{daysabs}_i)=\widehat{\beta}_0+\widehat{\beta}_1\mathcal{I}(prog_i=2)+\widehat{\beta}_2\mathcal{I}(prog_i=3)+\widehat{\beta}_3 math_i \] The coefficients have an additive effect in the \(\log(y)\) scale and the IRR have a multiplicative effect in the \(y\) scale. The dispersion parameter in negative binomial regression does not effect the expected counts, but it does effect the estimated variance of the expected counts. More details can be found in the Modern Applied Statistics with S by W.N. Venables and B.D. Ripley (the book companion of the MASS package).
For additional information on the various metrics in which the results can be presented, and the interpretation of such, please see Regression Models for Categorical Dependent Variables Using Stata, Second Edition by J. Scott Long and Jeremy Freese (2006).
For assistance in further understanding the model, we can look at predicted counts for various levels of our predictors. Below we create new datasets with values of math
and prog
and then use the predict
command to calculate the predicted number of events.
First, we can look at predicted counts for each value of prog
while holding math
at its mean. To do this, we create a new dataset with the combinations of prog
and math
for which we would like to find predicted values, then use the predict
command.
newdata1 <- data.frame(math = mean(dat$math), prog = factor(1:3, levels = 1:3, labels = levels(dat$prog))) newdata1$phat <- predict(m1, newdata1, type = "response") newdata1
## math prog phat ## 1 48.26752 General 10.236899 ## 2 48.26752 Academic 6.587927 ## 3 48.26752 Vocational 2.850083
math
at its mean. The predicted number of events for an academic program is lower at 6.59, and the predicted number of events for a vocational program is about 2.85.
Below we will obtain the mean predicted number of events for values of math across its entire range for each level of prog
and graph
these.
newdata2 <- data.frame( math = rep(seq(from = min(dat$math), to = max(dat$math), length.out = 100), 3), prog = factor(rep(1:3, each = 100), levels = 1:3, labels = levels(dat$prog))) newdata2 <- cbind(newdata2, predict(m1, newdata2, type = "link", se.fit=TRUE)) newdata2 <- within(newdata2, { DaysAbsent <- exp(fit) LL <- exp(fit - 1.96 * se.fit) UL <- exp(fit + 1.96 * se.fit) })
ggplot(newdata2, aes(math, DaysAbsent)) + geom_ribbon(aes(ymin = LL, ymax = UL, fill = prog), alpha = .25) + geom_line(aes(colour = prog), size = 2) + labs(x = "Math Score", y = "Predicted Days Absent")
offset
option. See the glm
documentation for details.m1$resid
command to obtain the residuals from our model to check other assumptions of the negative binomial model (see Cameron and Trivedi (1998) and Dupont (2002) for more information).