Poisson regression is used to model *count* variables.

library(sandwich) library(msm) library(ggplot2)

**Example 1.**The number of persons killed by mule or horse kicks in the Prussian army per year. Ladislaus Bortkiewicz collected data from 20 volumes of*Preussischen Statistik*. These data were collected on 10 corps of the Prussian army in the late 1800s over the course of 20 years.**Example 2.**The number of people in line in front of you at the grocery store. Predictors may include the number of items currently offered at a special discounted price and whether a special event (e.g., a holiday, a big sporting event) is three or fewer days away.**Example 3.**The number of awards earned by students at one high school. Predictors of the number of awards earned include the type of program in which the student was enrolled (e.g., vocational, general or academic) and the score on their final exam in math.

The most widely used model for count data analysis is Poisson regression. This assumes that \(Y_i\), given the vector of covariates \({\bf x}_i\), is *independently Poisson distributed* with
\[
\mathsf{P}[Y_i=y_i|{\bf x}_i]=\frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!},\quad y_i=0,1,2,\ldots
\]
and the mean parameter - that is, the mean number of events per period - is given by
\[
\mu_i=\exp\{{\bf x}_i^{\top}{\boldsymbol\beta}\},
\]
where \({\boldsymbol\beta}\) is a \((k+1)\times 1\) parameter vector. (The intercept is \(\beta_0\); the coefficients for the \(k\) regressors are \(\beta_1,\ldots,\beta_k\).) Taking the exponential of ensures that the mean parameter \(\mu_i\) is nonnegative. It can be shown that the conditional mean is given by
\[
\mathsf{E}[Y_i|{\bf x}_i]=\mu_i=\exp\{{\bf x}_i^{\top}{\boldsymbol\beta}\}.
\]
The name log-linear model is also used for the Poisson regression model since the logarithm of the conditional mean is linear in the parameters:
\[
\log\mathsf{E}[Y_i|{\bf x}_i]=\log\mu_i={\bf x}_i^{\top}{\boldsymbol\beta}.
\]
Note that the conditional variance of the count random variable is equal to the conditional mean in the Poisson regression model:
\[
\mathsf{Var}[Y_i|{\bf x}_i]=\mathsf{E}[Y_i|{\bf x}_i]=\mu_i.
\]
The equality of the conditional mean and variance of \(Y_i\) is known as *equidispersion*.

The marginal effect of a regressor is given by \[ \frac{\partial \mathsf{E}[Y_i|{\bf x}_i]}{\partial x_{ji}}=\exp\{{\bf x}_i^{\top}{\boldsymbol\beta}\}\beta_j=\mathsf{E}[Y_i|{\bf x}_i]\beta_j. \] Thus, a one-unit change in the \(j\)-th regressor leads to a proportional change in the conditional mean \(\mathsf{E}[Y_i|{\bf x}_i]\) of \(\beta_j\).

For the purpose of illustration, we have simulated a data set for Example 3 above. In this example, `num_awards`

is the outcome variable and indicates the number of awards earned by students at a high school in a year, `math`

is a continuous predictor variable and represents students' scores on their math final exam, and `prog`

is a categorical predictor variable with three levels indicating the type of program in which the students were enrolled. It is coded as 1 = "General", 2 = "Academic" and 3 = "Vocational". Let's start with loading the data and looking at some descriptive statistics.

p <- read.csv("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/poisson_sim.csv") p <- within(p, { prog <- factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational")) id <- factor(id) }) summary(p)

## id num_awards prog math ## 1 : 1 Min. :0.00 General : 45 Min. :33.00 ## 2 : 1 1st Qu.:0.00 Academic :105 1st Qu.:45.00 ## 3 : 1 Median :0.00 Vocational: 50 Median :52.00 ## 4 : 1 Mean :0.63 Mean :52.65 ## 5 : 1 3rd Qu.:1.00 3rd Qu.:59.00 ## 6 : 1 Max. :6.00 Max. :75.00 ## (Other):194

We can use the `tapply`

function to display the summary statistics by program type. The table below shows the average numbers of awards by program type and seems to suggest that program type is a good candidate for predicting the number of awards, our outcome variable, because the mean value of the outcome appears to vary by `prog`

. Additionally, the means and variances within each level of `prog`

-the *conditional* means and variances-are similar. A conditional histogram separated out by program type is plotted to show the distribution.

with(p, tapply(num_awards, prog, function(x) { sprintf("M (SD) = %1.2f (%1.2f)", mean(x), sd(x)) }))

## General Academic Vocational ## "M (SD) = 0.20 (0.40)" "M (SD) = 1.00 (1.28)" "M (SD) = 0.24 (0.52)"

ggplot(p, aes(num_awards, fill = prog)) + geom_histogram(binwidth = 0.5, position = "dodge")

Below is a list of some analysis methods you may have encountered. Some of the methods listed are quite reasonable, while others have either fallen out of favor or have limitations.

*Poisson regression*- Poisson regression is often used for modeling count data. Poisson regression has a number of extensions useful for count models.*Negative binomial regression*- Negative binomial regression can be used for over-dispersed count data, that is when the conditional variance exceeds the conditional mean. It can be considered as a generalization of Poisson regression since it has the same mean structure as Poisson regression and it has an extra parameter to model the over-dispersion. If the conditional distribution of the outcome variable is over-dispersed, the confidence intervals for Negative binomial regression are likely to be narrower as compared to those from a Poisson regression.*Zero-inflated regression*model - Zero-inflated models attempt to account for excess zeros. In other words, two kinds of zeros are thought to exist in the data, "true zeros" and "excess zeros". Zero-inflated models estimate two equations simultaneously, one for the count model and one for the excess zeros.*OLS regression*- Count outcome variables are sometimes log-transformed and analyzed using OLS regression. Many issues arise with this approach, including loss of data due to undefined values generated by taking the log of zero (which is undefined) and biased estimates.

At this point, we are ready to perform our Poisson model analysis using the `glm`

function. We fit the model and store it in the object `m1`

and get a summary of the model at the same time.

summary(m1 <- glm(num_awards ~ prog + math, family = "poisson", data = p))

## ## Call: ## glm(formula = num_awards ~ prog + math, family = "poisson", data = p) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.2043 -0.8436 -0.5106 0.2558 2.6796 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -5.24712 0.65845 -7.969 1.60e-15 *** ## progAcademic 1.08386 0.35825 3.025 0.00248 ** ## progVocational 0.36981 0.44107 0.838 0.40179 ## math 0.07015 0.01060 6.619 3.63e-11 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for poisson family taken to be 1) ## ## Null deviance: 287.67 on 199 degrees of freedom ## Residual deviance: 189.45 on 196 degrees of freedom ## AIC: 373.5 ## ## Number of Fisher Scoring iterations: 6

`sandwich`

below to obtain the robust standard errors and calculated the p-values accordingly. Together with the p-values, we have also calculated the 95% confidence interval using the parameter estimates and their robust standard errors.
cov.m1 <- vcovHC(m1, type = "HC0") std.err <- sqrt(diag(cov.m1)) r.est <- cbind(Estimate = coef(m1), `Robust SE` = std.err, `Pr(>|z|)` = 2 * pnorm(abs(coef(m1)/std.err), lower.tail = FALSE), LL = coef(m1) - 1.96 * std.err, UL = coef(m1) + 1.96 * std.err) r.est

## Estimate Robust SE Pr(>|z|) LL UL ## (Intercept) -5.2471244 0.64599839 4.566630e-16 -6.51328124 -3.98096756 ## progAcademic 1.0838591 0.32104816 7.354745e-04 0.45460476 1.71311353 ## progVocational 0.3698092 0.40041731 3.557157e-01 -0.41500870 1.15462716 ## math 0.0701524 0.01043516 1.783975e-11 0.04969947 0.09060532

`glm`

more closely.
- The output begins with echoing the function call. The information on deviance residuals is displayed next. Deviance residuals are approximately normally distributed if the model is specified correctly. In our example, it shows a little bit of skeweness since median is not quite zero.
- Next come the Poisson regression coefficients for each of the variables along with the standard errors, z-scores, p-values and 95% confidence intervals for the coefficients. The coefficient for
`math`

is .07. This means that the expected log count for a one-unit increase in math is .07. The indicator variable`progAcademic`

compares between`prog`

= "Academic" and`prog`

= "General", the expected log count for`prog`

= "Academic" increases by about 1.1. The indicator variable`prog.Vocational`

is the expected difference in log count (approx .37) between`prog`

= "Vocational" and the reference group (`prog`

= "General"). - The information on deviance is also provided. We can use the residual deviance to perform a goodness of fit test for the overall model. The residual deviance is the difference between the deviance of the current model and the maximum deviance of the ideal model where the predicted values are identical to the observed. Therefore, if the residual difference is small enough, the goodness of fit test will not be significant, indicating that the model fits the data. We conclude that the model fits reasonably well because the goodness-of-fit chi-squared test is not statistically significant. If the test had been statistically significant, it would indicate that the data do not fit the model well. In that situation, we may try to determine if there are omitted predictor variables, if our linearity assumption holds and/or if there is an issue of over-dispersion.
We can also test the overall effect of
with(m1, cbind(res.deviance = deviance, df = df.residual, p = pchisq(deviance, df.residual, lower.tail = FALSE)))

## res.deviance df p ## [1,] 189.4496 196 0.6182274

`prog`

by comparing the deviance of the full model with the deviance of the model excluding`prog`

. The two degree-of-freedom chi-square test indicates that`prog`

, taken together, is a statistically significant predictor of`num_awards`

.Sometimes, we might want to present the regression results as## update m1 model dropping prog m2 <- update(m1, . ~ . - prog) ## test model differences with chi square test anova(m2, m1, test = "Chisq")

## Analysis of Deviance Table ## ## Model 1: num_awards ~ math ## Model 2: num_awards ~ prog + math ## Resid. Df Resid. Dev Df Deviance Pr(>Chi)

## 1 198 204.02

## 2 196 189.45 2 14.572 0.0006852 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1*incident rate ratios*and their standard errors, together with the confidence interval. To compute the standard error for the incident rate ratios, we will use the Delta method. To this end, we make use the function`deltamethod`

implemented in R package`msm`

.The output above indicates that the incident rate fors <- deltamethod(list(~exp(x1), ~exp(x2), ~exp(x3), ~exp(x4)), coef(m1), cov.m1) ## exponentiate old estimates dropping the p values rexp.est <- exp(r.est[, -3]) ## replace SEs with estimates for exponentiated coefficients rexp.est[, "Robust SE"] <- s rexp.est

## Estimate Robust SE LL UL ## (Intercept) 0.00526263 0.00339965 0.001483604 0.01866757 ## progAcademic 2.95606545 0.94903937 1.575550534 5.54620292 ## progVocational 1.44745846 0.57958742 0.660334536 3.17284023 ## math 1.07267164 0.01119351 1.050955210 1.09483681

`prog`

= "Academic" is 2.96 times the incident rate for the reference group (`prog`

= "General"). Likewise, the incident rate for`prog`

= "Vocational" is 1.45 times the incident rate for the reference group holding the other variables at constant. The percent change in the incident rate of`num_awards`

is by 7% for every unit increase in`math`

. For additional information on the various metrics in which the results can be presented, and the interpretation of such, please see*Regression Models for Categorical Dependent Variables Using Stata, Second Edition*by J. Scott Long and Jeremy Freese (2006).

Sometimes, we might want to look at the expected marginal means. For example, what are the expected counts for each program type holding math score at its overall mean? To answer this question, we can make use of the `predict`

function. First off, we will make a small data set to apply the `predict`

function to it.

(s1 <- data.frame(math = mean(p$math), prog = factor(1:3, levels = 1:3, labels = levels(p$prog))))

## math prog ## 1 52.645 General ## 2 52.645 Academic ## 3 52.645 Vocational

predict(m1, s1, type = "response", se.fit = TRUE)

## $fit ## 1 2 3 ## 0.2114109 0.6249446 0.3060086 ## ## $se.fit ## 1 2 3 ## 0.07050108 0.08628117 0.08833706 ## ## $residual.scale ## [1] 1

`prog`

is about .21, holding `math`

at its mean. The predicted number of events for level 2 of `prog`

is higher at .62, and the predicted number of events for level 3 of `prog`

is about .31. The ratios of these predicted counts (.625./211=2.96, .306/.211=1.45) match what we saw looking at the IRR.
We can also graph the predicted number of events with the commands below. The graph indicates that the most awards are predicted for those in the academic program (`prog`

= 2), especially if the student has a high math score. The lowest number of predicted awards is for those students in the general program (`prog`

= 1). The graph overlays the lines of expected values onto the actual points, although a small amount of random noise was added vertically to lessen overplotting.

## calculate and store predicted values p$phat <- predict(m1, type = "response") ## order by program and then by math p <- p[with(p, order(prog, math)), ]

## create the plot ggplot(p, aes(x = math, y = phat, colour = prog)) + geom_point(aes(y = num_awards), alpha = 0.5, position = position_jitter(h = 0.2)) + geom_line(size = 1) + labs(x = "Math Score", y = "Expected number of awards")

- When there seems to be an issue of dispersion, we should first check if our model is appropriately specified, such as omitted variables and functional forms. For example, if we omitted the predictor variable
`prog`

in the example above, our model would seem to have a problem with over-dispersion. In other words, a misspecified model could present a symptom like an over-dispersion problem. - Assuming that the model is correctly specified, the assumption that the conditional variance is equal to the conditional mean should be checked. There are several tests including the likelihood ratio test of over-dispersion parameter alpha by running the same model using negative binomial distribution.
`R`

package`pscl`

(Political Science Computational Laboratory, Stanford University) provides many functions for binomial and count data including`odTest`

for testing over-dispersion. - One common cause of over-dispersion is excess zeros, which in turn are generated by an additional data generating process. In this situation,
*zero-inflated*model should be considered. - If the data generating process does not allow for any 0s (such as the number of days spent in the hospital), then a
*zero-truncated*model may be more appropriate. - Count data often have an exposure variable, which indicates the number of times the event could have happened. This variable should be incorporated into a Poisson model with the use of the
`offset`

option. - The outcome variable in a Poisson regression cannot have negative numbers, and the exposure cannot have 0s.
- Many different measures of pseudo-R-squared exist. They all attempt to provide information similar to that provided by R-squared in OLS regression, even though none of them can be interpreted exactly as R-squared in OLS regression is interpreted. For a discussion of various pseudo-R-squares, see Long and Freese (2006).
- Poisson regression is estimated via maximum likelihood estimation. It usually requires a large sample size.
- Other link functions should be tried, i.e., the
*identity link function*or the*square root link function*.

- UCLA: IDRE (Institute for Digital Research and Education). Data Analysis Examples. from http://www.ats.ucla.edu/stat/dae/ (accessed January 31, 2014)
- SAS 9.3 (2013). PROC COUNTREG help page. SAS Institute, Cary NC.
- Cameron, A. C. and Trivedi, P. K. (2009). Microeconometrics Using Stata. College Station, TX: Stata Press.
- Cameron, A. C. and Trivedi, P. K. (1998). Regression Analysis of Count Data. New York: Cambridge Press.
- Dupont, W. D. (2002). Statistical Modeling for Biomedical Researchers: A Simple Introduction to the Analysis of Complex Data. New York: Cambridge Press.
- Long, J. S. and Freese, J. (2006). Regression Models for Categorical Dependent Variables Using Stata, Second Edition. College Station, TX: Stata Press.
- Long, J. S. (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.