Poisson regression

Poisson regression is used to model count variables.

library(sandwich)
library(msm)
library(ggplot2)

Examples of Poisson regression

Theoretical background

The most widely used model for count data analysis is Poisson regression. This assumes that \(Y_i\), given the vector of covariates \({\bf x}_i\), is independently Poisson distributed with \[ \mathsf{P}[Y_i=y_i|{\bf x}_i]=\frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!},\quad y_i=0,1,2,\ldots \] and the mean parameter - that is, the mean number of events per period - is given by \[ \mu_i=\exp\{{\bf x}_i^{\top}{\boldsymbol\beta}\}, \] where \({\boldsymbol\beta}\) is a \((k+1)\times 1\) parameter vector. (The intercept is \(\beta_0\); the coefficients for the \(k\) regressors are \(\beta_1,\ldots,\beta_k\).) Taking the exponential of ensures that the mean parameter \(\mu_i\) is nonnegative. It can be shown that the conditional mean is given by \[ \mathsf{E}[Y_i|{\bf x}_i]=\mu_i=\exp\{{\bf x}_i^{\top}{\boldsymbol\beta}\}. \] The name log-linear model is also used for the Poisson regression model since the logarithm of the conditional mean is linear in the parameters: \[ \log\mathsf{E}[Y_i|{\bf x}_i]=\log\mu_i={\bf x}_i^{\top}{\boldsymbol\beta}. \] Note that the conditional variance of the count random variable is equal to the conditional mean in the Poisson regression model: \[ \mathsf{Var}[Y_i|{\bf x}_i]=\mathsf{E}[Y_i|{\bf x}_i]=\mu_i. \] The equality of the conditional mean and variance of \(Y_i\) is known as equidispersion.

The marginal effect of a regressor is given by \[ \frac{\partial \mathsf{E}[Y_i|{\bf x}_i]}{\partial x_{ji}}=\exp\{{\bf x}_i^{\top}{\boldsymbol\beta}\}\beta_j=\mathsf{E}[Y_i|{\bf x}_i]\beta_j. \] Thus, a one-unit change in the \(j\)-th regressor leads to a proportional change in the conditional mean \(\mathsf{E}[Y_i|{\bf x}_i]\) of \(\beta_j\).

Number of awards data

For the purpose of illustration, we have simulated a data set for Example 3 above. In this example, num_awards is the outcome variable and indicates the number of awards earned by students at a high school in a year, math is a continuous predictor variable and represents students' scores on their math final exam, and prog is a categorical predictor variable with three levels indicating the type of program in which the students were enrolled. It is coded as 1 = "General", 2 = "Academic" and 3 = "Vocational". Let's start with loading the data and looking at some descriptive statistics.

p <- read.csv("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/poisson_sim.csv")
p <- within(p, {
    prog <- factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational"))
    id <- factor(id)
})
summary(p)
##        id        num_awards           prog          math      
##  1      :  1   Min.   :0.00   General   : 45   Min.   :33.00  
##  2      :  1   1st Qu.:0.00   Academic  :105   1st Qu.:45.00  
##  3      :  1   Median :0.00   Vocational: 50   Median :52.00  
##  4      :  1   Mean   :0.63                    Mean   :52.65  
##  5      :  1   3rd Qu.:1.00                    3rd Qu.:59.00  
##  6      :  1   Max.   :6.00                    Max.   :75.00  
##  (Other):194

Each variable has 200 valid observations and their distributions seem quite reasonable. The unconditional mean and variance of our outcome variable are not extremely different. Our model assumes that these values, conditioned on the predictor variables, will be equal (or at least roughly so).

We can use the tapply function to display the summary statistics by program type. The table below shows the average numbers of awards by program type and seems to suggest that program type is a good candidate for predicting the number of awards, our outcome variable, because the mean value of the outcome appears to vary by prog. Additionally, the means and variances within each level of prog-the conditional means and variances-are similar. A conditional histogram separated out by program type is plotted to show the distribution.

with(p, tapply(num_awards, prog, function(x) {
    sprintf("M (SD) = %1.2f (%1.2f)", mean(x), sd(x))
}))
##                General               Academic             Vocational 
## "M (SD) = 0.20 (0.40)" "M (SD) = 1.00 (1.28)" "M (SD) = 0.24 (0.52)"

ggplot(p, aes(num_awards, fill = prog)) + geom_histogram(binwidth = 0.5, position = "dodge")

plot of chunk histcounts

Analysis methods you might consider

Below is a list of some analysis methods you may have encountered. Some of the methods listed are quite reasonable, while others have either fallen out of favor or have limitations.

Poisson regression model

At this point, we are ready to perform our Poisson model analysis using the glm function. We fit the model and store it in the object m1 and get a summary of the model at the same time.

summary(m1 <- glm(num_awards ~ prog + math, family = "poisson", data = p))
## 
## Call:
## glm(formula = num_awards ~ prog + math, family = "poisson", data = p)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2043  -0.8436  -0.5106   0.2558   2.6796  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -5.24712    0.65845  -7.969 1.60e-15 ***
## progAcademic    1.08386    0.35825   3.025  0.00248 ** 
## progVocational  0.36981    0.44107   0.838  0.40179    
## math            0.07015    0.01060   6.619 3.63e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 287.67  on 199  degrees of freedom
## Residual deviance: 189.45  on 196  degrees of freedom
## AIC: 373.5
## 
## Number of Fisher Scoring iterations: 6

Cameron and Trivedi (2009) recommended using robust standard errors for the parameter estimates to control for mild violation of the distribution assumption that the variance equals the mean. We use R package sandwich below to obtain the robust standard errors and calculated the p-values accordingly. Together with the p-values, we have also calculated the 95% confidence interval using the parameter estimates and their robust standard errors.

cov.m1 <- vcovHC(m1, type = "HC0")
std.err <- sqrt(diag(cov.m1))
r.est <- cbind(Estimate = coef(m1), `Robust SE` = std.err, `Pr(>|z|)` = 2 * 
    pnorm(abs(coef(m1)/std.err), lower.tail = FALSE), LL = coef(m1) - 1.96 * 
    std.err, UL = coef(m1) + 1.96 * std.err)

r.est
##                  Estimate  Robust SE     Pr(>|z|)          LL          UL
## (Intercept)    -5.2471244 0.64599839 4.566630e-16 -6.51328124 -3.98096756
## progAcademic    1.0838591 0.32104816 7.354745e-04  0.45460476  1.71311353
## progVocational  0.3698092 0.40041731 3.557157e-01 -0.41500870  1.15462716
## math            0.0701524 0.01043516 1.783975e-11  0.04969947  0.09060532

Now let's look at the output of function glm more closely.

Sometimes, we might want to look at the expected marginal means. For example, what are the expected counts for each program type holding math score at its overall mean? To answer this question, we can make use of the predict function. First off, we will make a small data set to apply the predict function to it.

(s1 <- data.frame(math = mean(p$math), prog = factor(1:3, levels = 1:3, labels = levels(p$prog))))
##     math       prog
## 1 52.645    General
## 2 52.645   Academic
## 3 52.645 Vocational
predict(m1, s1, type = "response", se.fit = TRUE)
## $fit
##         1         2         3 
## 0.2114109 0.6249446 0.3060086 
## 
## $se.fit
##          1          2          3 
## 0.07050108 0.08628117 0.08833706 
## 
## $residual.scale
## [1] 1

In the output above, we see that the predicted number of events for level 1 of prog is about .21, holding math at its mean. The predicted number of events for level 2 of prog is higher at .62, and the predicted number of events for level 3 of prog is about .31. The ratios of these predicted counts (.625./211=2.96, .306/.211=1.45) match what we saw looking at the IRR.

We can also graph the predicted number of events with the commands below. The graph indicates that the most awards are predicted for those in the academic program (prog = 2), especially if the student has a high math score. The lowest number of predicted awards is for those students in the general program (prog = 1). The graph overlays the lines of expected values onto the actual points, although a small amount of random noise was added vertically to lessen overplotting.

## calculate and store predicted values
p$phat <- predict(m1, type = "response")

## order by program and then by math
p <- p[with(p, order(prog, math)), ]

## create the plot
ggplot(p, aes(x = math, y = phat, colour = prog)) + geom_point(aes(y = num_awards), 
    alpha = 0.5, position = position_jitter(h = 0.2)) + geom_line(size = 1) + 
    labs(x = "Math Score", y = "Expected number of awards")

plot of chunk predictedcounts

Things to consider

References