*Ordinal logistic regression* is a type of *multonomial logistic regression* used to model *ordinal* outcome variables.

library(foreign) library(ggplot2) library(reshape2) library(MASS) library(Hmisc)

**Example 1.**A marketing research firm wants to investigate what factors influence the size of soda (small, medium, large or extra large) that people order at a fast-food chain. These factors may include what type of sandwich is ordered (burger or chicken), whether or not fries are also ordered, and age of the consumer. While the outcome variable, size of soda, is obviously ordered, the difference between the various sizes is not consistent. The differece between small and medium is 10 ounces, between medium and large 8, and between large and extra large 12.**Example 2.**A researcher is interested in what factors influence medaling in Olympic swimming. Relevant predictors include at training hours, diet, age, and popularity of swimming in the athlete's home country. The researcher believes that the distance between gold and silver is larger than the distance between silver and bronze.**Example 3.**A study looks at factors that influence the decision of whether to apply to graduate school. College juniors are asked if they are unlikely, somewhat likely, or very likely to apply to graduate school. Hence, our outcome variable has three categories. Data on parental educational status, whether the undergraduate institution is public or private, and current GPA is also collected. The researchers have reason to believe that the "distances" between these three points are not equal. For example, the "distance" between "unlikely" and "somewhat likely" may be shorter than the distance between "somewhat likely" and "very likely".

For our data analysis below, we are going to expand on Example 3 about applying to graduate school. We have simulated some data for this example and it can be obtained from the website:

dat <- read.dta("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/ologit.dta") head(dat)

## apply pared public gpa ## 1 very likely 0 0 3.26 ## 2 somewhat likely 1 0 3.21 ## 3 unlikely 1 1 3.94 ## 4 somewhat likely 0 0 2.81 ## 5 somewhat likely 0 0 2.53 ## 6 unlikely 0 1 2.59

`apply`

, with levels "unlikely", "somewhat likely", and "very likely", coded 1, 2, and 3, respectively, that we will use as our outcome variable. We also have three variables that we will use as predictors: `pared`

, which is a 0/1 variable indicating whether at least one parent has a graduate degree; `public`

, which is a 0/1 variable where 1 indicates that the undergraduate institution is public and 0 private, and `gpa`

, which is the student's grade point average. Let's start with the descriptive statistics of these variables.
## one at a time, table apply, pared, and public lapply(dat[, c("apply", "pared", "public")], table)

## $apply ## ## unlikely somewhat likely very likely ## 220 140 40 ## ## $pared ## ## 0 1 ## 337 63 ## ## $public ## ## 0 1 ## 343 57

## three way cross tabs (xtabs) and flatten the table ftable(xtabs(~public + apply + pared, data = dat))

## pared 0 1 ## public apply ## 0 unlikely 175 14 ## somewhat likely 98 26 ## very likely 20 10 ## 1 unlikely 25 6 ## somewhat likely 12 4 ## very likely 7 3

summary(dat$gpa)

## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 1.900 2.720 2.990 2.999 3.270 4.000

sd(dat$gpa)

## [1] 0.3979409

We can also examine the distribution of `gpa`

at every level of `apply`

and broken down by `public`

and `pared`

. This creates a 2 x 2 grid with a boxplot of `gpa`

for every level of `apply`

, for particular values of `pared`

and `public`

. To better see the data, we also add the raw data points on top of the box plots, with a small amount of noise (often called "jitter") and 50% transparency so they do not overwhelm the boxplots. Finally, in addition to the cells, we plot all of the marginal relationships. The margins make the final plot a 3 x 3 grid. In the lower right hand corner, is the overall relationship between `apply`

and `gpa`

which appears slightly positive. To do this, we use the `ggplot2`

package.

ggplot(dat, aes(x = apply, y = gpa)) + geom_boxplot(size = .75) + geom_jitter(alpha = .5) + facet_grid(pared ~ public, margins = TRUE) + theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))

Below is a list of some analysis methods you may have encountered. Some of the methods listed are quite reasonable while others have either fallen out of favor or have limitations.

- Ordered logistic regression: the focus of this page.
- OLS regression: This analysis is problematic because the assumptions of OLS are violated when it is used with a non-interval outcome variable.
- ANOVA: If you use only one continuous predictor, you could "flip" the model around so that, say, gpa was the outcome variable and apply was the predictor variable. Then you could run a one-way ANOVA. This isn't a bad thing to do if you only have one predictor variable (from the logistic model), and it is continuous.
- Multinomial logistic regression: This is similar to doing ordered logistic regression, except that it is assumed that there is no order to the categories of the outcome variable (i.e., the categories are nominal). The downside of this approach is that the information contained in the ordering is lost.
- Ordered probit regression: This is very, very similar to running an ordered logistic regression. The main difference is in the interpretation of the coefficients.

Below we use the `polr`

command from the `MASS`

package to estimate an ordered logistic regression model. The command name comes from proportional odds logistic regression, highlighting the proportional odds assumption in our model. `polr`

uses the standard formula interface in `R`

for specifying a regression model with outcome followed by predictors. We also specify `Hess=TRUE`

to have the model return the observed information matrix from optimization (called the Hessian) which is used to get standard errors.

## fit ordered logit model and store results 'm' m <- polr(apply ~ pared + public + gpa, data = dat, Hess = TRUE) ## view a summary of the model summary(m)

## Call: ## polr(formula = apply ~ pared + public + gpa, data = dat, Hess = TRUE) ## ## Coefficients: ## Value Std. Error t value ## pared 1.04769 0.2658 3.9418 ## public -0.05879 0.2979 -0.1974 ## gpa 0.61594 0.2606 2.3632 ## ## Intercepts: ## Value Std. Error t value ## unlikely|somewhat likely 2.2039 0.7795 2.8272 ## somewhat likely|very likely 4.2994 0.8043 5.3453 ## ## Residual Deviance: 717.0249 ## AIC: 727.0249

- Call, this is R reminding us what type of model we ran, what options we specified, etc.
- Next we see the usual regression output coefficient table including the value of each coefficient, standard errors, and t value, which is simply the ratio of the coefficient to its standard error. There is no significance test by default.
- Next we see the estimates for the two intercepts, which are sometimes called cutpoints. The intercepts indicate where the latent variable is cut to make the three groups that we observe in our data. Note that this latent variable is continuous. In general, these are not used in the interpretation of the results. The cutpoints are closely related to thresholds, which are reported by other statistical packages.
- Finally, we see the residual deviance, -2 * Log Likelihood of the model as well as the AIC. Both the deviance and AIC are useful for model comparison.

Some people are not satisfied without a p value. One way to calculate a p-value in this case is by comparing the t-value against the standard normal distribution, like a z test. Of course this is only true with infinite degrees of freedom, but is reasonably approximated by large samples, becoming increasingly biased as sample size decreases. This approach is used in other software packages such as `Stata`

and is trivial to do. First we store the coefficient table, then calculate the pvalues and combine back with the table.

## store table (ctable <- coef(summary(m)))

## Value Std. Error t value ## pared 1.04769010 0.2657894 3.9418050 ## public -0.05878572 0.2978614 -0.1973593 ## gpa 0.61594057 0.2606340 2.3632399 ## unlikely|somewhat likely 2.20391473 0.7795455 2.8271792 ## somewhat likely|very likely 4.29936315 0.8043267 5.3452947

## calculate and store p values p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2 ## combined table (ctable <- cbind(ctable, `p value` = p))

## Value Std. Error t value p value ## pared 1.04769010 0.2657894 3.9418050 8.087072e-05 ## public -0.05878572 0.2978614 -0.1973593 8.435464e-01 ## gpa 0.61594057 0.2606340 2.3632399 1.811594e-02 ## unlikely|somewhat likely 2.20391473 0.7795455 2.8271792 4.696004e-03 ## somewhat likely|very likely 4.29936315 0.8043267 5.3452947 9.027008e-08

(ci <- confint(m)) # default method gives profiled CIs

## 2.5 % 97.5 % ## pared 0.5281768 1.5721750 ## public -0.6522060 0.5191384 ## gpa 0.1076202 1.1309148

confint.default(m) # CIs assuming normality

## 2.5 % 97.5 % ## pared 0.5267524 1.5686278 ## public -0.6425833 0.5250119 ## gpa 0.1051074 1.1267737

`pared`

and `gpa`

do not include 0; `public`

does. The estimates in the output are given in units of ordered logits, or ordered log odds. So for `pared`

, we would say that for a one unit increase in `pared`

(i.e., going from 0 to 1), we expect a 1.05 increase in the expect value of `apply`

on the log odds scale, given all of the other variables in the model are held constant. For `gpa`

, we would say that for a one unit increase in `gpa`

, we would expect a 0.62 increase in the expected value of `apply`

in the log odds scale, given that all of the other variables in the model are held constant.
The coefficients from the model can be somewhat difficult to interpret because they are scaled in terms of logs. Another way to interpret logistic regression models is to convert the coefficients into odds ratios. To get the OR and confidence intervals, we just exponentiate the estimates and confidence intervals.

## odds ratios exp(coef(m))

## pared public gpa ## 2.8510579 0.9429088 1.8513972

## OR and CI exp(cbind(OR = coef(m), ci))

## OR 2.5 % 97.5 % ## pared 2.8510579 1.6958376 4.817114 ## public 0.9429088 0.5208954 1.680579 ## gpa 1.8513972 1.1136247 3.098490

`pared`

, we would say that for a one unit increase in parental education, i.e., going from 0 (Low) to 1 (High), the odds of "very likely" applying versus "somewhat likely" or "unlikely" applying combined are 2.85 greater, given that all of the other variables in the model are held constant. Likewise, the odds "very likely" or "somewhat likely" applying versus "unlikely" applying is 2.85 times greater, given that all of the other variables in the model are held constant. For `gpa`

(and other continuous variables), the interpretation is that when a student's gpa moves 1 unit, the odds of moving from "unlikely" applying to "somewhat likely" or "very likley" applying (or from the lower and middle categories to the high category) are multiplied by 1.85.
One of the assumptions underlying ordinal logistic (and ordinal probit) regression is that the relationship between each pair of outcome groups is the same. In other words, ordinal logistic regression assumes that the coefficients that describe the relationship between, say, the lowest versus all higher categories of the response variable are the same as those that describe the relationship between the next lowest category and all higher categories, etc. This is called the proportional odds assumption or the parallel regression assumption. Because the relationship between all pairs of groups is the same, there is only one set of coefficients. If this was not the case, we would need different sets of coefficients in the model to describe the relationship between each pair of outcome groups. Thus, in order to asses the appropriateness of our model, we need to evaluate whether the proportional odds assumption is tenable. Statistical tests to do this are available in some software packages. However, these tests have been criticized for having a tendency to reject the null hypothesis (that the sets of coefficients are the same), and hence, indicate that there the parallel slopes assumption does not hold, in cases where the assumption does hold (see Harrell 2001 p. 335). We were unable to locate a facility in R to perform any of the tests commonly used to test the parallel slopes assumption. However, Harrell does recommend a graphical method for assessing the parallel slopes assumption. The values displayed in this graph are essentially (linear) predictions from a logit model, used to model the probability that y is greater than or equal to a given value (for each level of y), using one predictor (x) variable at a time. In order create this graph, you will need the `Hmisc`

library.

The code below contains two commands (the first command falls on multiple lines) and is used to create this graph to test the proportional odds assumption. Basically, we will graph predicted logits from individual logistic regressions with a single predictor where the outcome groups are defined by either `apply`

>= 2 and `apply`

>= 3. If the difference between predicted logits for varying levels of a predictor, say `pared`

, are the same whether the outcome is defined by `apply`

>= 2 or `apply`

>=3, then we can be confident that the proportional odds assumption holds. In other words, if the difference between logits for `pared`

= 0 and `pared`

= 1 is the same when the outcome is `apply`

>= 2 as the difference when the outcome is `apply`

>= 3, then the proportional odds assumption likely holds.

The first command creates the function that estimates the values that will be graphed. The first line of this command tells `R`

that `sf`

is a function, and that this function takes one argument, which we label y. The `sf`

function will calculate the log odds of being greater than or equal to each value of the target variable. For our purposes, we would like the log odds of apply being greater than or equal to 2, and then greater than or equal to 3. Depending on the number of categories in your dependent variable, and the coding of your variables, you may have to edit this function. Below the function is configured for a y variable with three levels, 1, 2, 3. If your dependent variable has 4 levels, labeled 1, 2, 3, 4 you would need to add `'Y>=4'=qlogis(mean(y >= 4))`

(minus the quotation marks) inside the first set of parentheses. If your dependent variable were coded 0, 1, 2 instead of 1, 2, 3, you would need to edit the code, replacing each instance of 1 with 0, 2 with 1, and so on. Inside the `sf`

function we find the `qlogis`

function, which transforms a probability to a logit. So, we will basically feed probabilities of apply being greater than 2 or 3 to qlogis, and it will return the logit transformations of these probabilites. Inside the qlogis function we see that we want the log odds of the mean of `y >= 2`

. When we supply a y argument, such as apply, to function sf, `y >= 2`

will evaluate to a 0/1 (FALSE/TRUE) vector, and taking the mean of that vector will give you the proportion of or probability that `apply`

>= 2.

The second command below calls the function sf on several subsets of the data defined by the predictors. In this statement we see the `summary`

function with a formula supplied as the first argument. When `R`

sees a call to `summary`

with a formula argument, it will calculate descriptive statistics for the variable on the left side of the formula by groups on the right side of the formula and will return the results in a nice table. By default, `summary`

will calculate the mean of the left side variable. So, if we had used the code `summary(as.numeric(apply) ~ pared + public + gpa)`

without the `fun`

argument, we would get means on apply by `pared`

, then by `public`

, and finally by `gpa`

broken up into 4 equal groups. However, we can override calculation of the mean by supplying our own function, namely `sf`

to the `fun=`

argument. The final command asks `R`

to return the contents to the object `s`

, which is a table.

sf <- function(y) { c(`Y>=1` = qlogis(mean(y >= 1)), `Y>=2` = qlogis(mean(y >= 2)), `Y>=3` = qlogis(mean(y >= 3))) } (s <- with(dat, summary(as.numeric(apply) ~ pared + public + gpa, fun = sf)))

## as.numeric(apply) N=400 ## ## +-------+-----------+---+----+-----------+---------+ ## | | |N |Y>=1|Y>=2 |Y>=3 | ## +-------+-----------+---+----+-----------+---------+ ## |pared |No |337|Inf |-0.37833644|-2.440735| ## | |Yes | 63|Inf | 0.76546784|-1.347074| ## +-------+-----------+---+----+-----------+---------+ ## |public |No |343|Inf |-0.20479441|-2.345006| ## | |Yes | 57|Inf |-0.17589067|-1.547563| ## +-------+-----------+---+----+-----------+---------+ ## |gpa |[1.90,2.73)|102|Inf |-0.39730180|-2.772589| ## | |[2.73,3.00)| 99|Inf |-0.26415158|-2.302585| ## | |[3.00,3.28)|100|Inf |-0.20067070|-2.090741| ## | |[3.28,4.00]| 99|Inf | 0.06062462|-1.803594| ## +-------+-----------+---+----+-----------+---------+ ## |Overall| |400|Inf |-0.20067070|-2.197225| ## +-------+-----------+---+----+-----------+---------+

`apply`

) is less than some value a, and 1 if the ordinal variable is greater than or equal to a (note, this is what the ordinal regression model coefficients represent as well). This is done for k-1 levels of the ordinal variable and is executed by the `as.numeric(apply) >= a`

coding below. The first line of code estimates the effect of `pared`

on choosing "unlikely" applying versus "somewhat likely" or "very likely". The second line of code estimates the effect of `pared`

on choosing "unlikely" or "somewhat likely" applying versus "very likely" applying. Looking at the intercept for this model (-0.3783), we see that it matches the predicted value in the cell for `pared`

equal to "no" in the column for Y>=1, the value below it, for `pared`

equals "yes" is equal to the intercept plus the coefficient for `pared`

(i.e. -0.3783 + 1.1438 = 0.765).
glm(I(as.numeric(apply) >= 2) ~ pared, family = "binomial", data = dat)

## ## Call: glm(formula = I(as.numeric(apply) >= 2) ~ pared, family = "binomial", ## data = dat) ## ## Coefficients: ## (Intercept) pared ## -0.3783 1.1438 ## ## Degrees of Freedom: 399 Total (i.e. Null); 398 Residual ## Null Deviance: 550.5 ## Residual Deviance: 534.1 AIC: 538.1

glm(I(as.numeric(apply) >= 3) ~ pared, family = "binomial", data = dat)

## ## Call: glm(formula = I(as.numeric(apply) >= 3) ~ pared, family = "binomial", ## data = dat) ## ## Coefficients: ## (Intercept) pared ## -2.441 1.094 ## ## Degrees of Freedom: 399 Total (i.e. Null); 398 Residual ## Null Deviance: 260.1 ## Residual Deviance: 252.2 AIC: 256.2

`pared`

is equal to "no" the difference between the predicted value for apply greater than or equal to one and apply greater than or equal to two is roughly 2 (-0.378 - -2.440 = 2.062). For `pared`

equal to "yes" the difference in predicted values for apply greater than or equal to one and apply greater than or equal to 2 is similar (0.765 - -1.347 = 2.112). This suggests that the parallel slopes assumption is reasonable (these differences are what graph below are plotting). Turning our attention to the predictions with `public`

as a predictor variable, we see that when public is set to "no" the difference in predictions for apply greater than or equal to one, versus apply greater than or equal to two is about 2.14 (-0.204 - -2.345 = 2.141). When `public`

is set to "yes" the difference between the coefficients is about 1.37 (-0.175 - -1.547 = 1.372). The differences in the distance between the two sets of coefficients (2.14 vs. 1.37) may suggest that the parallel slopes assumption does not hold for the predictor public. That would indicate that the effect of attending a public versus private school is different for the transition from "unlikely" to "somewhat likely" and "somewhat likely" to "very likely."
The `plot`

command below tells `R`

that the object we wish to plot is `s`

. The command `which=1:3`

is a list of values indicating levels of y should be included in the plot. If your dependent variable had more than three levels you would need to change the 3 to the number of categories (e.g. 4 for a four category variable, even if it is numbered 0, 1, 2, 3). The command `pch=1:3`

selects the markers to use, and is optional, as are `xlab='logit'`

which labels the x-axis, and `main=' '`

which sets the main label for the graph to blank. If the proportional odds assumption holds, for each predictor variable, distance between the symbols for each set of categories of the dependent variable, should remain similar. To help demonstrate this, we normalized all the first set of coefficients to be zero so there is a common reference point. Looking at the coefficients for the variable `pared`

we see that the distance between the two sets of coefficients is similar. In contrast, the distances between the estimates for `public`

are different (i.e. the markers are much further apart on the second line than on the first), suggesting that the proportional odds assumption may not hold.

s[, 4] <- s[, 4] - s[, 3] s[, 3] <- s[, 3] - s[, 3] s # print

## as.numeric(apply) N=400 ## ## +-------+-----------+---+----+----+---------+ ## | | |N |Y>=1|Y>=2|Y>=3 | ## +-------+-----------+---+----+----+---------+ ## |pared |No |337|Inf |0 |-2.062399| ## | |Yes | 63|Inf |0 |-2.112541| ## +-------+-----------+---+----+----+---------+ ## |public |No |343|Inf |0 |-2.140211| ## | |Yes | 57|Inf |0 |-1.371672| ## +-------+-----------+---+----+----+---------+ ## |gpa |[1.90,2.73)|102|Inf |0 |-2.375287| ## | |[2.73,3.00)| 99|Inf |0 |-2.038434| ## | |[3.00,3.28)|100|Inf |0 |-1.890070| ## | |[3.28,4.00]| 99|Inf |0 |-1.864219| ## +-------+-----------+---+----+----+---------+ ## |Overall| |400|Inf |0 |-1.996554| ## +-------+-----------+---+----+----+---------+

plot(s, which = 1:3, pch = 1:3, xlab = "logit", main = " ", xlim = range(s[, 3:4]))

`gpa`

for each level of `pared`

and `public`

and calculate the probability of being in each category of apply. We do this by creating a new dataset of all the values to use for prediction.
newdat <- data.frame(pared = rep(0:1, 200), public = rep(0:1, each = 200), gpa = rep(seq(from = 1.9, to = 4, length.out = 100), 4)) newdat <- cbind(newdat, predict(m, newdat, type = "probs")) ## show first few rows head(newdat)

## pared public gpa unlikely somewhat likely very likely ## 1 0 0 1.900000 0.7376186 0.2204577 0.04192370 ## 2 1 0 1.921212 0.4932185 0.3945673 0.11221424 ## 3 0 0 1.942424 0.7325300 0.2244841 0.04298593 ## 4 1 0 1.963636 0.4866885 0.3984676 0.11484395 ## 5 0 0 1.984848 0.7273792 0.2285470 0.04407383 ## 6 1 0 2.006061 0.4801630 0.4023098 0.11752712

`reshape2`

package and plot all of the predicted probabilities for the different conditions. We plot the predicted probilities, connected with a line, coloured by level of the outcome, `apply`

, and facetted by level of `pared`

and `public`

. We also use a custom label function, to add clearer labels showing what each column and row of the plot represent.
lnewdat <- melt(newdat, id.vars = c("pared", "public", "gpa"), variable.name = "Level", value.name = "Probability") ## view first few rows head(lnewdat)

## pared public gpa Level Probability ## 1 0 0 1.900000 unlikely 0.7376186 ## 2 1 0 1.921212 unlikely 0.4932185 ## 3 0 0 1.942424 unlikely 0.7325300 ## 4 1 0 1.963636 unlikely 0.4866885 ## 5 0 0 1.984848 unlikely 0.7273792 ## 6 1 0 2.006061 unlikely 0.4801630

ggplot(lnewdat, aes(x = gpa, y = Probability, colour = Level)) + geom_line() + facet_grid(pared ~ public, scales = "free", labeller = function(x, y) sprintf("%s = %d", x, y))

- Perfect prediction: Perfect prediction means that one value of a predictor variable is associated with only one value of the response variable. If this happens, Stata will usually issue a note at the top of the output and will drop the cases so that the model can run.
- Sample size: Both ordered logistic and ordered probit, using maximum likelihood estimates, require sufficient sample size. How big is big is a topic of some debate, but they almost always require more cases than OLS regression.
- Empty cells or small cells: You should check for empty or small cells by doing a crosstab between categorical predictors and the outcome variable. If a cell has very few cases, the model may become unstable or it might not run at all.
- Pseudo-R-squared: There is no exact analog of the R-squared found in OLS. There are many versions of pseudo-R-squares. Please see Long and Freese 2005 for more details and explanations of various pseudo-R-squares.
- Diagnostics: Doing diagnostics for non-linear models is difficult, and ordered logit/probit models are even more difficult than binary models.

- UCLA: IDRE (Institute for Digital Research and Education). Data Analysis Examples. from http://www.ats.ucla.edu/stat/dae/ (accessed January 31, 2014)
- Agresti, A. (2007). An Introduction to Categorical Data Analysis. Wiley-Interscience.
- Agresti, A. (2002). Categorical Data Analysis, Second Edition. Hoboken, New Jersey: John Wiley & Sons, Inc.
- Liao, T. F. (1994). Interpreting Probability Models: Logit, Probit, and Other Generalized Linear Models. Thousand Oaks, CA: Sage Publications, Inc.
- Powers, D. and Xie, Yu. (2008). Statistical Methods for Categorical Data Analysis, Second Edition. Bingley, UK: Emerald Group Publishing Limited.