Truncated regression is used to model dependent variables for which some of the observations are not included in the analysis because of the value of the dependent variable.
require(foreign) library(truncreg)
library(boot) library(ggplot2)
Example 1. A study of students in a special GATE (gifted and talented education) program wishes to model achievement as a function of language skills and the type of program in which the student is currently enrolled. A major concern is that students are required to have a minimum achievement score of 40 to enter the special program. Thus, the sample is truncated at an achievement score of 40.
Example 2. A researcher has data for a sample of Americans whose income is above the poverty line. Hence, the lower part of the distribution of income is truncated. If the researcher had a sample of Americans whose income was at or below the poverty line, then the upper part of the income distribution would be truncated. In other words, truncation is a result of sampling only part of the distribution of the outcome variable.
Let's pursue Example 1 from above. We have a hypothetical data file, truncreg.dta
, with 178 observations. The outcome variable is called achiv
, and the language test score variable is called langscore
. The variable prog
is a categorical predictor variable with three levels indicating the type of program in which the students were enrolled. Let's look at the data. It is always a good idea to start with descriptive statistics.
dat <- read.dta("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/truncreg.dta") summary(dat)
## id achiv langscore prog ## Min. : 3.00 Min. :41.00 Min. :31.00 general : 40 ## 1st Qu.: 55.25 1st Qu.:47.00 1st Qu.:47.50 academic:101 ## Median :102.50 Median :52.00 Median :56.00 vocation: 37 ## Mean :103.62 Mean :54.24 Mean :54.01 ## 3rd Qu.:151.75 3rd Qu.:63.00 3rd Qu.:61.75 ## Max. :200.00 Max. :76.00 Max. :67.00
# histogram of achiv coloured by program type ggplot(dat, aes(achiv, fill = prog)) + geom_histogram(binwidth = 3)
# boxplot of achiv by program type ggplot(dat, aes(prog, achiv)) + geom_boxplot() + geom_jitter()
ggplot(dat, aes(x = langscore, y = achiv)) + geom_point() + stat_smooth(method = "loess") + facet_grid(. ~ prog, margins=TRUE)
Below is a list of some analysis methods you may have encountered. Some of the methods listed are quite reasonable, while others have either fallen out of favor or have limitations.
Below we use the truncreg
function in the truncreg
package to estimate a truncated regression model. The point
argument indicates where the data are truncated, and the direction indicates whether it is left or right truncated.
m <- truncreg(achiv ~ langscore + prog, data = dat, point = 40, direction = "left") summary(m)
## ## Call: ## truncreg(formula = achiv ~ langscore + prog, data = dat, point = 40, ## direction = "left") ## ## ## Coefficients : ## Estimate Std. Error t-value Pr(>|t|) ## (Intercept) 11.29942 6.77173 1.6686 0.09519 . ## langscore 0.71267 0.11446 6.2264 4.773e-10 *** ## progacademic 4.06267 2.05432 1.9776 0.04797 * ## progvocation -1.14422 2.66958 -0.4286 0.66821 ## sigma 8.75368 0.66647 13.1343 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Log-Likelihood: -591.31 on 5 Df
langscore
is statistically significant. A unit increase in language score leads to a .71 increase in predicted achievement. One of the indicator variables for prog
is also statistically significant. Compared to general programs, academic programs are about 4.07 higher. To determine if prog
itself is statistically significant, we can test models with it in and out for the two degree-of-freedom test of this variable.# update old model dropping prog m2 <- update(m, . ~ . - prog) pchisq(-2 * (logLik(m2) - logLik(m)), df = 2, lower.tail = FALSE)
## 'log Lik.' 0.02516651 (df=3)
prog
is a statistically significant predictor of achiv
. We can get the expected means for each program at the mean of langscore
by reparameterizing the model.
# create mean centered langscore to use later dat <- within(dat, { mlangscore <- langscore - mean(langscore) }) malt <- truncreg(achiv ~ 0 + mlangscore + prog, data = dat, point = 40) summary(malt)
## ## Call: ## truncreg(formula = achiv ~ 0 + mlangscore + prog, data = dat, ## point = 40) ## ## ## Coefficients : ## Estimate Std. Error t-value Pr(>|t|) ## mlangscore 0.71259 0.11448 6.2248 4.82e-10 *** ## proggeneral 49.78926 1.89714 26.2443 < 2.2e-16 *** ## progacademic 53.85340 1.15012 46.8242 < 2.2e-16 *** ## progvocation 48.65315 2.14049 22.7299 < 2.2e-16 *** ## sigma 8.75545 0.66684 13.1299 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Log-Likelihood: -591.31 on 5 Df
langscore
is at its mean for each type of program.
We could also calculate the bootstrapped confidence intervals if we wanted to. First, we define a function that returns the parameters of interest, and then use the boot
function to run the bootstrap.
f <- function(data, i) { require(truncreg) m <- truncreg(formula = achiv ~ langscore + prog, data = data[i, ], point = 40) as.vector(t(summary(m)$coef[, 1:2])) } set.seed(10) (res <- boot(dat, f, R = 1200, parallel = "snow", ncpus = 4))
## ## ORDINARY NONPARAMETRIC BOOTSTRAP ## ## ## Call: ## boot(data = dat, statistic = f, R = 1200, parallel = "snow", ## ncpus = 4) ## ## ## Bootstrap Statistics : ## original bias std. error ## t1* 11.2994158 0.3039135945 5.93277633 ## t2* 6.7717257 -0.0559320949 0.86199032 ## t3* 0.7126732 -0.0040554268 0.09651639 ## t4* 0.1144602 -0.0006848377 0.01372122 ## t5* 4.0626698 -0.0536791573 2.03318632 ## t6* 2.0543191 -0.0014251852 0.24110408 ## t7* -1.1442162 0.0213603619 2.87208656 ## t8* 2.6695799 0.0121405706 0.29443074 ## t9* 8.7536778 -0.1094948867 0.54996087 ## t10* 0.6664744 -0.0109213295 0.07535149
boot.ci
function.
# basic parameter estimates with percentile and bias adjusted CIs parms <- t(sapply(c(1, 3, 5, 7, 9), function(i) { out <- boot.ci(res, index = c(i, i + 1), type = c("perc", "bca")) with(out, c(Est = t0, pLL = percent[4], pUL = percent[5], bcaLL = bca[4], bcaLL = bca[5])) })) # add row names row.names(parms) <- names(coef(m)) # print results parms
## Est pLL pUL bcaLL bcaLL ## (Intercept) 11.2994158 -1.25801473 22.2970167 -3.72311743 21.3202749 ## langscore 0.7126732 0.53860641 0.9159501 0.55079089 0.9435087 ## progacademic 4.0626698 0.05802952 8.0112185 0.08421688 8.0425925 ## progvocation -1.1442162 -6.80545872 4.2770202 -6.84361734 4.2502140 ## sigma 8.7536778 7.67372151 9.7919028 7.88960453 10.1104743
achiv
with the predicted value and squaring the result.
dat$yhat <- fitted(m) # correlation (r <- with(dat, cor(achiv, yhat)))
## [1] 0.5524392
# rough variance accounted for r^2
## [1] 0.305189
truncreg
function is designed to work when the truncation is on the outcome variable in the model. It is possible to have samples that are truncated based on one or more predictors. For example, modeling college GPA as a function of high school GPA (HSGPA) and SAT scores involves a sample that is truncated based on the predictors, i.e., only students with higher HSGPA and SAT scores are admitted into the college.point
= 39 instead of point
= 40, the results would have been slightly different. It does not matter that there were no values of 40 in our sample.