Robust regression

Robust regression is an alternative to least squares regression when the data are contaminated with outliers or influential observations; it can also be used to detect influential observations.

library(MASS)     # provides rlm() for robust regression
library(foreign)  # utilities for reading data stored in other formats

Introduction and theoretical background

Let's begin our discussion on robust regression with some terms in linear regression.

Residual: The difference between the observed value and the value predicted by the regression equation, i.e., \(u=Y-\hat{Y}\).

Outlier: In linear regression, an outlier is an observation with a large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity, a data entry error, or some other problem.

Leverage: An observation with an extreme value on a predictor variable is a point with high leverage. Leverage is a measure of how far an observation's value on an independent variable deviates from that variable's mean. High-leverage points can have a large effect on the estimates of the regression coefficients.

Influence: An observation is said to be influential if removing the observation substantially changes the estimate of the regression coefficients. Influence can be thought of as the product of leverage and outlierness.

Cook's distance (or Cook's D): A measure that combines the information in the leverage and the residual of an observation.
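
For reference, a standard way to write Cook's distance for the \(i\)-th observation (this explicit formula is not used later, but it makes the combination of residual and leverage visible) is \[ D_i=\frac{e_i^2}{p\,\hat{\sigma}^2}\cdot\frac{h_{ii}}{(1-h_{ii})^2}, \] where \(e_i\) is the residual, \(h_{ii}\) the leverage, \(p\) the number of estimated coefficients, and \(\hat{\sigma}^2\) the estimated error variance; a large residual and high leverage both inflate \(D_i\).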

Robust regression can be used in any situation in which you would use least squares regression. When fitting a least squares regression \[ \sum_{i=1}^n(Y_i-{\bf X}_{i,\bullet}{\boldsymbol\beta})^2 \looparrowright \min_{\boldsymbol\beta}, \] we might find some outliers or high-leverage data points. Suppose we have decided that these data points are not data entry errors, nor are they from a different population than most of our data, so we have no compelling reason to exclude them from the analysis. Robust regression might then be a good strategy, since it is a compromise between excluding these points entirely and including all the data points while treating them all equally, as OLS regression does. The idea of robust regression is to weight the observations differently according to how well behaved they are. Roughly speaking, it is a form of weighted and reweighted least squares regression.

The rlm command in the MASS package implements several versions of robust regression. In this page, we will show \(M\)-estimation with Huber and Tukey (a.k.a. bisquare) weighting. These two are very standard. \(M\)-estimation minimizes \[ \sum_{i=1}^n\rho(Y_i-{\bf X}_{i,\bullet}{\boldsymbol\beta}) \looparrowright \min_{\boldsymbol\beta} \] with respect to the unknown \({\boldsymbol\beta}\), where the loss function \(\rho:\mathbb{R}\to\mathbb{R}\) is absolutely continuous and usually convex, with derivative \(\psi(x)=d\rho(x)/d x\) (the so-called influence function). The \(M\)-estimator of \({\boldsymbol\beta}\) based on the function \(\rho\) is the solution of the system of equations \[ \sum_{i=1}^n {\bf X}_{i,\bullet}^{\top}\psi(Y_i-{\bf X}_{i,\bullet}{\boldsymbol\beta})={\boldsymbol 0}. \]

If we now define a weight function \(w(x)=\psi(x)/x\), the previous non-linear system of equations becomes the so-called estimating equation \[ \sum_{i=1}^n {\bf X}_{i,\bullet}^{\top}(Y_i-{\bf X}_{i,\bullet}{\boldsymbol\beta})w(Y_i-{\bf X}_{i,\bullet}{\boldsymbol\beta})={\boldsymbol 0}. \] But the weights depend on the residuals and the residuals on the weights, so the equation is solved using Iteratively Reweighted Least Squares (IRLS). If \(r_i=Y_i-{\bf X}_{i,\bullet}{\boldsymbol\beta}\), then the IRLS problem at iteration \(k\) is \[ \sum_{i=1}^n w(r_i^{(k-1)})r_i^2 \looparrowright \min_{\boldsymbol\beta}, \] where the superscript \((k)\) indicates the iteration number. The weights \(w(r_i^{(k-1)})\) are recomputed after each iteration in order to be used in the next one. In matrix form, with \({\bf W}^{(k-1)}\) the diagonal matrix of the weights \(w(r_i^{(k-1)})\), the coefficient vector at iteration \(k\) is \[ \widehat{\boldsymbol\beta}^{(k)}=({\bf X}^{\top}{\bf W}^{(k-1)}{\bf X})^{-1}{\bf X}^{\top}{\bf W}^{(k-1)}{\bf Y}. \] The process continues until it converges.
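
To make the iteration concrete, here is a minimal hand-rolled sketch of the IRLS loop for Huber weights in R. The function name irls_huber, the fixed tuning constant k = 1.345, and the MAD scale estimate are illustrative assumptions; the rlm command used below implements the same idea properly, including a more careful scale estimate.

irls_huber <- function(X, Y, k = 1.345, maxit = 50, tol = 1e-8) {
  beta <- solve(crossprod(X), crossprod(X, Y))   # start from the OLS fit
  for (it in seq_len(maxit)) {
    r <- drop(Y - X %*% beta)                    # current residuals
    s <- mad(r)                                  # robust scale estimate
    w <- pmin(1, k / abs(r / s))                 # Huber weights: 1 for small residuals, k/|r/s| otherwise
    beta_new <- solve(crossprod(X, w * X), crossprod(X, w * Y))   # weighted least squares step
    if (max(abs(beta_new - beta)) < tol) break   # stop once the coefficients stabilize
    beta <- beta_new
  }
  drop(beta_new)
}
# With the crime data loaded below, it could be called as
# irls_huber(model.matrix(~ poverty + single, cdata), cdata$crime)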

In Huber weighting, observations with small residuals get a weight of 1 and the larger the residual, the smaller the weight. This is defined by the weight function \[ w(x)=\left\{ \begin{array}{cc} 1, & |x|\leq k,\\ k/|x|, & |x|>k. \end{array} \right. \] With Tukey's bisquare weighting, all cases with a non-zero residual get down-weighted at least a little \[ w(x)=\left\{ \begin{array}{cc} [1-(x/c)^2]^2, & |x|\leq c,\\ 0, & |x|>c. \end{array} \right. \]
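
As a quick illustration of the two weight functions, MASS provides them as psi.huber and psi.bisquare; with the default deriv = 0 they return the weight \(w(x)\) rather than \(\psi(x)\), using the default tuning constants \(k=1.345\) and \(c=4.685\). The grid of values below is just an arbitrary example.

x <- seq(-6, 6, by = 2)
cbind(x,
      huber    = psi.huber(x),      # 1 for |x| <= k, k/|x| beyond
      bisquare = psi.bisquare(x))   # smooth down-weighting, exactly 0 beyond c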

Crime data

For our data analysis below, we will use the crime dataset that appears in Statistical Methods for Social Sciences, Third Edition by Alan Agresti and Barbara Finlay (Prentice Hall, 1997). The variables are state id (sid), state name (state), violent crimes per 100,000 people (crime), murders per 1,000,000 (murder), the percent of the population living in metropolitan areas (pctmetro), the percent of the population that is white (pctwhite), the percent of the population with a high school education or above (pcths), the percent of the population living under the poverty line (poverty), and the percent of the population that are single parents (single). It has 51 observations. We are going to use poverty and single to predict crime.

cdata <- read.csv("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/crime.csv")
summary(cdata)
##       sid           state        crime            murder      
##  Min.   : 1.0   ak     : 1   Min.   :  82.0   Min.   : 1.600  
##  1st Qu.:13.5   al     : 1   1st Qu.: 326.5   1st Qu.: 3.900  
##  Median :26.0   ar     : 1   Median : 515.0   Median : 6.800  
##  Mean   :26.0   az     : 1   Mean   : 612.8   Mean   : 8.727  
##  3rd Qu.:38.5   ca     : 1   3rd Qu.: 773.0   3rd Qu.:10.350  
##  Max.   :51.0   co     : 1   Max.   :2922.0   Max.   :78.500  
##                 (Other):45                                    
##     pctmetro         pctwhite         pcths          poverty     
##  Min.   : 24.00   Min.   :31.80   Min.   :64.30   Min.   : 8.00  
##  1st Qu.: 49.55   1st Qu.:79.35   1st Qu.:73.50   1st Qu.:10.70  
##  Median : 69.80   Median :87.60   Median :76.70   Median :13.10  
##  Mean   : 67.39   Mean   :84.12   Mean   :76.22   Mean   :14.26  
##  3rd Qu.: 83.95   3rd Qu.:92.60   3rd Qu.:80.10   3rd Qu.:17.40  
##  Max.   :100.00   Max.   :98.50   Max.   :86.60   Max.   :26.40  
##                                                                  
##      single     
##  Min.   : 8.40  
##  1st Qu.:10.05  
##  Median :10.90  
##  Mean   :11.33  
##  3rd Qu.:12.05  
##  Max.   :22.10  
## 

Using robust regression analysis

In most cases, we begin by running an OLS regression and doing some diagnostics. Here we fit an OLS model and look at diagnostic plots examining the residuals, fitted values, Cook's distance, and leverage.

summary(ols <- lm(crime ~ poverty + single, data = cdata))
## 
## Call:
## lm(formula = crime ~ poverty + single, data = cdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -811.14 -114.27  -22.44  121.86  689.82 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1368.189    187.205  -7.308 2.48e-09 ***
## poverty         6.787      8.989   0.755    0.454    
## single        166.373     19.423   8.566 3.12e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 243.6 on 48 degrees of freedom
## Multiple R-squared:  0.7072, Adjusted R-squared:  0.695 
## F-statistic: 57.96 on 2 and 48 DF,  p-value: 1.578e-13

opar <- par(mfrow = c(3, 2), oma = c(0, 0, 1.1, 0))
plot(ols, las = 1)            # default diagnostics: residuals vs fitted, Q-Q, scale-location, residuals vs leverage
plot(ols, which = c(4, 6))    # Cook's distance and Cook's distance vs leverage
par(opar)                     # restore the original graphical parameters

[Figure: diagnostic plots for the OLS fit (residuals, Q-Q, scale-location, leverage, and Cook's distance)]

From these plots, we can identify observations 9, 25, and 51 as possibly problematic to our model. We can look at these observations to see which states they represent.

cdata[c(9, 25, 51), 1:2]
##    sid state
## 9    9    fl
## 25  25    ms
## 51  51    dc

DC, Florida and Mississippi have either high leverage or large residuals. We can display the observations that have relatively large values of Cook's D. A conventional cut-off point is \(4/n\), where \(n\) is the number of observations in the data set. We will use this criterion to select the values to display.

d1 <- cooks.distance(ols)   # Cook's distances
r <- stdres(ols)            # standardized residuals (stdres is from MASS)
a <- cbind(cdata, d1, r)
a[d1 > 4/51, ]              # observations exceeding the 4/n cut-off
##    sid state crime murder pctmetro pctwhite pcths poverty single        d1
## 1    1    ak   761    9.0     41.8     75.2  86.6     9.1   14.3 0.1254750
## 9    9    fl  1206    8.9     93.0     83.5  74.4    17.8   10.6 0.1425891
## 25  25    ms   434   13.5     30.7     63.3  64.3    24.7   14.7 0.6138721
## 51  51    dc  2922   78.5    100.0     31.8  73.1    26.4   22.1 2.6362519
##            r
## 1  -1.397418
## 9   2.902663
## 25 -3.562990
## 51  2.616447

We probably should drop DC to begin with since it is not even a state. We include it in the analysis just to show that it has a large Cook's D and to demonstrate how it will be handled by rlm. Now we will look at the residuals. We will generate a new variable called rabs, the absolute value of the residuals (because the sign of the residual doesn't matter). We then print the ten observations with the highest absolute residual values.

rabs <- abs(r)
a <- cbind(cdata, d1, r, rabs)
asorted <- a[order(-rabs), ]
asorted[1:10, ]
##    sid state crime murder pctmetro pctwhite pcths poverty single
## 25  25    ms   434   13.5     30.7     63.3  64.3    24.7   14.7
## 9    9    fl  1206    8.9     93.0     83.5  74.4    17.8   10.6
## 51  51    dc  2922   78.5    100.0     31.8  73.1    26.4   22.1
## 46  46    vt   114    3.6     27.0     98.4  80.8    10.0   11.0
## 26  26    mt   178    3.0     24.0     92.6  81.0    14.9   10.8
## 21  21    me   126    1.6     35.7     98.5  78.8    10.7   10.6
## 1    1    ak   761    9.0     41.8     75.2  86.6     9.1   14.3
## 31  31    nj   627    5.3    100.0     80.8  76.7    10.9    9.6
## 14  14    il   960   11.4     84.0     81.0  76.2    13.6   11.5
## 20  20    md   998   12.7     92.8     68.9  78.4     9.7   12.0
##            d1         r     rabs
## 25 0.61387212 -3.562990 3.562990
## 9  0.14258909  2.902663 2.902663
## 51 2.63625193  2.616447 2.616447
## 46 0.04271548 -1.742409 1.742409
## 26 0.01675501 -1.460885 1.460885
## 21 0.02233128 -1.426741 1.426741
## 1  0.12547500 -1.397418 1.397418
## 31 0.02229184  1.354149 1.354149
## 14 0.01265689  1.338192 1.338192
## 20 0.03569623  1.287087 1.287087

Now let's run our first robust regression. Robust regression is done by iteratively reweighted least squares (IRLS). The command for running robust regression is rlm in the MASS package. Several weighting functions can be used for IRLS. We will first use the Huber weights in this example, and then look at the final weights created by the IRLS process. This can be very useful.

summary(rr.huber <- rlm(crime ~ poverty + single, data = cdata))
## 
## Call: rlm(formula = crime ~ poverty + single, data = cdata)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -846.09 -125.80  -16.49  119.15  679.94 
## 
## Coefficients:
##             Value      Std. Error t value   
## (Intercept) -1423.0373   167.5899    -8.4912
## poverty         8.8677     8.0467     1.1020
## single        168.9858    17.3878     9.7186
## 
## Residual standard error: 181.8 on 48 degrees of freedom

hweights <- data.frame(state = cdata$state, resid = rr.huber$resid, weight = rr.huber$w)
hweights2 <- hweights[order(rr.huber$w), ]
hweights2[1:15, ]
##    state      resid    weight
## 25    ms -846.08536 0.2889618
## 9     fl  679.94327 0.3595480
## 46    vt -410.48310 0.5955740
## 51    dc  376.34468 0.6494131
## 26    mt -356.13760 0.6864625
## 21    me -337.09622 0.7252263
## 31    nj  331.11603 0.7383578
## 14    il  319.10036 0.7661169
## 1     ak -313.15532 0.7807432
## 20    md  307.19142 0.7958154
## 19    ma  291.20817 0.8395172
## 18    la -266.95752 0.9159411
## 2     al  105.40319 1.0000000
## 3     ar   30.53589 1.0000000
## 4     az  -43.25299 1.0000000

We can see that, roughly, as the absolute residual goes down, the weight goes up. In other words, cases with large residuals tend to be down-weighted. This output shows us that the observation for Mississippi will be down-weighted the most. Florida will also be substantially down-weighted. All observations not shown above have a weight of 1. In OLS regression, all cases have a weight of 1. Hence, the more cases in the robust regression that have a weight close to one, the closer the results of the OLS and robust regressions.
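
As a quick check (a small sketch; the exact count depends on the fit), we can tally how many observations the Huber fit down-weights at all, using a small tolerance rather than an exact floating-point comparison.

sum(rr.huber$w < 1 - 1e-8)   # number of observations with weight strictly below 1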

Next, let's run the same model, but using the bisquare weighting function. Again, we can look at the weights.

rr.bisquare <- rlm(crime ~ poverty + single, data = cdata, psi = psi.bisquare)
summary(rr.bisquare)
## 
## Call: rlm(formula = crime ~ poverty + single, data = cdata, psi = psi.bisquare)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -905.59 -140.97  -14.98  114.65  668.38 
## 
## Coefficients:
##             Value      Std. Error t value   
## (Intercept) -1535.3338   164.5062    -9.3330
## poverty        11.6903     7.8987     1.4800
## single        175.9303    17.0678    10.3077
## 
## Residual standard error: 202.3 on 48 degrees of freedom

biweights <- data.frame(state = cdata$state, resid = rr.bisquare$resid, weight = rr.bisquare$w)
biweights2 <- biweights[order(rr.bisquare$w), ]
biweights2[1:15, ]
##    state     resid      weight
## 25    ms -905.5931 0.007652565
## 9     fl  668.3844 0.252870542
## 46    vt -402.8031 0.671495418
## 26    mt -360.8997 0.731136908
## 31    nj  345.9780 0.751347695
## 18    la -332.6527 0.768938330
## 21    me -328.6143 0.774103322
## 1     ak -325.8519 0.777662383
## 14    il  313.1466 0.793658594
## 20    md  308.7737 0.799065530
## 19    ma  297.6068 0.812596833
## 51    dc  260.6489 0.854441716
## 50    wy -234.1952 0.881660897
## 5     ca  201.4407 0.911713981
## 10    ga -186.5799 0.924033113

We can see that the weight given to Mississippi is dramatically lower using the bisquare weighting function than the Huber weighting function, and the parameter estimates from these two weighting methods differ. When comparing the results of a regular OLS regression and a robust regression, if the results are very different, you will most likely want to use the results from the robust regression. Large differences suggest that the model parameters are being highly influenced by outliers. Different weight functions have advantages and drawbacks: Huber weights can have difficulties with severe outliers, and bisquare weights can have difficulties converging or may yield multiple solutions.
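
If a bisquare fit does struggle to converge, rlm accepts arguments that can help, such as a more robust starting fit (init = "lts"), more iterations (maxit), or an MM-estimate (method = "MM"). The calls below are only a sketch of these options; none of them is needed for the crime data above.

rlm(crime ~ poverty + single, data = cdata, psi = psi.bisquare, init = "lts", maxit = 100)
rlm(crime ~ poverty + single, data = cdata, method = "MM")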

As you can see, the results from the two analyses are fairly different, especially with respect to the coefficients of single and the constant (intercept). While normally we are not interested in the constant, if you had centered one or both of the predictor variables, the constant would be useful. On the other hand, you will notice that poverty is not statistically significant in either analysis, whereas single is significant in both analyses.
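
A compact way to see these differences is to line the coefficient vectors up side by side; this small sketch simply reuses the three fitted objects from above.

round(cbind(OLS      = coef(ols),
            Huber    = coef(rr.huber),
            Bisquare = coef(rr.bisquare)), 2)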

Things to consider

References