The `library()`

function is used to load *libraries*, or groups of functions and data sets that are not included in the base `R`

distribution. Basic functions that perform least squares linear regression and other simple analyses come standard with the base distribution, but more exotic functions require additional libraries. Here we load the `MASS`

package, which is a very large collection of data sets and functions. We also load the `ISLR2`

package, which includes the data sets associated with this book.

```
library(MASS)
library(ISLR2)
```

If you receive an error message when loading any of these libraries, it likely indicates that the corresponding library has not yet been installed on your system. Some libraries, such as `MASS`

, come with `R`

and do not need to be separately installed on your computer. However, other packages, such as `ISLR2`

, must be downloaded the first time they are used. This can be done directly from within `R`

. For example, on a Windows system, select the `Install package`

option under the `Packages`

tab. After you select any mirror site, a list of available packages will appear. Simply select the package you wish to install and `R`

will automatically download the package. Alternatively, this can be done at the `R`

command line via `install.packages("ISLR2")`

. This installation only needs to be done the first time you use a package. However, the `library()`

function must be called within each `R`

session.

The `ISLR2`

library contains the `Boston`

data set, which records `medv`

(median house value) for \(506\) census tracts in Boston. We will seek to predict `medv`

using \(12\) predictors such as `rmvar`

(average number of rooms per house), `age`

(average age of houses), and `lstat`

(percent of households with low socioeconomic status).

`head(Boston)`

```
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 4.98 24.0
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14 21.6
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 2.94 33.4
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33 36.2
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 5.21 28.7
```

To find out more about the data set, we can type `?Boston`

.

We will start by using the `lm()`

function to fit a simple linear regression model, with `medv`

as the response and `lstat`

as the predictor. The basic syntax is {}, where `y`

is the response, `x`

is the predictor, and `data`

is the data set in which these two variables are kept.

`lm.fit <- lm(medv ~ lstat)`

`## Error in eval(predvars, data, env): object 'medv' not found`

The command causes an error because `R`

does not know where to find the variables `medv`

and `lstat`

. The next line tells `R`

that the variables are in `Boston`

. If we attach `Boston`

, the first line works fine because `R`

now recognizes the variables.

```
lm.fit <- lm(medv ~ lstat, data = Boston)
attach(Boston)
lm.fit <- lm(medv ~ lstat)
```

If we type `lm.fit`

, some basic information about the model is output. For more detailed information, we use `summary(lm.fit)`

. This gives us \(p\)-values and standard errors for the coefficients, as well as the \(R^2\) statistic and \(F\)-statistic for the model.

`lm.fit`

```
##
## Call:
## lm(formula = medv ~ lstat)
##
## Coefficients:
## (Intercept) lstat
## 34.55 -0.95
```

`summary(lm.fit)`

```
##
## Call:
## lm(formula = medv ~ lstat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
```

We can use the `names()`

function in order to find out what other pieces of information are stored in `lm.fit`

. Although we can extract these quantities by nameâ€”e.g. `lm.fit$coefficients`

â€”it is safer to use the extractor functions like `coef()`

to access them.

`names(lm.fit)`

```
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
```

`coef(lm.fit)`

```
## (Intercept) lstat
## 34.5538409 -0.9500494
```

In order to obtain a confidence interval for the coefficient estimates, we can use the `confint()`

command. %Type `confint(lm.fit)`

at the command line to obtain the confidence intervals.

`confint(lm.fit)`

```
## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lstat -1.026148 -0.8739505
```

The `predict()`

function can be used to produce confidence intervals and prediction intervals for the prediction of `medv`

for a given value of `lstat`

.

```
predict(lm.fit, data.frame(lstat = (c(5, 10, 15))),
interval = "confidence")
```

```
## fit lwr upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
```

```
predict(lm.fit, data.frame(lstat = (c(5, 10, 15))),
interval = "prediction")
```

```
## fit lwr upr
## 1 29.80359 17.565675 42.04151
## 2 25.05335 12.827626 37.27907
## 3 20.30310 8.077742 32.52846
```

For instance, the 95,% confidence interval associated with a `lstat`

value of 10 is \((24.47, 25.63)\), and the 95,% prediction interval is \((12.828, 37.28)\). As expected, the confidence and prediction intervals are centered around the same point (a predicted value of \(25.05\) for `medv`

when `lstat`

equals 10), but the latter are substantially wider.

We will now plot `medv`

and `lstat`

along with the least squares regression line using the `plot()`

and `abline()`

functions.

```
plot(lstat, medv)
abline(lm.fit)
```

There is some evidence for non-linearity in the relationship between `lstat`

and `medv`

. We will explore this issue later in this lab.

The `abline()`

function can be used to draw any line, not just the least squares regression line. To draw a line with intercept `a`

and slope `b`

, we type `abline(a, b)`

. Below we experiment with some additional settings for plotting lines and points. The `lwd = 3`

command causes the width of the regression line to be increased by a factor of 3; this works for the `plot()`

and `lines()`

functions also. We can also use the `pch`

option to create different plotting symbols.

```
plot(lstat, medv)
abline(lm.fit, lwd = 3)
abline(lm.fit, lwd = 3, col = "red")
```