NMST539, LS 2015/16

Exercise 12 (Week 14)

Kernel Density and Regression Estimation

Additive Models and Projection Pursuit


Kernel Density Estimation

  • Consider a random sample \(X_{1}, \dots, X_{n}\) from a univariate distribution given by some density \(f(x)\), for \(x \in \mathbb{R}\).
  • In some situations the density depends on some unknown parameter(s), and thus by estimating the parameter(s) we automatically obtain the shape of the unknown distribution (density). The key assumption is that the unknown distribution (density) belongs to some parametric class of densities (e.g. normal densities, or exponential densities, etc.).
  • However, in some situations the estimation of the density via estimating parameters is not possible, as we either do not know which parametric family the underlying density belongs to, or it may not belong to any such family at all. In such cases we need to look for alternative approaches to estimate the unknown distribution.
  • One (somewhat rough) approach is to use a regular histogram with a bin width small enough. This is, however, only feasible for a sample size large enough, as otherwise we end up with bins containing no observations at all.
  • A naive approach based on maximum likelihood fails for density estimation, since the likelihood is maximized by placing point masses at the observed data \(\Rightarrow\) the so-called Dirac catastrophe.
  • Can we somehow do better?
rm(list = ls())
require(graphics)
summary(faithful)
##    eruptions        waiting    
##  Min.   :1.600   Min.   :43.0  
##  1st Qu.:2.163   1st Qu.:58.0  
##  Median :4.000   Median :76.0  
##  Mean   :3.488   Mean   :70.9  
##  3rd Qu.:4.454   3rd Qu.:82.0  
##  Max.   :5.100   Max.   :96.0
hist(faithful[,1], col = "lightblue", main = "Histogram for an eruption duration", xlab = "Eruption duration [min]")

And given the fact that we have quite enough observations (\(n = 272\)), we can decrease the bin size (fewer observations fall into each bin, but we gain more sensitivity with respect to the variability of the density).

par(mfrow = c(2,2))
hist(faithful[,1], col = "lightblue", main = "Histogram for an eruption duration (15)", xlab = "Eruption duration [min]", breaks = 15)
hist(faithful[,1], col = "lightblue", main = "Histogram for an eruption duration (25)", xlab = "Eruption duration [min]", breaks = 25)
hist(faithful[,1], col = "lightblue", main = "Histogram for an eruption duration (50)", xlab = "Eruption duration [min]", breaks = 50)
hist(faithful[,1], col = "lightblue", main = "Histogram for an eruption duration (200)", xlab = "Eruption duration [min]", breaks = 200)

Usually we assume that the unknown density function is smooth. Using the histogram approach one can never obtain a smooth estimate of the density, so other statistical approaches need to be considered. This is where kernel based density estimation comes in.

  • There are many different approaches to kernel density estimation in statistics. Let us mention at least the most popular one.
  • The kernel density estimate (KDE) is defined by the expression \[\hat{f}(x) = \frac{1}{n h_{n}} \sum_{i = 1}^{n} K \Big(\frac{x - X_{i}}{h_{n}}\Big),\]
  • where \(h_{n} \to 0\) as \(n \to \infty\), however, not too fast: we also need \(n h_{n} \to \infty\). The function \(K(\cdot)\) is a kernel function (usually a positive and symmetric function) defined on the interval \([-1, 1]\).

  • In order to obtain the estimate of the unknown density function one needs to evaluate the expression above for each \(x \in \mathcal{D}\), where \(\mathcal{D}\) is the domain of the unknown density function (see the sketch below).
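
To make the formula concrete, here is a minimal sketch which evaluates the KDE on a grid. The function names kde and epanechnikov and the bandwidth value are our own illustrative choices, not part of any package.

epanechnikov <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)

kde <- function(x, data, h, kernel = epanechnikov) {
  # evaluate (1 / (n h)) * sum_i K((x - X_i) / h) at each grid point x
  sapply(x, function(xx) sum(kernel((xx - data) / h)) / (length(data) * h))
}

xgrid <- seq(1, 6, length = 500)
plot(xgrid, kde(xgrid, faithful[,1], h = 0.3), type = "l", col = "red",
     lwd = 2, xlab = "Eruption duration [min]", ylab = "Density")
lines(xgrid, kde(xgrid, faithful[,1], h = 0.3, kernel = dnorm),
      col = "blue", lwd = 2, lty = 2)

Note that swapping the kernel (here the Gaussian kernel dnorm is used as an alternative) typically changes the estimate much less than changing the bandwidth h.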

A standard function in R for obtaining a KDE is density(), which is available in the standard R installation.

plot(density(faithful[,1]), col = "red", lwd = 2)

And for different values of the bandwidth parameter

par(mfrow = c(2,2))
plot(density(faithful[,1], adjust = 0.1), col = "red", lwd = 2, main = "KDE 1")
plot(density(faithful[,1], adjust = 0.5), col = "red", lwd = 2, main = "KDE 2")
plot(density(faithful[,1], adjust = 1), col = "red", lwd = 2, main = "KDE 3")
plot(density(faithful[,1], adjust = 2), col = "red", lwd = 2, main = "KDE 4")

Questions and ToDos


  • Try to program a piece of R code which will calculate the KDE for some specific kernel functions (free choice, but at least two different ones).
  • Using the code you prepared, try different values of the bandwidth parameter and discuss what is more important in kernel density estimation: the value of the bandwidth parameter, or the choice of the kernel function?
  • If there is an (easy) approach to estimate any possible density function (of arbitrary shape), why do we still, in some situations, prefer the classical estimation based on parameter estimates plugged into the corresponding parametric form of the density?



Kernel Regression Estimation

A straightforward extension of the idea of kernel density estimation can also be applied to regression estimation. In classical parametric regression (either linear or nonlinear) we assume a priori some specific parametric form of the unknown regression function, based for example on prior knowledge or previous experience. In other situations, however, it is not possible to assume any parametric form, or the unknown regression function cannot be expressed using a parametric prescription. In such situations a kernel regression approach is again suitable.

  • Nadaraya–Watson estimator - the idea is to use a shifting kernel window of a fixed length controlled by the value of the bandwidth parameter. To be specific, the regression estimate is given by the following expression: \[\hat{m}(x) = \frac{\sum_{i = 1}^{n} K \Big(\frac{x - X_{i}}{h_{n}}\Big) Y_{i}}{\sum_{i = 1}^{n} K \Big(\frac{x - X_{i}}{h_{n}}\Big)},\]
  • where \(h_{n} \to 0\) as \(n \to \infty\), however, not too fast: we also need \(n h_{n} \to \infty\). The function \(K(\cdot)\) is a kernel function (usually a positive and symmetric function) defined on the interval \([-1, 1]\).
  • In order to obtain the estimate of the unknown regression function one needs to evaluate the expression above for each \(x \in \mathcal{D}\), where \(\mathcal{D}\) is the domain of the unknown regression function (see the sketch after this list).

  • Gasser–Müller estimator - the idea is the same as for the Nadaraya–Watson estimator but the formula to obtain the estimate is different. It takes the form \[ \hat{m}(x) = \frac{1}{h_{n}} \sum_{i = 1}^{n} Y_{i} \int_{s_{i - 1}}^{s_{i}} K \Big(\frac{x - u}{h_{n}}\Big) \mbox{d}u, \] where the \(s_{i}\)'s are the midpoints between consecutive (ordered) design points.
  • Priestley–Chao estimator - the idea is again the same as for the Nadaraya–Watson estimator but the formula to obtain the estimate is different. It takes the form \[ \hat{m}(x) = \frac{1}{h_{n}} \sum_{i = 1}^{n} (x_{i} - x_{i - 1}) Y_{i} K \Big(\frac{x - x_{i}}{h_{n}}\Big).\]
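
To make the Nadaraya–Watson estimator concrete, here is a minimal sketch with a Gaussian kernel. The function name nw and the bandwidth value are our own illustrative choices.

nw <- function(x, X, Y, h) {
  sapply(x, function(xx) {
    w <- dnorm((xx - X) / h)   # kernel weights around the point xx
    sum(w * Y) / sum(w)        # locally weighted average of the responses
  })
}

xgrid <- seq(43, 96, length = 500)
plot(faithful[,2], faithful[,1], pch = 22, bg = "gray",
     xlab = "Waiting time [min]", ylab = "Eruption duration [min]")
lines(xgrid, nw(xgrid, faithful[,2], faithful[,1], h = 4), col = "red", lwd = 2)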

In the R environment there is a standard function ksmooth() which can be used to fit the Nadaraya-Watson kernel estimate.

plot(faithful[,1] ~ faithful[,2], pch = 22, bg = "gray", xlab = "Waiting time [min]", ylab = "Eruption duration [min]")
lines(ksmooth(faithful[,2], faithful[,1], kernel = "normal", bandwidth = 9), col = "red", lwd = 2)
xseq <- seq(40, 100, length = 1000)
m1 <- lm(eruptions ~ waiting, data = faithful)
m2 <- lm(eruptions ~ waiting + I(waiting^2), data = faithful)
m3 <- lm(eruptions ~ waiting + I(waiting^2) + I(waiting^3), data = faithful)
abline(m1$coeff[1], m1$coeff[2], col = "orange", lwd = 2, lty = 2)
lines((m2$coeff[1] + m2$coeff[2] * xseq + m2$coeff[3] * xseq^2) ~ xseq, col = "violet", lwd = 2, lty = 2)
lines((m3$coeff[1] + m3$coeff[2] * xseq + m3$coeff[3] * xseq^2 + m3$coeff[4] * xseq^3) ~ xseq, col = "blue", lwd = 2, lty = 2)
legend(43, 5, legend = c("Kernel Regression", "Linear Regression", "Quadratic Regression", "Cubic Regression"), 
              col = c("red", "orange", "violet", "blue"), lty = c(1,2,2,2), lwd = c(2,2,2,2))



Questions and ToDos


  • Try to program a piece of R code which will calculate the Nadaraya–Watson regression estimate.
  • Based on your experience from the kernel density estimation, what do you expect regarding the importance of the bandwidth parameter choice and the kernel function selection?
  • Try to discuss the main pros & cons of the nonparametric (kernel) regression estimation.



Additive Models

Let us now go back to the world of multivariate statistics as we briefly discuss additive models. Let us assume a random sample \(\{(\boldsymbol{X}_{i}, Y_{i});~ i = 1, \dots, n\}\), where \(\boldsymbol{X}_{i} \in \mathbb{R}^{p}\) for some \(p \in \mathbb{N}\). The sample is considered to be drawn from some unknown population \((\mathcal{X}, \mathcal{Y})\), where we assume that there is a common relationship between \(Y\) and \(\boldsymbol{X}\) which can be expressed as \[ Y_{i} = f(\boldsymbol{X}_{i}) + \varepsilon_{i}, \] for some random error terms (e.g. \(\varepsilon_i \sim N(0, \sigma^2)\)) and some unknown regression function \(f:~\mathbb{R}^{p} \longrightarrow \mathbb{R}\). The idea is to estimate the unknown function \(f(\cdot)\).

Additive Decomposition Assumption

Instead of fitting the \(p\)-variate function \(f\) we assume that it can be decomposed into \(p\) elements as \[ f(X_{i 1}, \dots, X_{i p}) = \sum_{j = 1}^{p} f_{j}(X_{i j}), \] where the additive components are again assumed to be unknown functions which need to be estimated; however, under the additive decomposition assumption each covariate has its own functional component, which is already univariate.

Note that under the additional assumption that each functional component is parametric for every \(j = 1, \dots, p\), we are back at the classical parametric regression problem. On the other hand, if we allow for an additional decomposition, we can consider the functional form given as

\[ f(X_{i 1}, \dots, X_{i p}) = \sum_{j = 1}^{p} f_{j}(X_{i j}) + \sum_{j \neq k} f_{j k}(X_{i j}, X_{i k}).\]

And again, it is easy to see that once we restrict ourselves to parametric shapes of all functional components in the expansion above, we get a classical linear regression setup with interactions.
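
In practice the additive components are typically estimated by the backfitting algorithm: each \(f_{j}\) is iteratively re-estimated by smoothing the partial residuals against the \(j\)-th covariate. Below is a minimal backfitting sketch using smooth.spline() as the univariate smoother; the function name backfit, the fixed number of iterations, and the choice of covariates are our own illustrative assumptions.

backfit <- function(X, y, iter = 20) {
  n <- nrow(X); p <- ncol(X)
  alpha <- mean(y)                   # overall intercept
  f <- matrix(0, n, p)               # fitted additive components
  for (it in 1:iter) {
    for (j in 1:p) {
      r <- y - alpha - rowSums(f[, -j, drop = FALSE])  # partial residuals
      f[, j] <- predict(smooth.spline(X[, j], r), X[, j])$y
      f[, j] <- f[, j] - mean(f[, j])                  # center each component
    }
  }
  list(alpha = alpha, f = f, fitted = alpha + rowSums(f))
}

bf <- backfit(as.matrix(mtcars[, c("qsec", "hp", "wt")]), mtcars$mpg)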

For fitting additive models in R one can use the R package ‘gam’ (Generalized Additive Models) or, alternatively, the R package ‘mgcv’, which needs to be installed first using install.packages("gam") (or install.packages("mgcv"), respectively). In both packages the corresponding function which fits an additive model to the available data is gam().

library(gam)
summary(gm1 <- gam(mpg ~ s(qsec) + s(hp) + wt + as.factor(cyl), data = mtcars))
## 
## Call: gam(formula = mpg ~ s(qsec) + s(hp) + wt + as.factor(cyl), data = mtcars)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2851 -1.0471 -0.3739  0.7527  3.9300 
## 
## (Dispersion Parameter for gaussian family taken to be 4.8211)
## 
##     Null Deviance: 1126.047 on 31 degrees of freedom
## Residual Deviance: 96.4213 on 19.9999 degrees of freedom
## AIC: 152.108 
## 
## Number of Local Scoring Iterations: 3 
## 
## Anova for Parametric Effects
##                Df Sum Sq Mean Sq F value    Pr(>F)    
## s(qsec)         1 192.79  192.79 39.9882 3.583e-06 ***
## s(hp)           1 448.93  448.93 93.1171 5.743e-09 ***
## wt              1 270.19  270.19 56.0429 3.200e-07 ***
## as.factor(cyl)  2  11.28    5.64  1.1697    0.3308    
## Residuals      20  96.42    4.82                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                Npar Df Npar F   Pr(F)  
## (Intercept)                            
## s(qsec)              3 1.8790 0.16567  
## s(hp)                3 4.0482 0.02117 *
## wt                                     
## as.factor(cyl)                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Going back to the example with the ‘mtcars’ dataset…

par(mfrow = c(2,2))
plot(gm1, residuals = TRUE, se = TRUE)

Questions and ToDos


  • How would you interpret the model above? What would be the conclusion?
  • Under which circumstances are you able to predict the expected mileage for some new car if you are provided with information on the covariates included in the model?
  • In a similar manner as for the kernel regression estimation, discuss again what this approach is suitable for and in which situations it should not be used at all.
  • Try to fit some other additive model and explore the form of the dependence in the data. What is the interpretation of your model?
  • There are more packages available in the R environment which can fit additive models.



Projection Pursuit Regression

Projection pursuit regression is a generalization of the additive regression model where, instead of the classical decomposition

\[ f(X_{i 1}, \dots, X_{i p}) = \sum_{j = 1}^{p} f_{j}(X_{i j}) ,\]

in which each covariate \(X_{j}\) has its own functional component, we rather assume an additive decomposition of the form

\[ f(X_{i 1}, \dots, X_{i p}) = \sum_{j = 1}^{k} f_{j}(\boldsymbol{\beta}_{j}^{\top}\boldsymbol{X}_{i}).\]

Note that the functional components in the additive decomposition above are again just univariate unknown functions which we want to estimate; in this case, however, the interpretation is not with respect to one covariate only, but rather with respect to some “suitable” linear transformation of the original covariates. The number of additive components \(k\) does not necessarily have to equal \(p\) in order to capture the effect of each covariate in the model.
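
To illustrate the idea, here is a rough single-ridge-term sketch: the projection direction is optimized numerically while the ridge function is estimated by a smoothing spline. The function name fit_ppr1 and the combination of optim() with smooth.spline() are our own simplifying assumptions; the actual fitting algorithm used by ppr() is more elaborate.

fit_ppr1 <- function(X, y) {
  rss <- function(beta) {
    beta <- beta / sqrt(sum(beta^2))   # normalize the projection direction
    z <- as.vector(X %*% beta)         # project the covariates
    fit <- smooth.spline(z, y)         # estimate the ridge function
    sum((y - predict(fit, z)$y)^2)     # residual sum of squares
  }
  opt <- optim(rep(1, ncol(X)), rss)
  beta <- opt$par / sqrt(sum(opt$par^2))
  list(beta = beta, ridge = smooth.spline(as.vector(X %*% beta), y))
}

pp1 <- fit_ppr1(scale(as.matrix(mtcars[, c("qsec", "hp", "wt")])), mtcars$mpg)
pp1$beta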

In the R environment, the function responsible for fitting projection pursuit regression is ppr(), which is again available in the standard R installation.

summary(ppr1 <- ppr(mpg ~ qsec + hp + wt + as.factor(cyl), nterms = 4, max.terms = 8, data = mtcars))
## Call:
## ppr(formula = mpg ~ qsec + hp + wt + as.factor(cyl), data = mtcars, 
##     nterms = 4, max.terms = 8)
## 
## Goodness of fit:
##    4 terms    5 terms    6 terms    7 terms    8 terms 
## 11.6587585  9.5717025  4.0001836  1.7366111  0.7129226 
## 
## Projection direction vectors:
##                 term 1       term 2       term 3       term 4      
## qsec             0.056717199  0.406561071 -0.150272127 -0.111513055
## hp              -0.007362921  0.015215429 -0.020306348  0.005327756
## wt              -0.941871773 -0.703708361  0.984375509  0.363879588
## as.factor(cyl)4  0.195263570 -0.480896694  0.081497653  0.896864284
## as.factor(cyl)6 -0.222662623  0.312422235  0.026055465  0.224692053
## as.factor(cyl)8 -0.147986503  0.101989434 -0.026268649 -0.016614364
## 
## Coefficients of ridge terms:
##   term 1   term 2   term 3   term 4 
## 6.193931 1.922542 1.797803 1.264184
par(mfrow = c(2,2))
plot(ppr1)

  • Any reasonable interpretation of this model?