Regression

Regression analysis is the statistical method you use when both the response variable and the explanatory variable are continuous variables. Perhaps the easiest way of knowing when regression is the appropriate analysis is to see that a scatterplot is the appropriate graphic. [The R Book, Crawley]

Linear Regression

The essence of regression analysis is using sample data to estimate parameter values and their standard errors. First, however, we need to select a model that describes the relationship between the response variable and the explanatory variable(s). The simplest of all is the linear model $y = a + bx$.

There are two variables and two parameters. The response variable is y, and x is a single continuous explanatory variable. The parameters are a and b: the intercept is a (the value of y when x = 0); and the slope is b (the change in y divided by the change in x which brought it about).

reg.data<-read.table("regression.txt",header=TRUE)
reg.data
##   growth tannin
## 1     12      0
## 2     10      1
## 3      8      2
## 4     11      3
## 5      6      4
## 6      7      5
## 7      2      6
## 8      3      7
## 9      3      8
attach(reg.data)

plot(tannin,growth,pch=16)
# To find the linear regression fit use 'lm'
# The response variable comes first (growth in our example), then the tilde ~,
# then the name of the continuous explanatory variable (tannin).
fit <- lm(growth~tannin)

fit$coefficients
## (Intercept)      tannin
##   11.755556   -1.216667

# draw the best fit line
abline(fit,col="red")

The values of the coefficients mean that the maximum-likelihood estimate of the equation minimizing the overall error is $\text{growth} = 11.756 - 1.217 \times \text{tannin}$, i.e., the line which best fits the given dataset in the sense that it is the most probable line among all the options. The difference between a measured value of y and the value predicted by the model for the same value of x is called a residual. Let's see the residuals:

predt <- function(fit, x) { # hand-made prediction function
  return (fit$coefficients[[1]] + x * fit$coefficients[[2]])
}

plot(tannin,growth,pch=16)
abline(fit,col="red")
segments(tannin,predt(fit,tannin),tannin,growth,col="red",lty=2)

# their specific values can also be accessed by this component:
fit$residuals
##          1          2          3          4          5          6
##  0.2444444 -0.5388889 -1.3222222  2.8944444 -0.8888889  1.3277778
##          7          8          9
## -2.4555556 -0.2388889  0.9777778

For these regression models there are several important assumptions:

• The variance in y is constant (i.e. the variance does not change as y gets bigger).
• The explanatory variable, x, is measured without error.
• Residuals are measured on the scale of y (i.e. parallel to the y axis).
• The residuals are normally distributed.

Under these assumptions, the maximum-likelihood estimates are given by the method of least squares, i.e., by minimizing the sum of the squared errors (the residuals).
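For a single predictor these least-squares estimates have a closed form: $b = \frac{\text{cov}(x,y)}{\text{var}(x)}$ and $a = \bar{y} - b\bar{x}$. As a quick check, we can recompute the tannin coefficients by hand (the data is entered inline so the snippet stands alone):

```r
growth <- c(12,10,8,11,6,7,2,3,3)       # same data as regression.txt above
tannin <- 0:8
b <- cov(tannin, growth) / var(tannin)  # slope = Sxy / Sxx
a <- mean(growth) - b * mean(tannin)    # the fitted line passes through the means
c(intercept = a, slope = b)             # matches coef(lm(growth ~ tannin))
```

This also makes the interpretation of the intercept concrete: it is whatever value forces the line through the point $(\bar{x}, \bar{y})$.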

Confidence Intervals

In statistics, a confidence interval (CI) is a measure of the reliability of an estimate. It is a type of interval estimate of a population parameter. It is an observed interval (i.e. it is calculated from the observations), in principle different from sample to sample, that frequently includes the parameter of interest if the experiment is repeated. How frequently the observed interval contains the parameter is determined by the confidence level. [Wikipedia]
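This definition can be checked by simulation: drawing many samples from a model with a known slope, roughly 95% of the computed 95% intervals should contain the true value. A minimal sketch (the model and its parameters here are made up for illustration):

```r
set.seed(42)
covered <- replicate(1000, {
  x <- runif(30)
  y <- 2 + 3 * x + rnorm(30, sd = 0.5)  # true slope is 3
  ci <- confint(lm(y ~ x))["x", ]       # 95% CI for the slope
  ci[1] <= 3 && 3 <= ci[2]              # did the interval catch the true slope?
})
mean(covered)  # proportion of intervals containing the true slope, close to 0.95
```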

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
pl <- iris$Petal.Length
sl <- iris$Sepal.Length
model <- lm(pl ~ sl)
model
##
## Call:
## lm(formula = pl ~ sl)
##
## Coefficients:
## (Intercept)           sl
##      -7.101        1.858
pred <- predict(model, data.frame(sl = sort(sl)), level = 0.95, interval = "confidence")
head(pred)
##         fit       lwr      upr
## 1 0.8898184 0.5928870 1.186750
## 2 1.0756617 0.7935781 1.357745
## 3 1.0756617 0.7935781 1.357745
## 4 1.0756617 0.7935781 1.357745
## 5 1.2615050 0.9940171 1.528993
## 6 1.4473483 1.1941605 1.700536
lower <- pred[,2]
upper <- pred[,3]

# make plot
plot(sl, pl, pch=19, type="n")
grid()
polygon(c(sort(sl), rev(sort(sl))), c(upper, rev(lower)), col = "gray", border = "gray")
points(sl, pl, pch=19, cex=0.6)
abline(model, col="red", lwd=2)
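Besides the interval around the fitted line, R also provides confidence intervals for the coefficients themselves through confint (the model is refitted here so the snippet stands alone):

```r
pl <- iris$Petal.Length
sl <- iris$Sepal.Length
model <- lm(pl ~ sl)
confint(model, level = 0.95)  # 95% intervals for (Intercept) and sl
```

The interval for sl should bracket the point estimate of 1.858 seen above.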

Multi-linear Regression

Most of the time, we have a dataset with more than one predictor $x_i$. One option is to apply simple linear regression to each predictor separately, but that treats each predictor as if the others did not exist, which is rarely realistic. A better way is to fit all of them together:

$y = \beta_0 + \beta_1 x_1 + \ldots + \beta_n x_n + \epsilon$

The function lm handles this directly:

marks.set <- read.csv(file="Advertising.csv", header=TRUE, sep=',')
head(marks.set)
##      TV Radio Newspaper Sales
## 1 230.1  37.8      69.2  22.1
## 2  44.5  39.3      45.1  10.4
## 3  17.2  45.9      69.3   9.3
## 4 151.5  41.3      58.5  18.5
## 5 180.8  10.8      58.4  12.9
## 6   8.7  48.9      75.0   7.2
fit <- lm(Sales ~ TV+Radio, data=marks.set) # fitting TV and Radio
summary(fit)
##
## Call:
## lm(formula = Sales ~ TV + Radio, data = marks.set)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.7977 -0.8752  0.2422  1.1708  2.8328
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.92110    0.29449   9.919   <2e-16 ***
## TV           0.04575    0.00139  32.909   <2e-16 ***
## Radio        0.18799    0.00804  23.382   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.681 on 197 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8962
## F-statistic: 859.6 on 2 and 197 DF,  p-value: < 2.2e-16
fit2 <- lm(Sales ~ ., data=marks.set) # all features are predictors
summary(fit2)
##
## Call:
## lm(formula = Sales ~ ., data = marks.set)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.8277 -0.8908  0.2418  1.1893  2.8292
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.938889   0.311908   9.422   <2e-16 ***
## TV           0.045765   0.001395  32.809   <2e-16 ***
## Radio        0.188530   0.008611  21.893   <2e-16 ***
## Newspaper   -0.001037   0.005871  -0.177     0.86
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16
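Under the hood, lm is solving the least-squares problem whose closed form is $\hat{\beta} = (X^T X)^{-1} X^T y$. A sketch on synthetic data (the variables x1, x2 and their coefficients are invented for illustration) confirming that the normal equations reproduce lm's coefficients:

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)   # known coefficients plus noise
X  <- cbind(1, x1, x2)                 # design matrix with an intercept column
beta <- solve(t(X) %*% X, t(X) %*% y)  # solve the normal equations directly
cbind(beta, coef(lm(y ~ x1 + x2)))     # the two columns agree
```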

Some predictors might have stronger correlations than others:

cor(marks.set)
##                   TV      Radio  Newspaper     Sales
## TV        1.00000000 0.05480866 0.05664787 0.7822244
## Radio     0.05480866 1.00000000 0.35410375 0.5762226
## Newspaper 0.05664787 0.35410375 1.00000000 0.2282990
## Sales     0.78222442 0.57622257 0.22829903 1.0000000

In multi-linear regression a predictor may fail to show an effect because it is estimated in the context of the other predictors. In this dataset, Newspaper is an example: it correlates with Sales on its own (0.23), but contributes nothing once TV and Radio are in the model.

We can also check for interactions, i.e., combined effects between predictors:

fit3 <- lm(Sales ~ TV*Radio, data=marks.set) # use TV, Radio and its interaction
summary(fit3)
##
## Call:
## lm(formula = Sales ~ TV * Radio, data = marks.set)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.3366 -0.4028  0.1831  0.5948  1.5246
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.750e+00  2.479e-01  27.233   <2e-16 ***
## TV          1.910e-02  1.504e-03  12.699   <2e-16 ***
## Radio       2.886e-02  8.905e-03   3.241   0.0014 **
## TV:Radio    1.086e-03  5.242e-05  20.727   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared:  0.9678, Adjusted R-squared:  0.9673
## F-statistic:  1963 on 3 and 196 DF,  p-value: < 2.2e-16

Notice that this is no longer a simple linear relationship.

We should check the correlation between the relevant predictors (here TV and Radio) to see whether they complement each other in the estimation (low correlation) or merely mirror the same effect (high correlation). In this case, since Radio and TV have low correlation, it seems that joint marketing on TV and Radio has some synergistic effect.

Another note: we should be careful when deciding to remove predictors because of their apparent lack of contribution. The hierarchy principle says the following:

If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.

Polynomial Regression

Instead of using a line, we can generalize to polynomials. Polynomials are extremely flexible, however, as the next example shows by approximating a sine wave (using just calculus).

Using the Maclaurin expansion, we know that sin(x) is given by the series

$x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + ...$

Let’s do this in R:

x<-seq(-pi,pi,0.01)
y<-sin(x)
plot(x,y,type="l",ylab="sin(x)",lwd=2)

a1<-x-x^3/factorial(3)
lines(x,a1,col="green")  # 1st approximation

a2<-x-x^3/factorial(3)+x^5/factorial(5)
lines(x,a2,col="red")    # 2nd approximation

a3<-x-x^3/factorial(3)+x^5/factorial(5)-x^7/factorial(7)
lines(x,a3,col="blue")   # 3rd approximation

This flexibility is both a strength and a problem. If we use high-degree polynomials we risk overfitting, which is something we wish to avoid.

poly<-read.table("diminish.txt",header=TRUE)
attach(poly)
head(poly)
##   xv yv
## 1  5 26
## 2 10 29
## 3 15 31
## 4 20 30
## 5 25 35
## 6 30 37
plot(xv,yv,pch=16)
model1<-lm(yv~xv)  # fitting a linear model
abline(model1,col="red")

summary(model1)
##
## Call:
## lm(formula = yv ~ xv)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.3584 -2.1206 -0.3218  1.8763  4.1782
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.61438    1.17315   23.54 7.66e-14 ***
## xv           0.22683    0.02168   10.46 1.45e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.386 on 16 degrees of freedom
## Multiple R-squared:  0.8725, Adjusted R-squared:  0.8646
## F-statistic: 109.5 on 1 and 16 DF,  p-value: 1.455e-08
model2<-lm(yv~xv+I(xv^2)) # Fitting a quadratic polynomial
x<-0:90
y<-predict(model2,list(xv=x))
plot(xv,yv,pch=16)
lines(x,y,col="red")

model3<-lm(yv~xv+I(xv^2)+I(xv^3)+I(xv^4)+I(xv^5)) # Fitting a quintic polynomial
x<-0:90
y<-predict(model3,list(xv=x))
plot(xv,yv,pch=16)
lines(x,y,col="red")

Another R tool is the function poly() (example from Conway's "Machine Learning for Hackers", chapter 6):

library(ggplot2)
set.seed(1)
x <- seq(0, 1, by = 0.01)
y <- sin(2 * pi * x) + rnorm(length(x), 0, 0.1)
df <- data.frame(X = x, Y = y)
ggplot(df, aes(x = X, y = Y)) +
geom_point()

# using a linear regression:
model <- lm(Y ~ X, data = df)
summary(model)
##
## Call:
## lm(formula = Y ~ X, data = df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.00376 -0.41253 -0.00409  0.40664  0.85874
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.94111    0.09057   10.39   <2e-16 ***
## X           -1.86189    0.15648  -11.90   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4585 on 99 degrees of freedom
## Multiple R-squared:  0.5885, Adjusted R-squared:  0.5843
## F-statistic: 141.6 on 1 and 99 DF,  p-value: < 2.2e-16
# it's possible to explain 58.9% of the variance
ggplot(data.frame(X = x, Y = y), aes(x = X, y = Y)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE)

# here's the polynomial regression with poly()
model2 <- lm(Y ~ poly(X, degree = 3), data = df)
summary(model2)
##
## Call:
## lm(formula = Y ~ poly(X, degree = 3), data = df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.32331 -0.08538  0.00652  0.08320  0.20239
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)           0.01017    0.01148   0.886    0.378
## poly(X, degree = 3)1 -5.45536    0.11533 -47.302   <2e-16 ***
## poly(X, degree = 3)2 -0.03939    0.11533  -0.342    0.733
## poly(X, degree = 3)3  4.41805    0.11533  38.308   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1153 on 97 degrees of freedom
## Multiple R-squared:  0.9745, Adjusted R-squared:  0.9737
## F-statistic:  1235 on 3 and 97 DF,  p-value: < 2.2e-16
df <- transform(df, PredictedY = predict(model2))
ggplot(df, aes(x = X, y = PredictedY)) +
geom_point() +
geom_line()

# if we exaggerate the polynomial degree, things get overfitted:
model3 <- lm(Y ~ poly(X, degree = 25), data = df)
summary(model3)
##
## Call:
## lm(formula = Y ~ poly(X, degree = 25), data = df)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.211435 -0.040653  0.005966  0.040225  0.221436
##
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)             1.017e-02  9.114e-03   1.116   0.2682
## poly(X, degree = 25)1  -5.455e+00  9.159e-02 -59.561  < 2e-16 ***
## poly(X, degree = 25)2  -3.939e-02  9.159e-02  -0.430   0.6684
## poly(X, degree = 25)3   4.418e+00  9.159e-02  48.236  < 2e-16 ***
## poly(X, degree = 25)4  -4.797e-02  9.159e-02  -0.524   0.6020
## poly(X, degree = 25)5  -7.065e-01  9.159e-02  -7.713 4.19e-11 ***
## poly(X, degree = 25)6  -2.042e-01  9.159e-02  -2.230   0.0288 *
## poly(X, degree = 25)7  -5.134e-02  9.159e-02  -0.561   0.5768
## poly(X, degree = 25)8  -3.100e-02  9.159e-02  -0.338   0.7360
## poly(X, degree = 25)9   7.723e-02  9.159e-02   0.843   0.4018
## poly(X, degree = 25)10  4.809e-02  9.159e-02   0.525   0.6011
## poly(X, degree = 25)11  1.300e-01  9.159e-02   1.419   0.1600
## poly(X, degree = 25)12  2.473e-02  9.159e-02   0.270   0.7879
## poly(X, degree = 25)13  2.371e-02  9.159e-02   0.259   0.7965
## poly(X, degree = 25)14  8.791e-02  9.159e-02   0.960   0.3403
## poly(X, degree = 25)15 -1.346e-02  9.159e-02  -0.147   0.8836
## poly(X, degree = 25)16 -2.613e-05  9.159e-02   0.000   0.9998
## poly(X, degree = 25)17  3.501e-02  9.159e-02   0.382   0.7034
## poly(X, degree = 25)18 -1.372e-01  9.159e-02  -1.498   0.1384
## poly(X, degree = 25)19 -6.913e-02  9.159e-02  -0.755   0.4528
## poly(X, degree = 25)20 -6.793e-02  9.159e-02  -0.742   0.4606
## poly(X, degree = 25)21 -1.093e-01  9.159e-02  -1.193   0.2365
## poly(X, degree = 25)22  1.367e-01  9.159e-02   1.493   0.1397
## poly(X, degree = 25)23 -3.921e-02  9.159e-02  -0.428   0.6698
## poly(X, degree = 25)24 -5.032e-02  9.159e-02  -0.549   0.5844
## poly(X, degree = 25)25  1.262e-01  9.159e-02   1.378   0.1722
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09159 on 75 degrees of freedom
## Multiple R-squared:  0.9876, Adjusted R-squared:  0.9834
## F-statistic: 238.1 on 25 and 75 DF,  p-value: < 2.2e-16
df <- transform(df, PredictedY = predict(model3))
ggplot(df, aes(x = X, y = PredictedY)) +
geom_point() +
geom_line()
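A standard guard against this kind of overfitting is to compare prediction error on data the model has not seen: the high-degree fit wins on the training points but tends to lose out of sample. A minimal sketch using a random 50/50 split (the split itself is an arbitrary choice):

```r
set.seed(1)
x <- seq(0, 1, by = 0.01)
y <- sin(2 * pi * x) + rnorm(length(x), 0, 0.1)
df <- data.frame(X = x, Y = y)
idx   <- sample(nrow(df), nrow(df) / 2)  # random half for training
train <- df[idx, ]
test  <- df[-idx, ]
rmse <- sapply(c(1, 3, 25), function(d) {
  m <- lm(Y ~ poly(X, degree = d), data = train)
  sqrt(mean((test$Y - predict(m, test))^2))  # out-of-sample RMSE
})
names(rmse) <- c("degree 1", "degree 3", "degree 25")
rmse  # the linear model underfits badly; the cubic does well out of sample
```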

Fitting a mechanistic model to the available data

Rather than fitting some arbitrary model for curvature (as above, with a quadratic term for inputs), we sometimes have a mechanistic model relating the value of the response variable to the explanatory variable (e.g. a mathematical model of a physical process). In the following example we are interested in the decay of organic material in soil, and our mechanistic model is based on the assumption that the fraction of dry matter lost per year is a constant. This leads to a two-parameter model of exponential decay in which the amount of material remaining (y) is a function of time (t) $y = y_0 e^{-bt}$

Taking logs of both sides, $\log(y) = \log(y_0) - bt$, we can estimate the parameter of interest, $b$, as the slope of a linear regression of $\log(y)$ on $t$ (i.e. we log-transform the $y$ axis but not the $x$ axis), and the value of $y_0$ as the antilog of the intercept.

# get some data (y,t)
head(data)
##   time    amount
## 1    0 125.00000
## 2    1 100.24886
## 3    2  70.00000
## 4    3  83.47080
## 5    4 100.00000
## 6    5  65.90787
attach(data)
plot(time,amount,pch=16)

# if we apply log to both side of the equation:
# log(y) = log(y0) - bt
# which gives us a linear model, that can be fitted to the data:

model<-lm( log(amount) ~ time )
summary(model)
##
## Call:
## lm(formula = log(amount) ~ time)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.5935 -0.2043  0.0067  0.2198  0.6297
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  4.547386   0.100295   45.34  < 2e-16 ***
## time        -0.068528   0.005743  -11.93 1.04e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.286 on 29 degrees of freedom
## Multiple R-squared:  0.8308, Adjusted R-squared:  0.825
## F-statistic: 142.4 on 1 and 29 DF,  p-value: 1.038e-12
# in this case, the slope b = -0.068 and the intercept is 4.547 = log(y0) <=> y0 = 94.38
# which, without the errors, turns out to be: y = 94.38 exp(-0.068t)

# let's include it in the plot:
xv<-seq(0,30,0.2)
yv<-exp(predict(model,list(time=xv)))
lines(xv,yv,col="red")
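The log-transform approach assumes multiplicative errors (log-normal noise on the original scale). When errors are additive, the two-parameter model can instead be fitted directly with nls, which minimizes the squared residuals of $y = y_0 e^{-bt}$ itself. A sketch on simulated decay data (the true values $y_0 = 95$, $b = 0.07$ are made up):

```r
set.seed(7)
time   <- 0:30
amount <- 95 * exp(-0.07 * time) + rnorm(31, sd = 3)  # simulated decay data
model.nls <- nls(amount ~ y0 * exp(-b * time),
                 start = list(y0 = 100, b = 0.05))    # rough starting guesses
coef(model.nls)  # estimates should be close to y0 = 95 and b = 0.07
```

Unlike lm on log(amount), nls needs starting values; for exponential decay, the value at t = 0 and a crude half-life guess usually suffice.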

Linear Regression after Transformation (Power Laws)

Many mathematical functions that are non-linear in their parameters can be linearized by transformation. The most frequent transformations (in order of frequency of use) are logarithms, antilogs and reciprocals. Here is an example of linear regression associated with a power law: $y = a x^b$ with two parameters, where the parameter $a$ describes the slope of the function for low values of $x$ and $b$ is the shape parameter.

power<-read.table("power.txt",header=TRUE)
attach(power)
head(power)
##       area response
## 1 2.321550 2.367588
## 2 2.525543 2.881607
## 3 2.627449 2.641165
## 4 2.195558 2.592812
## 5 1.088055 2.159380
## 6 2.044432 2.365786
par(mfrow=c(1,2))
plot(area,response,pch=16)
model1 <- lm(response~area)
abline(model1)
plot(log(area),log(response),pch=16)
model2 <- lm(log(response)~log(area))
abline(model2)

The two plots look very similar in this case (they don’t always), but we need to compare the two models.

par(mfrow=c(1,1))
plot(area,response)
abline(lm(response~area))    # model1 fit
xv<-seq(1,2.7,0.01)
yv<-exp(0.75378)*xv^0.24818  # model2 fit: intercept&slope from model2
lines(xv,yv)

They seem quite close, but if we extend the axes we see how differently the two models behave away from the data:

xv<-seq(0,5,0.01)
yv<-exp(predict(model2,list(area=xv))) # same as yv<-exp(0.75378)*xv^0.24818
plot(area,response,xlim=c(0,5),ylim=c(0,4),pch=16)
abline(model1, col="red")
lines(xv,yv, col="blue")

Both models are OK for interpolation (predicting within the range of known data), but offer quite different responses for extrapolation (predicting outside the range of known data).

Another example: data taken from the inclined plane experiment made with Sofia (Feb 2, 2013).

height <- seq(4,32,4)
time <- c(144,115,103,88,76,72,60,54)
plot(time,height,xlim=c(20,180),ylim=c(0,60),pch=16) # wider

model<-lm(time~height+I(height^2)) # Fitting a quadratic polynomial
x<-0:60
y<-predict(model,list(height=x))
lines(y,x,col="blue",lwd=2.5)

model2 <- lm(log(time)~log(height)) # Fitting a power law
x<-0:60
y<-exp(predict(model2,list(height=x)))
lines(y,x,col="red",lwd=2.5)

model3 <- lm(time~log(height)) # Fitting a (semi?) power law (seems the best fit
x<-0:60                        # for interpolation, all are bad for extrapolation)
y<-predict(model3,list(height=x))
lines(y,x,col="green",lwd=2.5)

legend("topright",
c("quadratic poly","power law","semi power law"),
lty=c(1,1,1), # gives the legend appropriate symbols (lines)
lwd=c(2.5,2.5,2.5),col=c("blue","red","green"))

Before moving on, let’s just make a list of some possible symbols that can be used in the regression formulas.

• ‘+’ means include this variable (eg: y ~ x1 + x2)
• ‘.’ means all variables (y ~ .)
• ‘-’ means delete this variable (y ~ . - x2)
• ‘*’ means include both variables plus their interaction (y ~ x1*x2)
• ‘:’ means just the interaction between the variables (y ~ x1:x2)
• ‘|’ means conditioning: y ~ x | z includes x given z
• ‘^n’ means include the variables plus all interactions up to n-way ( y ~ (x1+x2+x3)^3 is equal to y ~ x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3 + x1:x2:x3 )
• ‘I’ means as is, includes a new variable given the expression ( y ~ I(x^2) )
• ‘1’ means the intercept, -1 deletes the intercept (regress thru the origin)
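A quick way to see exactly how a formula expands is to inspect its terms (no data is needed; the formula is purely symbolic):

```r
# main effects plus all two-way interactions of x1, x2, x3
attr(terms(y ~ (x1 + x2 + x3)^2), "term.labels")
## [1] "x1"    "x2"    "x3"    "x1:x2" "x1:x3" "x2:x3"
```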

The nature of the variables (binary, categorical factors, or numerical) will determine the nature of the analysis. For example, if "u" and "v" are factors, $y \sim u + v$ dictates an analysis of variance (without the interaction term). If "u" and "v" are numerical, the same formula would dictate a multiple regression. If "u" is numerical and "v" is a factor, then an analysis of covariance is dictated.

Local regression

Instead of using the entire dataset to define a single fit, local regression (LOESS) fits a low-degree polynomial to a subset of the data at each point, using a nearest-neighbours rule to decide which observations enter each local fit. The polynomial is fitted using weighted least squares, giving more weight to points near the point whose response is being estimated and less weight to points further away.
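In R this is implemented by the loess function: span controls the fraction of neighbours used at each point, and degree the order of the local polynomial. A short sketch on the built-in cars dataset:

```r
fit.lo <- loess(dist ~ speed, data = cars, span = 0.75, degree = 2)
plot(cars$speed, cars$dist, pch = 16, xlab = "speed", ylab = "dist")
sx <- sort(unique(cars$speed))
lines(sx, predict(fit.lo, data.frame(speed = sx)), col = "red", lwd = 2)
```

Decreasing span makes the curve follow the data more closely, with the same overfitting risk seen for high-degree polynomials above.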

Here’s an animated gif (from SimplyStatistics.org) showing the process: