Regression

Regression analysis is the statistical method you use when both the response variable and the explanatory variable are continuous variables. Perhaps the easiest way of knowing when regression is the appropriate analysis is to see that a scatterplot is the appropriate graphic. [The R Book, Crawley]

Linear Regression

The essence of regression analysis is using sample data to estimate parameter values and their standard errors. First, however, we need to select a model which describes the relationship between the response variable and the explanatory variable(s). The simplest of all is the linear model \[y = a + bx\]

There are two variables and two parameters. The response variable is y, and x is a single continuous explanatory variable. The parameters are a and b: the intercept is a (the value of y when x = 0); and the slope is b (the change in y divided by the change in x which brought it about).

reg.data<-read.table("regression.txt",header=TRUE)
reg.data
##   growth tannin
## 1     12      0
## 2     10      1
## 3      8      2
## 4     11      3
## 5      6      4
## 6      7      5
## 7      2      6
## 8      3      7
## 9      3      8
attach(reg.data)

plot(tannin,growth,pch=16)
# To find the linear regression fit use 'lm'
# The response variable comes first (growth in our example), then the tilde ~, 
# then the name of the continuous explanatory variable (tannin).
fit <- lm(growth~tannin)

fit$coefficients
## (Intercept)      tannin 
##   11.755556   -1.216667
# draw the best fit line
abline(fit,col="red")

The values of the coefficients mean that the maximum likelihood estimate of the equation that minimizes the overall error is \[\text{growth} = 11.756 - 1.217 \times \text{tannin}\] i.e., the line which best fits the given dataset, in the sense that it is the most probable line among all options.

The difference between a measured value of y and the value predicted by the model for the same value of x is called a residual.

Let’s see the residuals:

predt <- function(fit, x) {  # hand-made prediction function
  return (fit$coefficients[[1]] + x * fit$coefficients[[2]])
}

plot(tannin,growth,pch=16)
abline(fit,col="red")
segments(tannin,predt(fit,tannin),tannin,growth,col="red",lty=2)

# their specific values can also be accessed by this component:
fit$residuals
##          1          2          3          4          5          6 
##  0.2444444 -0.5388889 -1.3222222  2.8944444 -0.8888889  1.3277778 
##          7          8          9 
## -2.4555556 -0.2388889  0.9777778

For these regression models there are several important assumptions:

  • The variance in y is constant (i.e. the variance does not change as y gets bigger).
  • The explanatory variable, x, is measured without error.
  • Residuals are measured on the scale of y (i.e. parallel to the y axis).
  • The residuals are normally distributed.

Under these assumptions, the maximum likelihood estimates are given by the method of least squares, ie, by minimizing the sum of the squared errors (the residuals).
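
For this single-predictor case the least-squares estimates have a closed form, so we can verify lm's coefficients by hand (a quick sketch, reusing the tannin and growth data attached above):

b.hat <- sum((tannin - mean(tannin)) * (growth - mean(growth))) / sum((tannin - mean(tannin))^2)
a.hat <- mean(growth) - b.hat * mean(tannin)
c(intercept = a.hat, slope = b.hat)   # should match fit$coefficients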

Confidence Intervals

In statistics, a confidence interval (CI) is a measure of the reliability of an estimate. It is a type of interval estimate of a population parameter. It is an observed interval (i.e. it is calculated from the observations), in principle different from sample to sample, that frequently includes the parameter of interest if the experiment is repeated. How frequently the observed interval contains the parameter is determined by the confidence level. [wikipedia]

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
pl <- iris$Petal.Length
sl <- iris$Sepal.Length
model <- lm(pl ~ sl)
model
## 
## Call:
## lm(formula = pl ~ sl)
## 
## Coefficients:
## (Intercept)           sl  
##      -7.101        1.858
pred <- predict(model, data.frame(sl = sort(sl)), level = 0.95, interval = "confidence")
head(pred)
##         fit       lwr      upr
## 1 0.8898184 0.5928870 1.186750
## 2 1.0756617 0.7935781 1.357745
## 3 1.0756617 0.7935781 1.357745
## 4 1.0756617 0.7935781 1.357745
## 5 1.2615050 0.9940171 1.528993
## 6 1.4473483 1.1941605 1.700536
lower <- pred[,2]
upper <- pred[,3]

# make plot 
plot(sl, pl, pch=19, type="n")
grid()
polygon(c(sort(sl), rev(sort(sl))), c(upper, rev(lower)), col = "gray", border = "gray")
points(sl, pl, pch=19, cex=0.6)
abline(model, col="red", lwd=2)

Multi-linear Regression

Most of the time we have a dataset with more than one predictor \(x_i\). One option is to apply simple linear regression to each predictor separately, but this treats each predictor as if the others did not exist, which is rarely realistic. A better way is to fit all of them together:

\[y = \beta_0 + \beta_1 x_1 + \ldots + \beta_n x_n + \epsilon\]

The function lm is able to do this:

marks.set <- read.csv(file="Advertising.csv", head=TRUE, sep=',')
head(marks.set)
##      TV Radio Newspaper Sales
## 1 230.1  37.8      69.2  22.1
## 2  44.5  39.3      45.1  10.4
## 3  17.2  45.9      69.3   9.3
## 4 151.5  41.3      58.5  18.5
## 5 180.8  10.8      58.4  12.9
## 6   8.7  48.9      75.0   7.2
fit <- lm(Sales ~ TV+Radio, data=marks.set) # fitting TV and Radio
summary(fit)
## 
## Call:
## lm(formula = Sales ~ TV + Radio, data = marks.set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7977 -0.8752  0.2422  1.1708  2.8328 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.92110    0.29449   9.919   <2e-16 ***
## TV           0.04575    0.00139  32.909   <2e-16 ***
## Radio        0.18799    0.00804  23.382   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.681 on 197 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8962 
## F-statistic: 859.6 on 2 and 197 DF,  p-value: < 2.2e-16
fit2 <- lm(Sales ~ ., data=marks.set) # all features are predictors
summary(fit2)
## 
## Call:
## lm(formula = Sales ~ ., data = marks.set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8277 -0.8908  0.2418  1.1893  2.8292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.938889   0.311908   9.422   <2e-16 ***
## TV           0.045765   0.001395  32.809   <2e-16 ***
## Radio        0.188530   0.008611  21.893   <2e-16 ***
## Newspaper   -0.001037   0.005871  -0.177     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956 
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16

Some predictors are more strongly correlated with the response than others:

cor(marks.set)
##                   TV      Radio  Newspaper     Sales
## TV        1.00000000 0.05480866 0.05664787 0.7822244
## Radio     0.05480866 1.00000000 0.35410375 0.5762226
## Newspaper 0.05664787 0.35410375 1.00000000 0.2282990
## Sales     0.78222442 0.57622257 0.22829903 1.0000000

In multi-linear regression a coefficient may appear to have no effect because it is estimated in the context of the other coefficients. In this dataset, Newspaper seems to be such an example.

We can also check for interactions, ie, include the product of predictors:

fit3 <- lm(Sales ~ TV*Radio, data=marks.set) # use TV, Radio and its interaction
summary(fit3)
## 
## Call:
## lm(formula = Sales ~ TV * Radio, data = marks.set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3366 -0.4028  0.1831  0.5948  1.5246 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.750e+00  2.479e-01  27.233   <2e-16 ***
## TV          1.910e-02  1.504e-03  12.699   <2e-16 ***
## Radio       2.886e-02  8.905e-03   3.241   0.0014 ** 
## TV:Radio    1.086e-03  5.242e-05  20.727   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared:  0.9678, Adjusted R-squared:  0.9673 
## F-statistic:  1963 on 3 and 196 DF,  p-value: < 2.2e-16

Notice that this is no longer a simple linear relationship.

We should check the correlation between the relevant predictors (herein TV and Radio) to see if they complement each other in the estimation (low correlation) or just mirror the same effect (high correlation). In this case, since Radio and TV have low correlation, it seems that joint marketing on TV and Radio has some synergistic effect.

Another note: we should be careful when deciding to remove predictors because of their perceived contributions. The hierarchy principle says the following:

If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.
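
A minimal sketch of what the principle warns against, reusing the marks.set data from above (the interaction-only model is shown purely for contrast, not as a recommended fit):

fit.int.only <- lm(Sales ~ TV:Radio, data=marks.set) # interaction term without its main effects (discouraged)
fit.full     <- lm(Sales ~ TV*Radio, data=marks.set) # main effects plus interaction (preferred)
AIC(fit.int.only, fit.full)                          # compare the two fits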

Polynomial Regression

Instead of using a line, we can generalize to polynomials. However, polynomials are extremely flexible, as the next example of approximating a sine wave (using just calculus) shows:

Using the Maclaurin expansion, we know that sin(x) is given by the expression

\[x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + ...\]

Let’s do this in R:

x<-seq(-pi,pi,0.01)
y<-sin(x)
plot(x,y,type="l",ylab="sin(x)",lwd=2)

a1<-x-x^3/factorial(3)
lines(x,a1,col="green")  # 1st approximation

a2<-x-x^3/factorial(3)+x^5/factorial(5)
lines(x,a2,col="red")    # 2nd approximation

a3<-x-x^3/factorial(3)+x^5/factorial(5)-x^7/factorial(7)
lines(x,a3,col="blue")   # 3rd approximation

This is both a strength and a problem. If we use high-degree polynomials we risk overfitting, which is something we wish to avoid.

poly<-read.table("diminish.txt",header=TRUE)
attach(poly)
head(poly)
##   xv yv
## 1  5 26
## 2 10 29
## 3 15 31
## 4 20 30
## 5 25 35
## 6 30 37
plot(xv,yv,pch=16)
model1<-lm(yv~xv)  # fitting a linear model
abline(model1,col="red")

summary(model1)
## 
## Call:
## lm(formula = yv ~ xv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3584 -2.1206 -0.3218  1.8763  4.1782 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.61438    1.17315   23.54 7.66e-14 ***
## xv           0.22683    0.02168   10.46 1.45e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.386 on 16 degrees of freedom
## Multiple R-squared:  0.8725, Adjusted R-squared:  0.8646 
## F-statistic: 109.5 on 1 and 16 DF,  p-value: 1.455e-08
model2<-lm(yv~xv+I(xv^2)) # Fitting a quadratic polynomial
x<-0:90
y<-predict(model2,list(xv=x))
plot(xv,yv,pch=16)
lines(x,y,col="red")

model3<-lm(yv~xv+I(xv^2)+I(xv^3)+I(xv^4)+I(xv^5)) # Fitting a quintic polynomial
x<-0:90
y<-predict(model3,list(xv=x))
plot(xv,yv,pch=16)
lines(x,y,col="red")

Another R tool is the function poly() (eg from Conway’s “Machine Learning for Hackers”, chapter 6):

library(ggplot2)
set.seed(1)
x <- seq(0, 1, by = 0.01)
y <- sin(2 * pi * x) + rnorm(length(x), 0, 0.1)
df <- data.frame(X = x, Y = y)
ggplot(df, aes(x = X, y = Y)) + 
  geom_point()

# using a linear regression:
model <- lm(Y ~ X, data = df)
summary(model)
## 
## Call:
## lm(formula = Y ~ X, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.00376 -0.41253 -0.00409  0.40664  0.85874 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.94111    0.09057   10.39   <2e-16 ***
## X           -1.86189    0.15648  -11.90   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4585 on 99 degrees of freedom
## Multiple R-squared:  0.5885, Adjusted R-squared:  0.5843 
## F-statistic: 141.6 on 1 and 99 DF,  p-value: < 2.2e-16
# it's possible to explain 58.9% of the variance
ggplot(data.frame(X = x, Y = y), aes(x = X, y = Y)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE)

# here's the polynomial regression with poly()
model2 <- lm(Y ~ poly(X, degree = 3), data = df)
summary(model2)
## 
## Call:
## lm(formula = Y ~ poly(X, degree = 3), data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32331 -0.08538  0.00652  0.08320  0.20239 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.01017    0.01148   0.886    0.378    
## poly(X, degree = 3)1 -5.45536    0.11533 -47.302   <2e-16 ***
## poly(X, degree = 3)2 -0.03939    0.11533  -0.342    0.733    
## poly(X, degree = 3)3  4.41805    0.11533  38.308   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1153 on 97 degrees of freedom
## Multiple R-squared:  0.9745, Adjusted R-squared:  0.9737 
## F-statistic:  1235 on 3 and 97 DF,  p-value: < 2.2e-16
df <- transform(df, PredictedY = predict(model2))
ggplot(df, aes(x = X, y = PredictedY)) +
  geom_point() +
  geom_line()

# if we exaggerate the polynomial degree, things get overfitted:
model3 <- lm(Y ~ poly(X, degree = 25), data = df)
summary(model3)
## 
## Call:
## lm(formula = Y ~ poly(X, degree = 25), data = df)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.211435 -0.040653  0.005966  0.040225  0.221436 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.017e-02  9.114e-03   1.116   0.2682    
## poly(X, degree = 25)1  -5.455e+00  9.159e-02 -59.561  < 2e-16 ***
## poly(X, degree = 25)2  -3.939e-02  9.159e-02  -0.430   0.6684    
## poly(X, degree = 25)3   4.418e+00  9.159e-02  48.236  < 2e-16 ***
## poly(X, degree = 25)4  -4.797e-02  9.159e-02  -0.524   0.6020    
## poly(X, degree = 25)5  -7.065e-01  9.159e-02  -7.713 4.19e-11 ***
## poly(X, degree = 25)6  -2.042e-01  9.159e-02  -2.230   0.0288 *  
## poly(X, degree = 25)7  -5.134e-02  9.159e-02  -0.561   0.5768    
## poly(X, degree = 25)8  -3.100e-02  9.159e-02  -0.338   0.7360    
## poly(X, degree = 25)9   7.723e-02  9.159e-02   0.843   0.4018    
## poly(X, degree = 25)10  4.809e-02  9.159e-02   0.525   0.6011    
## poly(X, degree = 25)11  1.300e-01  9.159e-02   1.419   0.1600    
## poly(X, degree = 25)12  2.473e-02  9.159e-02   0.270   0.7879    
## poly(X, degree = 25)13  2.371e-02  9.159e-02   0.259   0.7965    
## poly(X, degree = 25)14  8.791e-02  9.159e-02   0.960   0.3403    
## poly(X, degree = 25)15 -1.346e-02  9.159e-02  -0.147   0.8836    
## poly(X, degree = 25)16 -2.613e-05  9.159e-02   0.000   0.9998    
## poly(X, degree = 25)17  3.501e-02  9.159e-02   0.382   0.7034    
## poly(X, degree = 25)18 -1.372e-01  9.159e-02  -1.498   0.1384    
## poly(X, degree = 25)19 -6.913e-02  9.159e-02  -0.755   0.4528    
## poly(X, degree = 25)20 -6.793e-02  9.159e-02  -0.742   0.4606    
## poly(X, degree = 25)21 -1.093e-01  9.159e-02  -1.193   0.2365    
## poly(X, degree = 25)22  1.367e-01  9.159e-02   1.493   0.1397    
## poly(X, degree = 25)23 -3.921e-02  9.159e-02  -0.428   0.6698    
## poly(X, degree = 25)24 -5.032e-02  9.159e-02  -0.549   0.5844    
## poly(X, degree = 25)25  1.262e-01  9.159e-02   1.378   0.1722    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09159 on 75 degrees of freedom
## Multiple R-squared:  0.9876, Adjusted R-squared:  0.9834 
## F-statistic: 238.1 on 25 and 75 DF,  p-value: < 2.2e-16
df <- transform(df, PredictedY = predict(model3))
ggplot(df, aes(x = X, y = PredictedY)) +
  geom_point() +
  geom_line()

Fitting a mechanistic model to the available data

Rather than fitting some arbitrary model for curvature (as above, with a quadratic term for inputs), we sometimes have a mechanistic model relating the value of the response variable to the explanatory variable (e.g. a mathematical model of a physical process). In the following example we are interested in the decay of organic material in soil, and our mechanistic model is based on the assumption that the fraction of dry matter lost per year is a constant. This leads to a two-parameter model of exponential decay in which the amount of material remaining (y) is a function of time (t) \[y = y_0 e^{-bt}\]

Taking logs on both sides: \[log(y) = log(y_0) - bt\] we can estimate the parameter of interest, \(b\), as the slope of a linear regression of \(log(y)\) on \(t\) (i.e. we log-transform the \(y\) axis but not the \(x\) axis) and the value of \(y_0\) as the antilog of the intercept.

# get some data (y,t)
data<-read.table("Decay.txt",header=TRUE)
head(data)
##   time    amount
## 1    0 125.00000
## 2    1 100.24886
## 3    2  70.00000
## 4    3  83.47080
## 5    4 100.00000
## 6    5  65.90787
attach(data)
plot(time,amount,pch=16)

# if we apply log to both side of the equation:
# log(y) = log(y0) - bt
# which gives us a linear model, that can be fitted to the data:

model<-lm( log(amount) ~ time )
summary(model)
## 
## Call:
## lm(formula = log(amount) ~ time)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5935 -0.2043  0.0067  0.2198  0.6297 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.547386   0.100295   45.34  < 2e-16 ***
## time        -0.068528   0.005743  -11.93 1.04e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.286 on 29 degrees of freedom
## Multiple R-squared:  0.8308, Adjusted R-squared:  0.825 
## F-statistic: 142.4 on 1 and 29 DF,  p-value: 1.038e-12
# in this case, the slope b = -0.068 and the intercept is 4.547 = log(y0) <=> y0 = 94.38
# which, without the errors, turns out to be: y = 94.38 exp(-0.068t)

# let's include it in the plot:
xv<-seq(0,30,0.2)
yv<-exp(predict(model,list(time=xv)))
lines(xv,yv,col="red")

Linear Regression after Transformation (Power Laws)

Many mathematical functions that are non-linear in their parameters can be linearized by transformation. The most frequent transformations (in order of frequency of use), are logarithms, antilogs and reciprocals. Here is an example of linear regression associated with a power law: \[y = a x^b\] with two parameters, where the parameter \(a\) describes the slope of the function for low values of \(x\) and \(b\) is the shape parameter.

power<-read.table("power.txt",header=TRUE)
attach(power)
head(power)
##       area response
## 1 2.321550 2.367588
## 2 2.525543 2.881607
## 3 2.627449 2.641165
## 4 2.195558 2.592812
## 5 1.088055 2.159380
## 6 2.044432 2.365786
par(mfrow=c(1,2))
plot(area,response,pch=16)
model1 <- lm(response~area)
abline(model1)
plot(log(area),log(response),pch=16)
model2 <- lm(log(response)~log(area))
abline(model2)

The two plots look very similar in this case (they don’t always), but we need to compare the two models.

par(mfrow=c(1,1))
plot(area,response)
abline(lm(response~area))    # model1 fit
xv<-seq(1,2.7,0.01)
yv<-exp(0.75378)*xv^0.24818  # model2 fit: intercept&slope from model2
lines(xv,yv)

They seem quite close, but if we extend the axes we see how differently the two models behave away from the data:

xv<-seq(0,5,0.01)
yv<-exp(predict(model2,list(area=xv))) # same as yv<-exp(0.75378)*xv^0.24818
plot(area,response,xlim=c(0,5),ylim=c(0,4),pch=16)
abline(model1, col="red")
lines(xv,yv, col="blue")

Both models are fine for interpolation (predicting within the range of the known data), but give quite different responses for extrapolation (predicting outside the range of the known data).
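
A quick numeric check of this claim, reusing model1 and model2 from above (the chosen area values are just illustrative): the two fits give similar predictions inside the observed range but diverge outside it.

areas.new <- data.frame(area=c(2, 5))   # 2 lies inside the observed range, 5 well outside it
predict(model1, areas.new)              # linear fit
exp(predict(model2, areas.new))         # power-law fit, back-transformed from the log scale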

Another eg: data taken from the inclined plane experiment made with Sofia (Feb 2, 2013)

height <- seq(4,32,4)
time <- c(144,115,103,88,76,72,60,54)
plot(time,height,xlim=c(20,180),ylim=c(0,60),pch=16) # wider axis limits to compare the fits

model<-lm(time~height+I(height^2)) # Fitting a quadratic polynomial
x<-0:60
y<-predict(model,list(height=x))
lines(y,x,col="blue",lwd=2.5)

model2 <- lm(log(time)~log(height)) # Fitting a power law
x<-0:60
y<-exp(predict(model2,list(height=x)))
lines(y,x,col="red",lwd=2.5)

model3 <- lm(time~log(height)) # Fitting a (semi?) power law (seems the best fit 
x<-0:60                        # for interpolation, all are bad for extrapolation)
y<-predict(model3,list(height=x))
lines(y,x,col="green",lwd=2.5)

legend("topright",  
       c("quadratic poly","power law","semi power law"),
       lty=c(1,1,1), # gives the legend appropriate symbols (lines)
       lwd=c(2.5,2.5,2.5),col=c("blue","red","green"))

Before moving on, let’s just make a list of some possible symbols that can be used in the regression formulas.

  • ‘+’ means include this variable (eg: y ~ x1 + x2)
  • ‘.’ means all variables (y ~ .)
  • ‘-’ means delete this variable (y ~ . - x2)
  • ‘*’ means include both variables plus their interaction (y ~ x1*x2)
  • ‘:’ means just the interaction between the variables (y ~ x1:x2)
  • ‘|’ means conditioning, y ~ x|z include x given z
  • ‘^n’ means include the variables plus all the interactions up to n ways ( y ~ (x1+x2+x3)^3 is equal to y ~ x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3 + x1:x2:x3)
  • ‘I’ means as is, includes a new variable given the expression ( y ~ I(x^2) )
  • ‘1’ means the intercept, -1 deletes the intercept (regress thru the origin)

The nature of the variables – binary, categorical (factors), numerical – will determine the nature of the analysis. For example, if “u” and “v” are factors, \[y \sim u + v\] dictates an analysis of variance (without the interaction term). If “u” and “v” are numerical, the same formula would dictate a multiple regression. If “u” is numerical and “v” is a factor, then an analysis of covariance is dictated.
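
A small sketch of this idea with made-up data (the names u.num, u.fac and v.num are hypothetical): the same formula syntax gives a multiple regression or an analysis of covariance depending on the variable types.

set.seed(1)
u.num <- runif(30); v.num <- runif(30)
y     <- 2*u.num - v.num + rnorm(30, 0, 0.1)
summary(lm(y ~ u.num + v.num))                         # both numeric: multiple regression
u.fac <- factor(sample(c("a","b"), 30, replace=TRUE))
summary(lm(y ~ u.fac + v.num))                         # factor + numeric: analysis of covariance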

Local regression

Instead of using the entire set to define a fit, at each point in the data set a low-degree polynomial is fitted to a subset of the data. Local regression (LOESS) uses a nearest neighbors algorithm that determines which data is going to be used to fit the local polynomial. The polynomial is fitted using weighted least squares, giving more weight to points near the point whose response is being estimated and less weight to points further away.

Here’s an animated gif (from SimplyStatistics.org) showing the process:

It has two main parameters:

  • \(\lambda\) - the degree of the local polynomial (usually linear or quadratic). If \(\lambda=0\) then LOESS turns into a moving average.
  • \(\alpha\) - the smoothing parameter (or span), ie, the proportion of data used in each fit. It must be between \((\lambda+1)/n\) and \(1\) (for \(n\) data points). The larger the \(\alpha\), the smoother the fitted curve. Small values are bad since they tend to overfit the noise in the data (typical values are 25%, 50%, 75%).
set.seed(101)
xs <- seq(0,2*pi,len=100)
df <- data.frame(x=xs, y=sin(xs)+rnorm(100,0,0.5))  # noisy dataset

plot(df$x, df$y, pch=19)
lines(xs, sin(xs), col="red", lwd=1) # target function

fit <- loess(y~x, data=df, span=0.75, degree=2)
pred <- predict(fit, df$x)
lines(df$x, pred, col="blue", lwd=2) # prediction

# draw confidence bands
library(msir)
## Package 'msir' version 1.3
## Type 'citation("msir")' for citing this R package in publications.
fit1 <- msir::loess.sd(df$x, df$y, span=0.75, degree=2)
# lines(xs, fit2$y) # the same as the prediction line above
lines(xs, fit1$upper, lty=2, col="cyan")  # one-sigma lines
lines(xs, fit1$lower, lty=2, col="cyan")

fit2 <- msir::loess.sd(df$x, df$y, span=0.75, degree=2, nsigma=2)
lines(xs, fit2$upper, lty=2, col="blue")  # two-sigma lines
lines(xs, fit2$lower, lty=2, col="blue")

Logistic Regression

The logistic function, or sigmoid function, has a real domain and a [0,1] range, which makes it appropriate for representing probabilities in suitable contexts. Its formula is \[p(x) = \frac{1}{1+e^{-x}}\]

logit <- function(x) {  # the logistic (sigmoid) function; kept under the name 'logit' for consistency with the code below
  return(1/(1+exp(-x)))
}
xs <- seq(-10,10,0.5)
plot(xs,logit(xs), type="l")
abline(v=0,lty=2)

One popular use of the logistic function is in demographic models, where \(p(t)\) is the population and \(t\) is time, and the function models a saturation limit. It is also used in neural networks as an activation function.
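
As a small illustration of the demographic use (the carrying capacity K, rate r and midpoint t0 below are made-up values): the population grows roughly exponentially at first and then saturates.

K  <- 1000                                     # carrying capacity (saturation limit)
r  <- 0.8                                      # growth rate
t0 <- 10                                       # time of the inflection point
t  <- seq(0, 20, 0.1)
plot(t, K/(1+exp(-r*(t-t0))), type="l", ylab="population")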

Logistic Regression fits a logistic function to a dataset:

marks.set <- read.csv(file="Marks.csv", head=TRUE, sep=',')
head(marks.set)
##     exam_1   exam_2 admitted
## 1 34.62366 78.02469        0
## 2 30.28671 43.89500        0
## 3 35.84741 72.90220        0
## 4 60.18260 86.30855        1
## 5 79.03274 75.34438        1
## 6 45.08328 56.31637        0
sapply(marks.set,class)
##    exam_1    exam_2  admitted 
## "numeric" "numeric" "integer"
# apply the logistic regression
model <- glm(admitted ~ exam_1 + exam_2, family = binomial("logit"), data=marks.set)
new.sample <- data.frame(exam_1=60, exam_2=86)       # a new sample 
predt <- predict(model, new.sample, type="response") # let's predict its class (admitted or not)
predt
##         1 
## 0.9894302
if (round(predt)==1) {   # we can round the result to get a binary response
  print("admitted")
} else {
  print("not admitted")
}
## [1] "admitted"
# Let's check the model's summary:
summary(model)
## 
## Call:
## glm(formula = admitted ~ exam_1 + exam_2, family = binomial("logit"), 
##     data = marks.set)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.19287  -0.18009   0.01577   0.19578   1.78527  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -25.16133    5.79836  -4.339 1.43e-05 ***
## exam_1        0.20623    0.04800   4.297 1.73e-05 ***
## exam_2        0.20147    0.04862   4.144 3.42e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 134.6  on 99  degrees of freedom
## Residual deviance:  40.7  on 97  degrees of freedom
## AIC: 46.7
## 
## Number of Fisher Scoring iterations: 7

The same prediction value could be computed by the formula \[logit(\theta^T X)\] where \(\theta\) is the vector of coefficients and \(X\) is the vector with the sample data. We assume that \(X_0=1\) and \(\theta_0\) is the intercept. So:

predictLogitModel <- function(theta,X) {  # compute logit(theta^T X)
  return ( logit( t(theta) %*% X ) )
}

X <- matrix(c(1,60,86),nrow=3)            # the sample data
C <- matrix(model$coefficients,nrow=3)    # the model's coefficients
predictLogitModel(C,X)                    # and, as we see, it's the same value as above
##           [,1]
## [1,] 0.9894302

Logistic Regression is used for classification (despite its name…), ie, the output is categorical, usually binary. The prediction output \(y = h(\theta,x)\) lies between 0 and 1 and can be interpreted as the estimated probability \(p(y=1|x,\theta)\).

So, as we did before, if there is no different cost associated with each decision, we decide \(y=1\) when \(p(y=1|x,\theta) \gt 0.5\), and decide \(y=0\) otherwise.

This is the same as deciding \(y=1\) when \(\theta^T X \geq 0\), and \(y=0\) otherwise. The equation \[\theta^T X = 0\] is called the decision boundary.
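
A quick check of this equivalence, reusing C, X and predictLogitModel from above: the sign of \(\theta^T X\) agrees with the 0.5 threshold on the predicted probability.

t(C) %*% X                                             # positive, so we predict y = 1
(t(C) %*% X >= 0) == (predictLogitModel(C, X) >= 0.5)  # both criteria give the same decision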

There is nothing to prevent us from using non-linear decision boundaries:

library(plotrix)
set.seed(222)  
round.data <- data.frame(x=rnorm(40,0,1),
                         y=rnorm(40,0,1))
data.class <- round.data$x^2 + round.data$y^2 > 1 # make a non-linear separable dataset: the decision boundary is the unit circle
round.data$class <- data.class*1
head(round.data)
##              x          y class
## 1  1.487757090  0.2023869     1
## 2 -0.001891901  0.4951010     0
## 3  1.381020790 -0.5693554     1
## 4 -0.380213631  1.1192945     1
## 5  0.184136230  2.2090779     1
## 6 -0.246895883  0.3171826     0
plot(x~y,data=round.data,col=round.data$class+1,xlim=c(-3,3),ylim=c(-3,3),pch=19)
draw.circle(0,0,1,lty=2,border="red")

model <- glm(data.class ~ x + y + I(x^2) + I(y^2), 
             family = binomial("logit"), data=round.data)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
theta <- model$coefficients  # get the coefficients
# if we scale these values we notice that the regression approximates the unit circle equation x^2 + y^2 = 1
theta/max(theta)
## (Intercept)           x           y      I(x^2)      I(y^2) 
## -0.93797140 -0.07723953  0.13032606  1.00000000  0.88413739
predt <- round( predict(model, round.data[-3], type="response") )
table(round.data$class, predt)
##    predt
##      0  1
##   0 14  0
##   1  0 26

Let’s draw the estimated decision boundary and compare it with the unit circle (the true boundary):

predModel <- function (x,y) {
  return (theta[1] + theta[2]*x + theta[3]*y + theta[4]*x^2 + theta[5]*y^2)
}
xs <- seq(-4,4,0.1)
ys <- seq(-4,4,0.1)
zs <- outer(xs,ys,predModel)

plot(x~y,data=round.data,col=round.data$class+1,xlim=c(-3,3),ylim=c(-3,3),pch=19)
draw.circle(0,0,1,lty=2,border="red")
contour(xs,ys,zs,nlev=0,add=TRUE,col="blue",lty=2)

Spline Regression

ref: James et al., An Introduction to Statistical Learning with Applications in R (2013)

Instead of fitting a high-degree polynomial over the entire range of \(X\), piecewise polynomial regression involves fitting separate low-degree polynomials over different regions of \(X\). The points where the coefficients change are called knots. Using more knots leads to a more flexible piecewise polynomial. In general, if we place \(K\) different knots throughout the range of \(X\), then we will end up fitting \(K+1\) different polynomials.

To prevent jumps around the knots, the process adds restrictions on which polynomials can be chosen for each region. Usually, for polynomials of degree \(d\), the constraints demand that each pair of polynomials be continuous up to the \((d-1)\)-th derivative. Usually \(d=3\), so the constraints go up to the second derivative. These are called cubic splines.

library(splines)
library(ISLR) # includes Wage dataset

data(Wage) 
agelims  <- range(Wage$age)
age.grid <- seq(from=agelims[1],to=agelims[2])

In the next code we prespecify knots at ages 25, 40 and 60. This produces a spline with six basis functions. (Recall that a cubic spline with three knots has seven degrees of freedom; these degrees of freedom are used up by an intercept, plus six basis functions.)

fit <- lm(wage ~ bs(age, knots=c(25,40,60) ), # prespecified knots at ages 25, 40, and 60
          data=Wage)

pred <- predict(fit, newdata=list(age=age.grid), se.fit=TRUE)

plot(Wage$age, Wage$wage, col="gray")
lines(age.grid, pred$fit, lwd=2)
lines(age.grid, pred$fit+2*pred$se.fit, lty="dashed")
lines(age.grid, pred$fit-2*pred$se.fit, lty="dashed")
for (i in c(25,40,60))
  abline(v=i, col="red", lty=2)  # draw position of knots

We could also use the df option to produce a spline with knots at uniform quantiles of the data.

dim(bs(Wage$age, knots=c(25,40,60)))
## [1] 3000    6
dim(bs(Wage$age, df=6))
## [1] 3000    6
my.knots <- attr(bs(Wage$age, df=6), "knots")

In this case R chooses knots at ages 33.8, 42.0, and 51.0, which correspond to the 25th, 50th, and 75th percentiles of age.
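
We can confirm those values directly (a quick check, not part of the original analysis):

quantile(Wage$age, probs = c(0.25, 0.50, 0.75))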

fit.2 <- lm(wage ~ bs(age, knots= my.knots), data=Wage)
pred.2 <- predict(fit.2, newdata=list(age=age.grid), se.fit=TRUE)

plot(Wage$age, Wage$wage, col="gray")
lines(age.grid, pred.2$fit, lwd=2)
lines(age.grid, pred.2$fit+2*pred$se.fit, lty="dashed")
lines(age.grid, pred.2$fit-2*pred$se.fit, lty="dashed")
for (i in 1:length(my.knots))
  abline(v=my.knots[[i]], col="red", lty=2)  # draw position of knots

The function bs() also has a degree argument, so we can fit splines of any degree, rather than the default degree of 3 (which yields a cubic spline).
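
For instance, a quadratic spline (degree 2) with the same knots can be fitted and plotted like this (a sketch for comparison, not a recommended model):

fit.q  <- lm(wage ~ bs(age, knots=c(25,40,60), degree=2), data=Wage)
pred.q <- predict(fit.q, newdata=list(age=age.grid))
plot(Wage$age, Wage$wage, col="gray")
lines(age.grid, pred.q, lwd=2, col="blue")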

Multiclass classification

If there is the need for multiclass classification, the typical method is called one-vs-all. For \(n\) classes, we create \(n\) predictors. For predictor \(i\) we create a binary classification where class \(i\) is taken as \(1\) and all the others as \(0\). With this process we get \(n\) models \(h^{(i)}(x)\), each one predicting that \(y=i\).

So, when we get a new input \(x\), we apply it to the \(n\) predictors and choose the one with the highest value. If the \(k^{th}\) predictor outputs the highest value, then our prediction will be \(y=k\).
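
A minimal one-vs-all sketch on the iris data, using the binomial glm from above as the base classifier (glm may warn about fitted probabilities of 0 or 1 for the easily separated setosa class):

classes <- levels(iris$Species)
models  <- lapply(classes, function(cl) {
  y.bin <- as.numeric(iris$Species == cl)    # class cl against all the others
  glm(y.bin ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
      family = binomial("logit"), data = iris)
})
new.sample <- iris[1, 1:4]                                                # a sample to classify
probs <- sapply(models, function(m) predict(m, new.sample, type="response"))
classes[which.max(probs)]                                                 # predicted class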

Regularization

From http://horicky.blogspot.pt/2012/05/predictive-analytics-generalized-linear.html

With a large number of input variables but a moderate amount of training data, we are subject to the overfitting problem: the model fits the training data too specifically and does not generalize to data we haven’t seen. Regularization is a technique for preventing the model from fitting the training data too specifically.

In linear regression, it is found that overfitting happens when \(\theta\) has large values. So we can add a penalty that is proportional to the magnitude of \(\theta\). In L2 regularization (also known as Ridge regression), \(\sum_i \theta_i^2\) is added to the cost function, while in L1 regularization (also known as Lasso regression), \(\sum_i |\theta_i|\) is added to the cost function.

Both L1 and L2 shrink the magnitude of the \(\theta_i\). L2 tends to give dependent input variables similar coefficients, while L1 tends to keep one of those coefficients non-zero and set the others to zero. In other words, L1 regression penalizes the coefficients of redundant variables that are linearly dependent and is frequently used to remove redundant features.

Combining L1 and L2, the general form of the cost function becomes

\[\text{Cost} = \text{Non-regularized cost} + \lambda \left( \alpha \sum_i |\theta_i| + (1-\alpha) \sum_i \theta_i^2 \right)\]

Notice there are two tunable parameters, \(\lambda\) and \(\alpha\). Lambda controls the degree of regularization (0 means no regularization, infinity means ignoring all input variables because all of their coefficients will be zero). Alpha controls the mix between L1 and L2 (0 means pure L2 and 1 means pure L1).

library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-2
# the cross validation selects the best lambda
cv.fit <- cv.glmnet(as.matrix(iris[,-5]), 
                    iris[,5],
                    nlambda=100, alpha=0.8, 
                    family="multinomial")
plot(cv.fit)

cv.fit$lambda.min # best lambda, ie, that gives minimum mean cross-validated error
## [1] 0.0005565396
prediction <- predict(cv.fit, newx=as.matrix(iris[,-5]), type="class")
table(prediction, iris$Species)
##             
## prediction   setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         47         1
##   virginica       0          3        49

Instead of picking lambda (the degree of regularization) based on cross-validation, we can also pick it based on the number of input variables that we want to retain. So we can plot the "regularization path", which shows how the coefficient of each input variable changes as lambda changes, and pick the lambda that keeps the desired number of input variables.

# try alpha = 0, Ridge regression
fit <- glmnet(as.matrix(iris[,-5]), 
              iris[,5], alpha=0, 
              family="multinomial")
plot(fit)

legend("topleft", names(iris)[-5], col = 1:4, lwd=1)

There is another penalty function, called the Huber loss function, which is robust to outliers, providing a way to perform robust regression.

The penalty function is (with \(a\) being the residual \(y-\hat{y}\)):

\[L_{\delta}(a) = \begin{cases} \frac{1}{2} a^2 & \text{for } |a| < \delta \\ \delta \left( |a| - \delta/2 \right) & \text{otherwise} \end{cases}\]

making it quadratic for small values of \(a\), and linear for larger values.

[…] the Huber loss function is convex in a uniform neighborhood of its minimum a=0; at the boundary of this uniform neighborhood, the Huber loss function has a differentiable extension to an affine function at points \(a=-\delta\) and \(a = \delta\). These properties allow it to combine much of the sensitivity of the mean-unbiased, minimum-variance estimator of the mean (using the L2 quadratic loss function) and the robustness of the median-unbiased estimator (using the L1 absolute value function). [wikipedia]

The function \(L(a) = \log(cosh(a))\) has a similar behavior.

The pseudo-Huber loss function \(L_{\delta}(a) = \delta^2 (\sqrt{1+(a/\delta)^2} - 1)\) is a smooth approximation of the Huber loss which has continuous derivatives of all degrees.

delta = 1
xs <- seq(-2,2,len=100)

huber <- function(a,delta) {
  ifelse(abs(a)<delta,a^2/2,delta*(abs(a)-delta/2))
}

plot(xs,huber(xs,delta),type="l",col="blue",lwd=2)
lines(xs,log(cosh(xs)),col="red")                      # log.cosh  curve
lines(xs,delta^2*(sqrt(1+(xs/delta)^2)-1),col="green") # pseudo huber curve

Non-linear Regression

A classical nonlinear regression example is a model like this:

\[y = f(x, \theta) + \epsilon\]

with error \(\epsilon \sim \mathcal{N}(0,\sigma^2)\), response \(y\), covariates \(x\) and model parameters \(\theta\). The main difference from linear regression is that \(f\) has no restrictions on its form.

In base R we have access to nls, which performs non-linear regression. The next examples are from its help file:

x <- -(1:100)/10
y <- 100 + 10 * exp(x / 2) + rnorm(x)/5  # some non-linear, exponential relationship

fit <- nls(y ~ Const + A * exp(B * x),   # make the non-linear regression fit
           start = list(Const=1, A = 1, B = 1)) 
fit$m$getPars()                          # check estimated parameters
##      Const          A          B 
## 99.9445814  9.9768619  0.4894439
plot(x, y, main = "", pch=20)
curve(100 + 10 * exp(x / 2), col = "blue", lwd=4, add = TRUE)
lines(x, predict(fit), col = "red", lwd=2)

If we have some information about where to search for the parameters we can use start, and if we need to tweak the number of iterations or the convergence tolerance we use control:

fit <- nls(y ~ Const + A * exp(B * x), 
           start   = list(Const=5, A = 1, B = 0.5),
           control = nls.control(maxiter = 100, tol = 1e-05))

Another eg:

my_data <- subset(DNase, Run == 1)
head(my_data)
##   Run       conc density
## 1   1 0.04882812   0.017
## 2   1 0.04882812   0.018
## 3   1 0.19531250   0.121
## 4   1 0.19531250   0.124
## 5   1 0.39062500   0.206
## 6   1 0.39062500   0.215
fit1 <- nls(density ~ 1/(1 + exp((xmid - log(conc))/scal)),
            data  = my_data,
            start = list(xmid = 0, scal = 1),
            algorithm = "plinear")
summary(fit1)
## 
## Formula: density ~ 1/(1 + exp((xmid - log(conc))/scal))
## 
## Parameters:
##      Estimate Std. Error t value Pr(>|t|)    
## xmid  1.48309    0.08135   18.23 1.22e-10 ***
## scal  1.04145    0.03227   32.27 8.51e-14 ***
## .lin  2.34518    0.07815   30.01 2.17e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01919 on 13 degrees of freedom
## 
## Number of iterations to convergence: 5 
## Achieved convergence tolerance: 1.032e-06
# plot it:
xs <- data.frame(conc=seq(0,13,len=100))
plot(my_data$conc, my_data$density, pch=20)
lines(xs$conc, predict(fit1,newdata=xs), col="red", lwd=2)

A more complex eg with indexing:

data(muscle, package = "MASS")

## The non linear model considered is
##       Length = alpha + beta*exp(-Conc/theta) + error
## where theta is constant but alpha and beta may vary with Strip.

with(muscle, table(Strip)) # 2, 3 or 4 obs per strip
## Strip
## S01 S02 S03 S04 S05 S06 S07 S08 S09 S10 S11 S12 S13 S14 S15 S16 S17 S18 
##   4   4   4   3   3   3   2   2   2   2   3   2   2   2   2   4   4   3 
## S19 S20 S21 
##   3   3   3
## We first use the plinear algorithm to fit an overall model,
## ignoring that alpha and beta might vary with Strip.

musc.1 <- nls(Length ~ cbind(1, exp(-Conc/th)), muscle,
              start = list(th = 1), algorithm = "plinear")
summary(musc.1)
## 
## Formula: Length ~ cbind(1, exp(-Conc/th))
## 
## Parameters:
##       Estimate Std. Error t value Pr(>|t|)    
## th      0.6082     0.1146   5.309 1.89e-06 ***
## .lin1  28.9633     1.2298  23.552  < 2e-16 ***
## .lin2 -34.2274     3.7926  -9.025 1.41e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.666 on 57 degrees of freedom
## 
## Number of iterations to convergence: 5 
## Achieved convergence tolerance: 9.319e-07
## Then we use nls' indexing feature for parameters in non-linear
## models to use the conventional algorithm to fit a model in which
## alpha and beta vary with Strip.  The starting values are provided
## by the previously fitted model.
## Note that with indexed parameters, the starting values must be
## given in a list (with names):
b <- coef(musc.1)
musc.2 <- nls(Length ~ a[Strip] + b[Strip]*exp(-Conc/th), muscle,
              start = list(a = rep(b[2], 21), b = rep(b[3], 21), th = b[1]))
summary(musc.2)
## 
## Formula: Length ~ a[Strip] + b[Strip] * exp(-Conc/th)
## 
## Parameters:
##     Estimate Std. Error t value Pr(>|t|)    
## a1   23.4541     0.7962  29.456 4.96e-16 ***
## a2   28.3020     0.7927  35.703  < 2e-16 ***
## a3   30.8007     1.7156  17.953 1.73e-12 ***
## a4   25.9211     3.0158   8.595 1.36e-07 ***
## a5   23.2008     2.8912   8.025 3.50e-07 ***
## a6   20.1200     2.4354   8.261 2.35e-07 ***
## a7   33.5953     1.6815  19.979 3.04e-13 ***
## a8   39.0527     3.7533  10.405 8.63e-09 ***
## a9   32.1369     3.3175   9.687 2.46e-08 ***
## a10  40.0052     3.3358  11.993 1.02e-09 ***
## a11  36.1904     3.1095  11.639 1.60e-09 ***
## a12  36.9109     1.8390  20.071 2.82e-13 ***
## a13  30.6346     1.7004  18.016 1.64e-12 ***
## a14  34.3118     3.4951   9.817 2.03e-08 ***
## a15  38.3952     3.3749  11.377 2.27e-09 ***
## a16  31.2258     0.8857  35.257  < 2e-16 ***
## a17  31.2303     0.8214  38.019  < 2e-16 ***
## a18  19.9977     1.0108  19.784 3.58e-13 ***
## a19  37.0953     1.0706  34.650  < 2e-16 ***
## a20  32.5942     1.1212  29.070 6.18e-16 ***
## a21  30.3757     1.0570  28.738 7.48e-16 ***
## b1  -27.3004     6.8732  -3.972 0.000985 ***
## b2  -26.2702     6.7537  -3.890 0.001178 ** 
## b3  -30.9011     2.2700 -13.613 1.43e-10 ***
## b4  -32.2384     3.8100  -8.461 1.69e-07 ***
## b5  -29.9406     3.7728  -7.936 4.07e-07 ***
## b6  -20.6219     3.6473  -5.654 2.86e-05 ***
## b7  -19.6246     8.0848  -2.427 0.026610 *  
## b8  -45.7799     4.1131 -11.130 3.15e-09 ***
## b9  -31.3446     6.3522  -4.934 0.000126 ***
## b10 -38.5987     3.9555  -9.758 2.21e-08 ***
## b11 -33.9211     3.8388  -8.836 9.19e-08 ***
## b12 -38.2680     8.9920  -4.256 0.000533 ***
## b13 -22.5683     8.1943  -2.754 0.013550 *  
## b14 -36.1669     6.3576  -5.689 2.66e-05 ***
## b15 -32.9521     6.3539  -5.186 7.44e-05 ***
## b16 -47.2068     9.5403  -4.948 0.000122 ***
## b17 -33.8746     7.6884  -4.406 0.000386 ***
## b18 -15.8962     6.2222  -2.555 0.020508 *  
## b19 -28.9690     7.2353  -4.004 0.000919 ***
## b20 -36.9171     8.0325  -4.596 0.000257 ***
## b21 -26.5075     7.0125  -3.780 0.001494 ** 
## th    0.7969     0.1266   6.296 8.04e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.115 on 17 degrees of freedom
## 
## Number of iterations to convergence: 8 
## Achieved convergence tolerance: 2.171e-06

Package nlstools

Package nlstools is a unified diagnostic framework for non-linear regression. It deals with the following problems:

  • The iterative estimation usually requires initial values of the model parameters. These values should be relatively close to the ‘real’ values for convergence.

  • Assessing the validity of the model fit using diagnostic and visual tools

  • The construction of confidence intervals with non-parametric methods

library(nlstools)

The next eg shows a model for oxygen kinetics during 6-minute walk tests. There are two distinct phases: a rest phase up to time \(t = 5.883\) minutes, followed by a walking stage which is modelled by an exponential rise. The formula is shown in the code:

head(O2K)
##           t      VO2
## 1 0.0000000 377.1111
## 2 0.3333333 333.3333
## 3 0.6666667 352.1429
## 4 1.0000000 328.7500
## 5 1.3333333 369.8750
## 6 1.6666667 394.4000
plot(O2K, pch=20)

formulaExp <- 
  as.formula(VO2 ~ (t <= 5.883) * VO2rest + 
                   (t >  5.883) *(VO2rest + (VO2peak-VO2rest)*(1-exp(-(t-5.883)/mu))))


fit_O2K <- nls(formulaExp, 
               start = list(VO2rest=400, VO2peak=1600, mu=1), data = O2K)
overview(fit_O2K)
## 
## ------
## Formula: VO2 ~ (t <= 5.883) * VO2rest + (t > 5.883) * (VO2rest + (VO2peak - 
##     VO2rest) * (1 - exp(-(t - 5.883)/mu)))
## 
## Parameters:
##          Estimate Std. Error t value Pr(>|t|)    
## VO2rest 3.568e+02  1.141e+01   31.26   <2e-16 ***
## VO2peak 1.631e+03  2.149e+01   75.88   <2e-16 ***
## mu      1.186e+00  7.661e-02   15.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.59 on 33 degrees of freedom
## 
## Number of iterations to convergence: 5 
## Achieved convergence tolerance: 7.598e-06
## 
## ------
## Residual sum of squares: 81200 
## 
## ------
## t-based confidence interval:
##                2.5%       97.5%
## VO2rest  333.537401  379.980302
## VO2peak 1587.155300 1674.611703
## mu         1.030255    1.342002
## 
## ------
## Correlation matrix:
##            VO2rest    VO2peak        mu
## VO2rest 1.00000000 0.07907046 0.1995377
## VO2peak 0.07907046 1.00000000 0.7554924
## mu      0.19953773 0.75549241 1.0000000
plotfit(fit_O2K, smooth = TRUE, lwd=2, pch.obs=20)

Assessing Goodness of fit

Given the parameter estimates \(\hat{\theta}\), we can use the residuals \(\hat{\epsilon}\) to check the quality of the regression

\[\hat{\epsilon} = y - f(x, \hat{\theta})\]

To get the residuals:

e_hat <- nlsResiduals(fit_O2K)
plot(e_hat)

We see that the residuals seem approximately normally distributed, and there’s no evidence of autocorrelation.

test.nlsResiduals(e_hat)
## 
## ------
##  Shapiro-Wilk normality test
## 
## data:  stdres
## W = 0.95205, p-value = 0.1214
## 
## 
## ------
## 
##  Runs Test
## 
## data:  as.factor(run)
## Standard Normal = 0.76123, p-value = 0.4465
## alternative hypothesis: two.sided

which means the null hypothesis (they are normally distributed) cannot be rejected. The second test cannot reject the hypothesis of no autocorrelation.

The next functions compute confidence regions for the parameters:

O2K.cont1 <- nlsContourRSS(fit_O2K)
##  RSS contour surface array returned
plot(O2K.cont1, col = FALSE, nlev = 5)

O2K.conf1 <- nlsConfRegions(fit_O2K, exp = 2, length = 2000)
##  Confidence regions array returned
plot(O2K.conf1, bounds = TRUE, pch=20)