suppressWarnings(library(CVXR, warn.conflicts=FALSE))

library(magrittr)
library(ggplot2)
library(ggforce)

Introduction

A convex combination of \(x_1 \ldots x_k\) is any point

\[x = \theta_1 x_1 + \ldots + \theta_k x_k\]

with \(\theta_1 + \ldots + \theta_k = 1, \theta_i \geq 0\)

The convex hull of a set \(S\) is the set of all convex combinations of points in \(S\).

A conic combination of \(x_1, x_2\) is any point

\[x = \theta_1 x_1 + \theta_2 x_2\]

with \(\theta_i \geq 0\)

The set of all conic combinations of points in \(S\) is a convex cone (the conic hull of \(S\)).
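
A quick numerical illustration in R (the points and coefficients below are arbitrary, just to make the definitions concrete):

# three arbitrary points in R^2
x1 <- c(0, 0); x2 <- c(1, 0); x3 <- c(0, 1)

# a convex combination: coefficients are nonnegative and sum to 1
theta <- c(0.2, 0.5, 0.3)
theta[1]*x1 + theta[2]*x2 + theta[3]*x3   # (0.5, 0.3): a point inside the triangle x1, x2, x3
sum(theta)                                # 1

# a conic combination: coefficients only need to be nonnegative
2*x2 + 7*x3                               # (2, 7): a point in the cone generated by x2 and x3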

Important convex sets

A hyperplane is a set of the form \(\{ x | a^T x = b\}\), where \(a \in \mathbb{R}^n, a \neq 0, b \in \mathbb{R}\). In 2D (n=2) this is a line; in 3D it’s a plane.

\(a\) can be seen as the normal vector to the hyperplane, and \(b\) its offset from the origin.

A hyperplane divides \(\mathbb{R}^n\) into two halfspaces.

A Euclidean ball in \(\mathbb{R}^n\) with center \(x_c\) and radius \(r\) has the form

\[B(x_c, r) = \{ x | \Vert x-x_c \Vert_2 \leq r \} = \{ x|(x-x_c)^T(x-x_c) \leq r^2 \}\]

where \(\Vert \cdot \Vert_2\) is the Euclidean norm, i.e., \(\Vert u \Vert_2 = (u^T u)^{1/2}\).

An ellipsoid is a set of the form:

\[\{ x|(x-x_c)^T P^{-1}(x-x_c) \leq 1 \}\]

where \(P\) is a symmetric and positive-definite matrix, i.e., \(P=P^T \succ 0\). The matrix \(P\) determines how far the ellipsoid extends in every direction from the center \(x_c\). A Euclidean ball is an ellipsoid with \(P=r^2 I\).
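
As a small sketch, a membership test for an ellipsoid (the matrix \(P\), the center and the test points below are arbitrary choices of mine):

P   <- matrix(c(4, 1,
                1, 2), nrow=2)   # symmetric and positive-definite
x_c <- c(1, 1)                   # center

# is x inside the ellipsoid {x : (x-x_c)^T P^{-1} (x-x_c) <= 1}?
in_ellipsoid <- function(x, P, x_c)
  as.numeric( t(x - x_c) %*% solve(P) %*% (x - x_c) ) <= 1

in_ellipsoid(c(1, 1), P, x_c)   # TRUE: the center is always inside
in_ellipsoid(c(5, 5), P, x_c)   # FALSE: too far from the center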

A norm ball with center \(x_c\) and radius \(r\), for a given norm \(\Vert \cdot \Vert\), is \(\{ x | \Vert x-x_c \Vert \leq r \}\).

Remember that a function is a norm if \(\Vert x \Vert \geq 0\) (with \(\Vert x \Vert = 0\) only if \(x=0\)), \(\Vert t x \Vert = |t| \Vert x \Vert\), and \(\Vert x+y \Vert \leq \Vert x \Vert + \Vert y \Vert\).

A norm cone is a set \(\{ (x,t) | \Vert x \Vert \leq t \}\)

If the norm is the Euclidean norm, then the norm cone is called the second-order cone.
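
A minimal CVXR sketch using a second-order cone constraint (the problem itself is a made-up example: find the point of the line \(x_1+x_2=1\) closest to the origin by minimizing \(t\) subject to \(\Vert x \Vert_2 \leq t\)):

x    <- Variable(2)
tvar <- Variable(1)   # plays the role of t in the norm cone

objective   <- Minimize( tvar )
constraints <- list( p_norm(x, 2) <= tvar,   # second-order cone constraint ||x||_2 <= t
                     sum(x) == 1 )           # x must lie on the line x1 + x2 = 1
problem     <- Problem(objective, constraints)

solution <- solve(problem)
solution$getValue(x)   # should be about (0.5, 0.5)
solution$value         # should be about sqrt(0.5) ~ 0.707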

A polyhedron is the solution set of a finite number of linear inequalities and equalities

\[Ax \preceq b, \quad Cx = d\]

with \(A \in \mathbb{R}^{m \times n}, C \in \mathbb{R}^{p \times n}\).

A polyhedron is the intersection of a finite number of halfspaces and hyperplanes.
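
A quick sketch of this view: checking whether points satisfy all the inequalities of a polyhedron (here the unit square \(0 \leq x_1, x_2 \leq 1\) written as \(Ax \preceq b\); the data is arbitrary):

# the unit square {x : 0 <= x1 <= 1, 0 <= x2 <= 1} written as A x <= b
A <- matrix(c( 1,  0,
               0,  1,
              -1,  0,
               0, -1), ncol=2, byrow=TRUE)
b <- c(1, 1, 0, 0)

in_polyhedron <- function(x, A, b) all(A %*% x <= b)

in_polyhedron(c(0.5, 0.5), A, b)  # TRUE: inside the square
in_polyhedron(c(1.5, 0.5), A, b)  # FALSE: violates x1 <= 1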

Optimization Problems

An optimization problem in standard form:

minimize \(f_0(x)\)

subject to

  • \(f_i(x) \leq 0, i=1,\dots,m\)

  • \(h_i(x) = 0, i=1,\dots,p\)

\(x\) is the optimization variable (can be a vector or matrix), \(f_0\) is the objective or cost function, \(f_i\) are the inequality constraints and \(h_i\) the equality constraints.

If instead of minimizing we wish to maximize, \(f_0\) is usually called the utility or profit function.

\(x\) is feasible if it satisfies the constraints and belongs to the domain of \(f_0\).

The optimal value \(p^*\) is the infimum of \(f_0\) over the feasible set, if it exists. If \(p^* = \infty\) the problem is infeasible (no \(x\) satisfies the constraints). If \(p^* = - \infty\) the problem is unbounded below, which is usually a sign of an incorrect model.

The set of optimal points is \(X_{opt} = \{x | f_0(x) = p^* \land x \text{ feasible} \}\)

There are also implicit constraints: \(x\) must belong to the domains of the functions \(f_0, f_i, h_i\).

The next example has no explicit constraints.

Say we have some data:

set.seed(100)
k <- 3
A <- matrix(rnorm(k*k, 10), k, k)  # random 3x3 matrix, entries drawn from N(10,1)
b <- rnorm(k, 10)                  # random vector of length 3, entries from N(10,1)

and this is our optimization problem:

minimize \(f_0(x) = - \sum_{i=1}^k \log (b_i - a_i^T x)\)

but \(a_i^T x < b_i\) is an implicit constraint (the argument of the logarithm must be positive), where \(a_i^T\) is the \(i\)-th row of \(A\).

Using CVXR:

x <- Variable(k)      # define variable as a vector 3x1

objective <- Minimize( -sum(log(b-A%*%x)) )  # define objective function
problem   <- Problem(objective)              # create problem instance

## Checking problem solution

solution <- solve(problem) 
solution$status       # can return optimal, optimal_inaccurate, infeasible or unbounded
## [1] "optimal_inaccurate"
solution$getValue(x)  # an optimal x
##           [,1]
## [1,]  13352326
## [2,] -10067209
## [3,]  -9496189
solution$value        # p* = f0(x)
## [1] -51.70338

Package CVXR deals with convex optimization problems. These are optimization problems where the functions \(f_0, f_i\) are convex and the equality constraints are affine (i.e., linear functions plus a constant):

minimize \(f_0(x)\)

subject to

  • \(f_i(x) \leq 0, i=1,\dots,m\)

  • \(a_i^T x = b_i, i=1,\dots,p\) or just \(Ax = b\)

An important property is that the feasible set of a convex problem is itself convex.

Another important property is that a locally optimal point of a convex problem is (globally) optimal.

One important point is that convexity is an attribute of the description of the problem: an equivalent reformulation of a non-convex description may well be convex.

Consider the following problem

minimize \(f_0(x) = x_1^2 + x_2^2\)

subject to

  • \(f_1(x) = x_1 / (1+x_2^2) \leq 0\)

  • \(h_1(x) = (x_1 + x_2)^2 = 0\)

which definitely is not a convex problem (\(h_1\) is not affine, for example).

However, requiring \((x_1+x_2)^2 = 0\) is the same as requiring \(x_1+x_2 = 0\).

Also, \(1+x_2^2\) is always positive, so \(x_1/(1+x_2^2) \leq 0\) is equivalent to \(x_1 \leq 0\).

So the original problem can be transformed into an equivalent problem:

minimize \(f_0(x) = x_1^2 + x_2^2\)

subject to

  • \(f_1(x) = x_1 \leq 0\)

  • \(h_1(x) = x_1 + x_2 = 0\)

which is a convex problem!

x1 <- Variable(1)      # a scalar
x2 <- Variable(1)      # a scalar

objective   <- Minimize( x1^2 + x2^2 ) 
constraints <- list(x1 <= 0, x1 + x2 == 0)
problem     <- Problem(objective, constraints)

## Checking problem solution

solution <- solve(problem) 
solution$status       
## [1] "optimal"
solution$getValue(x1) 
## [1] -9.065648e-06
solution$getValue(x2) 
## [1] 9.065651e-06
solution$value
## [1] -1.021385e-09

Linear Optimization

A linear program (LP) has an affine objective and constraint functions.

The feasible set is a polyhedron (the simplex method is an algorithm that moves in this polyhedron to find the optimal solution).

Example: choose cheapest healthy diet \(x\) with quantities \(x_1 \ldots x_n\) of \(n\) foods, where (a) one unit of food \(j\) costs \(c_j\) and contains amount \(a_{ij}\) of nutrient \(i\); (b) healthy diet requires nutrient \(i\) in quantity at least \(b_i\).

minimize \(c^T x\)

subject to

  • \(Ax \geq b\)

  • \(x \geq 0\)
n <- 4  # number of foods
i <- 3  # number of nutrients

# nutrient amounts (column is food, row is nutrient)
A <- matrix(c(10, 40,  0, 20,   
               5, 20, 15, 11,
              20, 10,  6,  0), ncol=n, nrow=i, byrow = T)
# minimal amounts
b <- c(50, 10, 20)
# costs
c <- c(30, 10, 5, 15)

The problem specification:

x <- Variable(4)      # 4x1 vector

objective   <- Minimize( t(c) %*% x ) 
constraints <- list(A%*%x >= b, x >= 0)
problem     <- Problem(objective, constraints)

## Checking problem solution

solution <- solve(problem) 
solution$status       
## [1] "optimal"
round(solution$getValue(x),3)
##      [,1]
## [1,] 0.00
## [2,] 1.25
## [3,] 1.25
## [4,] 0.00
solution$value
## [1] 18.75

Piecewise-linear optimization

These are problems like:

minimize \(\max_{i=1\ldots m} (a_i^T x + b_i)\)

Graphically, the objective is the pointwise maximum of \(m\) affine functions, a convex piecewise-linear function, and we look for its lowest point.

This problem is equivalent to the following LP:

minimize \(t\)

subject to

  • \(a_i^T x + b_i \leq t, i=1\ldots m\)
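
A minimal CVXR sketch of this equivalence (the data \(A, b\) below is arbitrary, chosen so that the problem is bounded; max_entries is CVXR's pointwise maximum over the entries of a vector):

A <- matrix(c( 1,  1,
               1, -1,
              -1,  1,
              -1, -1), ncol=2, byrow=TRUE)
b <- c(0, 1, -1, 2)

# direct formulation: minimize the pointwise maximum of the affine functions
x  <- Variable(2)
p1 <- Problem(Minimize( max_entries(A %*% x + b) ))
s1 <- solve(p1)

# equivalent LP (epigraph form): minimize t subject to a_i^T x + b_i <= t
tvar <- Variable(1)
p2   <- Problem(Minimize(tvar), list(A %*% x + b <= tvar * rep(1, nrow(A))))
s2   <- solve(p2)

c(s1$value, s2$value)   # the two optimal values should coincide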

Example: find the Chebyshev center of a polyhedron.

The Chebyshev center \(x_c\) is the center of the largest inscribed circle (in a sense, the deepest point inside the polyhedron). Let \(r\) be the radius of that circle.

n <- 2
A <- matrix(c(2,  1, 
              2, -1, 
             -1,  2,
             -1, -2), nrow=4, ncol=n, byrow = TRUE)
b <- c(1,1,1,1)
norm_A <- rowSums(A^2)^0.5  # Euclidean norm of each row a_i of A (fixed scalars)
ggplot() +
    geom_abline(slope = -A[,1]/A[,2], intercept = b/A[,2]) +
    labs(x = "x", y = "y") +
    theme_bw() +
    xlim(-1, 1) + ylim(-1, 1)

This problem can be solved by the following LP:

maximize \(r\)

subject to

  • \(a_i^T x_c + r \Vert a_i \Vert_2 \leq b_i, i=1\ldots m\)
r   <- Variable(1, name="radius")
x_c <- Variable(n, name="center")

objective   <- Maximize( r ) 
constraints <- list(A%*%x_c + r*norm_A <= b)
problem     <- Problem(objective, constraints)

## Checking problem solution

solution <- solve(problem) 
solution$status       
## [1] "optimal"
solution$getValue(x_c)
##               [,1]
## [1,] -2.643728e-10
## [2,] -5.614499e-17
solution$value
## [1] 0.4472136
center <- round(as.double(solution$getValue(x_c)),3)
radius <- round(as.double(solution$getValue(r)),3)

ggplot() +
    geom_abline(slope = -A[,1]/A[,2], intercept = b/A[,2]) +
    geom_circle(mapping = aes(x0 = center[1], y0 = center[2], r = radius), color = "blue") +
    geom_point(mapping = aes(x = center[1], y = center[2]), color = "red", size = 2) +
    labs(x = "x", y = "y") +
    theme_bw() +
    xlim(-1, 1) + ylim(-1, 1)

Quadratic Programming

A convex optimization problem is called a quadratic program (QP) if the objective function is a convex quadratic and the constraint functions are affine.

An example:

minimize \(x^2 + y^2\)

subject to

  • \(x \geq 0\)

  • \(2x+y=1\)
x <- Variable(1)
y <- Variable(1)

objective   <- Minimize(x^2 + y^2)
constraints <- list(x >= 0, 2*x + y == 1)
problem     <- Problem(objective, constraints)

## Checking problem solution
solution <- solve(problem) 
solution$status
## [1] "optimal"
solution$getValue(x)
## [1] 0.3999978
solution$getValue(y)
## [1] 0.2000044
solution$value
## [1] 0.2

Least Squares

A very well-known problem that is a QP is least squares (linear regression).

Let’s generate some data based on a linear model: \(Y = X \beta + \epsilon\)

set.seed(123)

n <- 100  
p <- 10
beta <- -4:5   # the true parameters are [−4,...,−1,0,1,...,5]

X <- matrix(rnorm(n * p), nrow=n)
Y <- X %*% beta + rnorm(n)

The QP is

minimize \(\|Y - X\beta\|_2^2\)

betaHat   <- Variable(p)
objective <- Minimize(sum((Y - X %*% betaHat)^2))
problem   <- Problem(objective)

result <- solve(problem)
result$getValue(betaHat) %>% round(3)
##         [,1]
##  [1,] -3.920
##  [2,] -3.012
##  [3,] -2.125
##  [4,] -0.867
##  [5,]  0.091
##  [6,]  0.949
##  [7,]  2.076
##  [8,]  3.127
##  [9,]  3.961
## [10,]  5.135

We can compare with the use of R’s lm:

ls.model <- lm(Y ~ 0 + X)
matrix(coef(ls.model), ncol = 1) %>% round(3)
##         [,1]
##  [1,] -3.920
##  [2,] -3.012
##  [3,] -2.125
##  [4,] -0.867
##  [5,]  0.091
##  [6,]  0.949
##  [7,]  2.076
##  [8,]  3.127
##  [9,]  3.961
## [10,]  5.135

So far we have just emulated lm… But this QP can be extended to variants of the classical method. Suppose we would like all estimates of \(\beta\) to be nonnegative (nonnegative least squares regression).

While the classical regression has an analytical solution, this variant does not. However, it’s just an extra constraint for the QP:

constraints <- list(betaHat >= 0)
problem <- Problem(objective, constraints)

result <- solve(problem)
result$getValue(betaHat) %>% round(3)
##        [,1]
##  [1,] 0.000
##  [2,] 0.000
##  [3,] 0.000
##  [4,] 0.000
##  [5,] 1.237
##  [6,] 0.623
##  [7,] 2.123
##  [8,] 2.804
##  [9,] 4.445
## [10,] 5.207

Another example: suppose we would like to force \(2\beta_2 + \beta_3 = 0\):

A <- matrix(c(0, 2, 1, rep(0, 7)), nrow = 1)
constraints <- list(A %*% betaHat == 0)
problem <- Problem(objective, constraints)

result <- solve(problem)
result$getValue(betaHat) %>% round(3)
##         [,1]
##  [1,] -3.526
##  [2,]  0.262
##  [3,] -0.524
##  [4,] -1.078
##  [5,]  0.797
##  [6,]  0.319
##  [7,]  1.836
##  [8,]  3.512
##  [9,]  4.187
## [10,]  5.478

Support Vector Classifier

Let’s make some data:

set.seed(101)
n <- 20
X1 <- c(rnorm(n), rnorm(n, mean=1.5))
X2 <- c(rnorm(n), rnorm(n, mean=1.5))
X <- matrix(c(X1,X2), ncol=2)
Y <- c(rep(1,n), rep(-1,n))

ggplot() + 
  geom_point(mapping = aes(x = X[,1], y = X[,2]), color = Y+2, size = 2) +
  labs(x = "x1", y = "x2") +
  theme_bw()

The Support Vector Classifier (SVC) is an affine function (hyperplane) that separates two sets of points by the widest margin. When the sets are not linearly separable, the SVC is determined by a trade-off between the width of the margin and the number of points that are misclassified.

Slack variable \(\xi_i\) is the amount by which point \(i\) is allowed to violate the margin. Its cost is controlled by the penalty \(C>0\) (as \(C\) increases, fewer violations are allowed).

The model is:

minimize \(\frac{1}{2} \Vert \beta \Vert^2 + C\sum_{i=1}^m \xi_i\)

subject to

  • \(\xi_i \geq 0\)

  • \(y_i(x_i^T\beta + \beta_0) \geq 1 - \xi_i, \quad i = 1,\ldots,m\)
C <- 100
beta0 <- Variable()
beta  <- Variable(2)
slack <- Variable(2*n)

# Form problem (note: the penalty term here includes beta0 together with beta,
# a slight variation on the formula above)
objective <- (1/2) * sum_squares(vstack(beta, beta0)) + C * sum(slack)
constraints <- list(Y * (X %*% beta + beta0) >= 1 - slack, slack >= 0)
problem <- Problem(Minimize(objective), constraints)
solution <- solve(problem)

To plot the separating hyperplane (here, a line) we need to get the \(\beta\) to find its slope and intercept:

beta.hat  <- solution$getValue(beta)
beta0.hat <- solution$getValue(beta0)

ggplot() + 
  geom_point(mapping = aes(x = X[,1], y = X[,2]), color = Y+2, size = 2) +
  geom_abline(slope = -beta.hat[1,]/beta.hat[2,], intercept = -beta0.hat/beta.hat[2,]) +
  labs(x = "x1", y = "x2") +
  theme_bw()

Vector Optimization

Vector optimization occurs when the objective function \(f_0\) to minimize is vector-valued, i.e., \(f_0 : \mathbb{R}^n \rightarrow \mathbb{R}^q\).

For this problem to make sense there must be a proper cone \(K\) with which to compare objective values. If \(x,y\) are two feasible points (they satisfy the constraints), the respective objective values \(f_0(x), f_0(y)\) are compared using the generalized inequality \(\preceq_K\). If \(f_0(x) \preceq_K f_0(y)\) then we interpret this as \(x\) being better than or equal to \(y\).

The main problem is that \(f_0(x), f_0(y)\) might not be comparable, something that cannot happen in scalar objective functions.

If for every feasible \(y\) we have \(f_0(x) \preceq_K f_0(y)\), we say that \(x\) is optimal. The corresponding optimal objective value, if it exists, is unique (since \(K\) is a proper, hence pointed, cone). Such a point, however, is not common.

A feasible \(x\) is Pareto optimal if, for every feasible \(y\), \(f_0(y) \preceq_K f_0(x) \implies f_0(y) = f_0(x)\). That is, among all objective values comparable with \(f_0(x)\), it is minimal.

A vector optimization problem usually has many Pareto optimal values. Every Pareto optimal value lies in the boundary of the set of feasible objective values.

Scalarization

Scalarization is a technique to find Pareto optimal values.

Choose any vector \(\lambda \succ_{K^*} 0\) (i.e., \(\lambda\) in the interior of the dual cone \(K^*\)) and transform the objective function into

minimize \(\lambda^T f_0(x)\)

If \(x^*\) is an optimal point of this scalar problem, then \(x^*\) is Pareto optimal for the original vector optimization problem.

Vector \(\lambda\) is called the weight vector, satisfying \(\lambda \succ_{K^*} 0\). It is a free parameter which we can vary to obtain different Pareto optimal solutions. However, there might be Pareto optimal solutions that cannot be found using scalarization.
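
As a sketch of scalarization with CVXR (a standard bi-objective example, not taken from the text above): trade off the fit \(\Vert Ax-b \Vert_2^2\) against the size \(\Vert x \Vert_2^2\) using weights \(\lambda = (1, \gamma)\), and vary \(\gamma > 0\) to obtain different Pareto optimal points.

set.seed(1)
m <- 30; n <- 10
A <- matrix(rnorm(m*n), m, n)
b <- rnorm(m)

x <- Variable(n)

# scalarized objective with lambda = (1, gamma); each gamma gives one Pareto optimal point
pareto <- sapply(c(0.01, 0.1, 1, 10, 100), function(gamma) {
  objective <- Minimize( sum_squares(A %*% x - b) + gamma * sum_squares(x) )
  sol   <- solve(Problem(objective))
  x_opt <- sol$getValue(x)
  c(gamma = gamma,
    fit   = sum((A %*% x_opt - b)^2),   # first objective value
    size  = sum(x_opt^2))               # second objective value
})
round(t(pareto), 3)   # as gamma grows, 'size' shrinks while 'fit' grows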