Gradient descent towards minimising prediction error with two parameters (red to blue) [source]

It has been well over a year since my last entry, I have been rather quiet because someone has been rather loud 👶 Just last week I found some time to rewrite a draft on gradient descent from about two years ago, so here we are – back in business!

Gradient descent is a fundamental concept in machine learning, and therefore one practitioners should be familiar with. In this post we will explore gradient descent from both theoretical and practical perspectives. First, we will establish some general definitions, review cost functions in the context of regression and binary classification, and introduce the chain rule of calculus. Then, we will put it all into practice to build a linear and a logistic regression models from the ground up. This is a short, introductory guide where a basic knowledge of statistics and calculus should be most helpful. Ready, set, go! ▶️

Introduction

Most predictive algorithms learn by estimating a set of parameters $\beta$ – also known as weights, or coefficients – that minimise the cost function, a measure of prediction error. In gradient-based learning, that search process invariably employs gradient descent or one of its numerous extensions. Gradients are partial derivatives of the cost function $J(\beta)$ with respect to each model parameter, $\frac{\partial J}{\partial \beta}$ . On a high level, gradient descent is an iterative procedure that computes predictions and updates parameter estimates by subtracting their corresponding gradients weighted by a learning rate $\alpha$ . This procedure will find estimates that incur ever smaller prediction errors, provided i) the learning rate is adequate, ii) the maximum number of iterations is sufficient and iii) the underlying dataset is suitable for modelling.

Gradient descent. (Left) In the course of many iterations, the update equation is applied to each parameter simultaneously. When the learning rate is fixed, the sign and magnitude of the update fully depends on the gradient. (Right) The first three iterations of a hypothetical gradient descent, using a single parameter. The red circle highlights the parameter estimate at each iteration, dotted grey lines denote the corresponding gradient sign and magnitude

How gradient descent works will become clearer once we establish a general problem definition, review cost functions and derive gradient expressions using the chain rule of calculus, for both linear and logistic regression.

Problem definition 📋

We start by establishing a general, formal definition. Suppose we have a matrix $X \in \mathbb{R}^{n \times p}$ carrying p predictors observed across n observations, and the vector of parameters $\beta \in \mathbb{R}^{p+1}$ comprising the model intercept and the slopes from each individual predictor. Our aim is to predict some target y using a combination of $X$ and $\beta$ :

In linear regression, we would like to predict the continuous target variable $y \in \mathbb{R}^{n}$ . The model prediction from any observation i takes the form $\hat{y_i} = \beta_0 + \beta_1x^1_i + ... + \beta_px^p_i$
In logistic regression, we would like to predict the binary target variable $y \in \{0,1\}$ . The model prediction from any observation i takes the form $\hat{y}_i = \sigma(z) = \frac{1}{1 + e^{-z}}$ , where $z = log(\frac{P(y_i = 1|x_i)}{1 - P(y_i = 1|x_i)}) = \beta_0 + \beta_1 x_i^1 + ... + \beta_p x_i^p$ is the logit of y, and $\sigma$ is the logistic function that transforms logit to target probability

Note that in linear and logistic regression, in one shot you can directly access all n predictions and logit values, respectively, via the matrix product $X \beta$ after appending a column of ones to the left of $X$ .

Cost function ⚖️

In both linear and logistic regression, the aim is to find the parameter estimates that minimise the prediction error. This error is generally measured using a cost function:

In linear regression, one can use the mean squared error (MSE), $J(\beta) = \frac{\sum^n_{i=1} (\hat{y_i} - y_i)^2}{2n}$ . It measures the average squared difference between the observed and predicted target values (lower is better)
In logistic regression, one can use the binary cross-entropy, $J(\beta) = - \sum^n_{i=1} y_ilog(\hat{y}) + (1-y_i)log(1-\hat{y})$ . It measures the amount of uncertainty shared between the observed and estimated probability distributions (lower is better)

Putting it together, we have a recipe to predict a target $y$ using $X$ and $\beta$ , and measure the error between resulting prediction $\hat{y}$ and the target. What are we missing then?

Gradient descent ⚙️

As we saw earlier, gradient descent is mainly driven by the partial derivatives of the cost function with respect to each individual parameter, $\frac{\partial J}{\partial \beta}$ . So, how do we calculate these?

To this end, we can use of the chain rule of calculus, which posits that $\frac{\partial}{\partial x}f(g(h(x))) = \frac{\partial f}{\partial g} . \frac{\partial g}{\partial h} . \frac{\partial h}{\partial x}$ for any number of functions, to simplify the derivative of a cost function with respect to any parameter. Assuming $f(x, \beta) = \beta_0 + \beta_1x^1 + \beta_2x^2 + ... + \beta_px^p$ , which turns out to be $\hat{y}$ and the logit $z$ for linear and logistic regressions respectively, we find that:

For linear regression using a MSE cost function $J(\beta) = \frac{\sum^n_{i=1} (\hat{y_i} - y_i)^2}{2n}$ , we can decompose the derivative of the cost function into two parts, $\frac{\partial J}{\partial \beta_j} = \frac{\partial J}{\partial f}. \frac{\partial f}{\partial \beta_j} = \frac{\sum^n_{i=1} (\hat{y_i} - y_i)}{n} . x^j_i$
For logistic regression using a binary cross-entropy cost function $J(\beta) = - \sum^n_{i=1} y_ilog(\hat{y_i}) + (1-y_i)log(1-\hat{y_i})$ , we can decompose the derivative of the cost function into three parts, $\frac{\partial J}{\partial \beta_j} = \frac{\partial J}{\partial \sigma} . \frac{\partial \sigma}{\partial f} . \frac{\partial f}{\partial \beta_j} = \sum^n_{i=1} (-\frac{y_i}{\hat{y_i}}+\frac{1-y_i}{1-\hat{y_i}}) . (\hat{y_i}(1-\hat{y_i})) . x^j_{i}$ , or equivalently $\frac{\partial J}{\partial \beta_j} = \sum^n_{i=1} (\hat{y_i}-y_i) x^j_{i}$

In both cases the application of gradient descent will iteratively update the parameter vector $\mathbf{\beta}$ using the aforementioned equation $\beta := \beta - \frac{\partial J}{\partial \beta} \alpha$ . Also, it is worth noting that none of the predictors contributes to the intercept partial derivative, i.e. $\frac{\partial f}{\partial \beta_0} = 1$ .

Now, I encourage you to apply the chain rule and try deriving these gradient expressions by hand, using the provided definitions of $J(\beta)$ , $\sigma(z)$ and $f(x, \beta)$ accordingly ✏️

Let’s get started with R

Let us now apply these learnings to build two models using R:

Linear regression, to predict the age of orange trees from their circumference 🍊
Logistic regression, to predict engine shape from miles-per-gallon (mpg) and horsepower 🚘

In both cases we will implement batch gradient descent, where all training observations are used in each iteration. Mini-batch and stochastic gradient descent are popular alternatives that use instead a random subset or a single training observation, respectively, making them computationally more efficient when handling large sample sizes. Make sure that you have RColorBrewer installed.

Linear regression 🍊

Our first model, based on the Orange dataset, will have the following structure:

$\text{age} \sim \beta_0 + \beta_1 \text{circumference}$

In the code below we will configure gradient descent such that in each of 25 iterations, a prediction is made and the two parameters $\beta_0$ and $\beta_1$ are updated using the gradient expressions presented earlier, using the learning rate $\alpha = 1 \times 10^{-4}$ . We will also designate the two parameters as b0 and b1, then plot and store their estimates under the lists intercepts and slopes, respectively. Finally, we will create a MSE cost function contour to visualise the trajectory of the same parameter estimates in the course of gradient descent.

	# Sun Nov 14 19:58:50 2021 ——————————

	# Regression Oranges
	library(RColorBrewer)

	data("Orange")

	# Determine number of iterations
	niter <- 25
	# Determine learning rate / step size
	alpha <- 1e-4
	set.seed(1)
	b0 <- rnorm(1) # intercept
	b1 <- rnorm(1) # slope

	# Set palette
	cols <- colorRampPalette(rev(brewer.pal(n = 7, name = "RdBu")))(niter)

	# Plot
	layout(matrix(c(1,1,2,2,3), nrow = 1, ncol = 5))
	plot(age ~ circumference, data = Orange, pch = 16,
	xlab = "Circumference (mm)",
	ylab = "Age (days)",
	col = rgb(0,0,1,.5))

	# Perform gradient descent
	slopes <- rep(NA, niter)
	intercepts <- rep(NA, niter)
	for(i in 1:niter){
	# prediction
	y <- b0 + b1 * Orange$circumference
	# b0 = b0 – dJ/da * alpha
	b0 <- b0 – sum(y – Orange$age) / nrow(Orange) * alpha
	# b1 = b1 – dJ/db * alpha
	b1 <- b1 – sum((y – Orange$age) * Orange$circumference) / nrow(Orange) * alpha
	abline(a = b0, b = b1, col = cols[i], lty = 2)
	# Save estimates over all iterations
	intercepts[i] <- b0
	slopes[i] <- b1
	}
	title("Regression Fit")

	# Cost function contour
	allCombs <- expand.grid(b0 = seq(-50, 50, length.out = 100),
	b1 = seq(7.5, 8.5, length.out = 100))
	res <- matrix(NA, 100, 100)
	# a by rows, b by cols
	for(i in 1:nrow(allCombs)){
	y <- allCombs$b0[i] + allCombs$b1[i] * Orange$circumference
	res[i] <- sum((y – Orange$age)*2)/(2nrow(Orange))
	}

	# Plot contour
	contour(t(res), xlab = expression(beta[1]),
	ylab = expression(beta[0]), axes = F, nlevels = 25)
	axis(1, at = c(0, .5, 1), labels = c(7.5, 8, 8.5))
	axis(2, at = c(0, .5, 1), labels = c(-50, 0, 50))
	points((slopes – min(allCombs$b1)) / diff(range(allCombs$b1)),
	(intercepts – min(allCombs$b0)) / diff(range(allCombs$b0)),
	pch = 4, col = cols)
	title("MSE contour")

	# Add colorbar
	z = matrix(1:niter, nrow = 1)
	image(1, 1:niter, z,
	col = cols, axes = FALSE, xlab = "", ylab = "")
	title("Iteration No.")
	axis(2, at = c(1, seq(5, niter, by=5)))

view raw

1-grad.R

hosted with ❤ by GitHub

By colouring the iterations from blue (early) to red (late) using a diverging palette, we can see how gradient descent arrived at parameter estimates that minimise the prediction error! ✅ In addition, the contour suggests that an optimal regression fit can be obtained using a small-intercept, large-coefficient combination or vice versa. Finally, note that the variation in the values of $\beta_0$ in the right-hand panel is concealed by the scale used in the MSE contour plot, intentionally set for ease of contour visualisation.

Logistic regression 🚘

Our second model, based on the mtcars dataset, will have the following structure:

$z = \beta_0 + \beta_1 \text{mpg} + \beta_2 \text{horsepower}$

$\text{engine shape} \sim \frac{1}{1 - e^{-z}}$

which is reminiscent of the generalised linear model employed in the past introduction to Bayesian models. In the code below, we will configure gradient descent such that in each of 100 iterations, a prediction is made and the three parameters $\beta_0$ , $\beta_1$ and $\beta_2$ are updated using the gradient expressions presented earlier, using the learning rate $\alpha = 1 \times 10^{-5}$ . We will also designate the three parameters as b0, b1 and b2 and store their estimates under the lists intercepts, slopes1 and slopes2, respectively. Lastly, we will calculate and visualise the decision boundary from each iteration and examine how the parameter estimates change in the process.

CALCULATING THE DECISION BOUNDARY

The decision boundary can be determined from setting the logit expression $z$ to zero – equivalent to a target probability of 50% – then solving for either mpg or horsepower:

$\beta_0 + \beta_1 \text{mpg} + \beta_2 \text{horsepower} = 0 \Leftrightarrow \text{mpg} = \frac{-(\beta_2 \text{horsepower} + \beta_0)}{\beta_1}$

Solving for mpg here, we only need to pick at least two arbitrary values of horsepower, find their corresponding mpg values and connect the points with a line, in each iteration

	# Sun Nov 14 20:13:12 2021 ——————————

	# Logistic regression mtcars
	library(RColorBrewer)

	data("mtcars")

	# Determine number of iterations
	niter <- 100
	# Determine learning rate / step size
	alpha <- 1e-5
	set.seed(1)
	b0 <- rnorm(1) # intercept
	b1 <- rnorm(1) # slope1
	b2 <- rnorm(1) # slope2

	# Set palette
	cols <- colorRampPalette(rev(brewer.pal(n = 7, name = "RdBu")))(niter)

	# Plot
	layout(matrix(c(1,1,2,2,3), nrow = 1, ncol = 5))
	plot(mpg ~ hp, data = mtcars, pch = 16,
	xlab = "Horsepower", ylab = "Miles per gallon (mpg)",
	col = ifelse(mtcars$vs, rgb(0,0,1,.5), rgb(1,0,0,.5)))
	legend("topright", legend = c("Straight", "V-shaped"),
	col = c(rgb(0,0,1,.5), rgb(1,0,0,.5)), pch = 16,
	bty = "n")
	# Perform gradient descent
	slopes1 <- rep(NA, niter)
	slopes2 <- rep(NA, niter)
	intercepts <- rep(NA, niter)
	for(i in 1:niter){
	# prediction
	logit <- b0 + b1 * mtcars$mpg + b2 * mtcars$hp
	y <- 1 / (1 + exp(-logit))
	# b0 = b0 – dJ/da * alpha
	b0 <- b0 – sum(y – mtcars$vs) * alpha
	# b1 = b1 – dJ/db * alpha
	b1 <- b1 – sum((y – mtcars$vs) * mtcars$mpg) * alpha
	# b2 = b2 – dJ/db * alpha
	b2 <- b2 – sum((y – mtcars$vs) * mtcars$hp) * alpha
	# Solve for b0 + b1 * mtcars$mpg + b2 * mtcars$hp = 0
	# mtcars$mpg = -(b2 * mtcars$hp + b0) / b1
	db_points <- -(b2 * c(40, 400) + b0) / b1
	# Plot decision boundary
	lines(c(40, 400), db_points, col = cols[i], lty = 2)
	# Save estimates over all iterations
	intercepts[i] <- b0
	slopes1[i] <- b1
	slopes2[i] <- b2
	}
	title("Decision Boundary")

	plot(x=1:niter, intercepts, type = "l", lty=1,
	ylim = c(-1, 1), xlab = "Iteration No.", ylab = "Coefficient value")
	abline(h = 0, lty = 2, col="grey")
	lines(x=1:niter, slopes1, lty=3)
	lines(x=1:niter, slopes2, lty=4)
	legend("topright",
	legend = c(expression(beta[0]),
	expression(beta[1]),
	expression(beta[2])),
	lty = c(1, 3, 4),
	bty = "n")
	title("Coefficient Estimation")

	# Add colorbar
	z = matrix(1:niter, nrow = 1)
	image(1, 1:niter, z,
	col = cols, axes = FALSE, xlab = "", ylab = "")
	title("Iteration No.")
	axis(2, at = c(1, seq(20, niter, by=20)))

view raw

2-grad.R

hosted with ❤ by GitHub

Applying the same colouring scheme, we can see how gradient descent arrived at parameter estimates that minimise the prediction error! ✅ Note how the decision boundary neatly separates most members of the two engine shape classes. In addition, the right-hand panel shows how the slope parameters $\beta_1$ and $\beta_2$ change steadily, at different rates until they plateau, while the intercept is readily defined early on.

Conclusion

Well done, we have just learned and applied gradient descent! Time to recap all the steps taken:

A general problem definition was set to assign predictors, target, parameters and learning rate to well defined mathematical entities
The MSE and the binary cross-entropy cost functions were introduced
The chain rule of calculus was presented and applied to arrive at the gradient expressions based on linear and logistic regression with MSE and binary cross-entropy cost functions, respectively
For demonstration, two basic modelling problems were solved in R using custom-built linear and logistic regression, each based on the corresponding gradient expressions

Finally, here are some ideas of where to go from here:

Customisation. One major advantage of knowing the nuts and bolts of gradient descent is the high degree of customisation it allows. You could, for example, use the above recipe to add parameter regularisation to a cost function and derive the updated gradient expressions 🛠️
Interpretation. Although the focus of our two exercises was computational, I urge you to make sense of the resulting parameter estimates. It seems obvious that older trees have wider circumferences, not so much that V-shaped engines provide less mpg. Can we create models that explore causality? How much does a unit change in each predictor impact the target prediction? To me, these sort of questions serve as an enjoyable sanity check 💭
Experimentation. Use the code above as a template to experiment with different learning rates and number of iterations. Recall that the learning rate determines the step size of each parameter update, while the number of iterations determines how long learning takes place. Do your experiments conform to expected results? 🧪
Benchmarking. It could be interesting to compare our estimation results with those obtained from using lm() and glm() or similar. Do parameter estimates agree between? Which algorithm is faster? How do they differ? 📏

I hope this tutorial has rekindled your interest in machine learning, inspired novel applications and demystified the underlying mathematics. As usual, please enter a comment below – a question, a suggestion, praise or criticism. Until next time!

Citation

de Abreu e Lima, F (2023). poissonisfish: Gradient descent in R. From https://poissonisfish.com/2023/04/11/gradient-descent/

poissonisfish

Gradient descent in R

Introduction

Problem definition 📋

Cost function ⚖️

Gradient descent ⚙️

Let’s get started with R

Linear regression 🍊

Logistic regression 🚘

Conclusion

Citation

4 thoughts on “Gradient descent in R”

Leave a comment Cancel reply

Introduction

Problem definition 📋

Cost function ⚖️

Gradient descent ⚙️

Let’s get started with R

Linear regression 🍊

Logistic regression 🚘

Conclusion

Citation

Share this:

4 thoughts on “Gradient descent in R”

Leave a comment Cancel reply