If there was one thing that always frustrated me, it was not fully understanding Bayesian inference. Sometime last year, I came across an article about a TensorFlow-supported R package for Bayesian analysis, called
greta. Back then, I searched for
greta tutorials and stumbled on this blog post that praised a textbook called Statistical Rethinking: A Bayesian Course with Examples in R and Stan by Richard McElreath. I had found a solution to my lingering frustration, so I bought a copy straight away.
I spent the last few months reading it cover to cover and solving the proposed exercises, which are heavily based on the
rethinking package. I cannot recommend it highly enough to anyone seeking a solid grip on Bayesian statistics, both in theory and in application. This post ought to be my most gratifying blogging experience so far, in that I am essentially reporting my own recent learning. I am convinced this will make the storytelling all the more effective.
As a demonstration, the female cuckoo reproductive output data recently analysed by Riehl et al., 2019 will be modelled using:
- Poisson and zero-inflated Poisson regressions, based on the rethinking package;
- A logistic regression, based on the greta package.
In the process, we will conduct MCMC sampling, visualise posterior distributions, generate predictions and ultimately assess the influence of social parasitism on female reproductive output. You should have some familiarity with standard statistical models. If you need to refresh some basics of probability using R, have a look at my first post. All materials are available under https://github.com/monogenea/cuckooParasitism. I hope you enjoy it as much as I did!
It is human nature to try to reduce complexity in learning things, to discretise quantities, and this is especially true in modern statistics. When we need to estimate any given unknown parameter, we usually produce the single most plausible value. Think of flipping a coin a thousand times, not knowing whether and by how much it is biased. Let p̂ be the proportion of heads in the thousand trials. If I ask you to estimate p, the probability of having heads in any given trial, what would your answer be?
Chances are you will say p̂, which is a sensible choice (the hat means ‘estimate’). The observed frequency of heads is the maximum-likelihood estimate (MLE) of p in our experiment. This is the intuitive frequentist perspective endorsed by most people.
The Bayesian perspective is more comprehensive. It produces no single value, but rather a whole probability distribution for the unknown parameter conditional on your data. This probability distribution, P(p | data), is called the posterior. The posterior comes from one of the most celebrated works of Rev. Thomas Bayes that you have probably met before,

P(p | data) = P(data | p) × P(p) / P(data)

or, in plain words,

Posterior = Likelihood × Prior / Average likelihood

The posterior can be computed from three key ingredients:
- A likelihood distribution, P(data | p);
- A prior distribution, P(p);
- The ‘average likelihood’, P(data).
All Bayes' theorem does is update some prior belief by accounting for the observed data, while ensuring the resulting probability distribution has a total density of exactly one.
The following reconstruction of the theorem in three simple steps will bridge the gap between the frequentist and Bayesian perspectives.
Step 1. All possible ways (likelihood distribution)
Some five years ago, my brother and I were playing roulette in the casino of Portimão, Portugal. Among other things, you can bet on hitting either black (B) or red (r) with supposedly equal probability. For simplicity, we ignore the green zero pocket and assume I remember the ten past draws before we placed a bet — eight black and two red.
Having observed eight blacks based on these ten draws, my brother argued we should go for black. His reasoning was that there would be a greater chance of hitting black than red, to which I kind of agreed. We eventually placed a bet on black and won. Knowing nothing of the chances of hitting either colour in this example, p̂ = 8/10 = 0.8 is the MLE of P(B). This is the frequentist approach. But wouldn't you assume P(B) = 0.5?
A different way of thinking is to consider the likelihoods obtained using different estimates of P(B). If we compute the likelihood over 100 candidate estimates of P(B) ranging from 0 to 1, we can confidently approximate its distribution. Here, the probability mass function of the binomial distribution with eight successes in ten trials provides the likelihood of each candidate estimate of P(B). We can demonstrate it with a few lines of R code.
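A minimal sketch of that grid in base R (the variable names are my own):

```r
# Candidate estimates of P(B), the probability of black
p_grid <- seq(from = 0, to = 1, length.out = 100)

# Binomial likelihood of eight blacks in ten draws, for every candidate estimate
likelihood <- dbinom(8, size = 10, prob = p_grid)

# The grid value maximising the likelihood approximates the MLE of 0.8
p_grid[which.max(likelihood)]

# Visualise the likelihood distribution
plot(p_grid, likelihood, type = "l", xlab = "P(B)", ylab = "Likelihood")
```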
As the name indicates, the MLE in the roulette problem is the peak of the likelihood distribution. However, here we uncover an entire spectrum comprising all possible ways the observed draws could have been produced.
Step 2. Update your belief (prior distribution)
We are not even half-way through our Bayesian excursion. The omission of a prior, which is the same as passing a uniform prior, dangerously gives the likelihood free rein in inference. These priors are also called ‘flat’. On the other hand, informative priors constrain parameter estimation, the more so the narrower they are. This should resonate with those familiar with Lasso and ridge regularisation. Also, note that multiplying a likelihood distribution by a constant does not change its shape, even if it changes its density.
Going back to the roulette example, assume I intervened and expressed my belief to my brother that P(B) must be 0.5 or close — say, a narrow normal prior centred at 0.5. For comparison, overlay this prior distribution with the likelihood from the previous step.
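As a sketch, assuming a narrow normal prior centred at 0.5 (the exact prior used originally is not shown, so the sd of 0.1 is my own choice):

```r
p_grid <- seq(from = 0, to = 1, length.out = 100)
likelihood <- dbinom(8, size = 10, prob = p_grid)

# Prior belief that P(B) is 0.5 or close; sd = 0.1 is an assumption
prior <- dnorm(p_grid, mean = 0.5, sd = 0.1)

# Plot likelihood (black) and prior (red); the prior is divided by a
# constant solely for display scaling -- shape is what matters here
plot(p_grid, likelihood, type = "l", xlab = "P(B)", ylab = "Density")
lines(p_grid, prior / 15, col = "red")
```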
The prior is now shown in red. In the code above, I divided the prior by a constant solely for scaling purposes. Keep in mind that distribution density only matters for the posterior.
Computing the product between the likelihood and my prior is straightforward, and gives us the numerator from the theorem. The next bit will compute and overlay the unstandardised posterior of P(B), i.e. P(data | p) × P(p). The use of a sequence of candidate estimates of p to reconstruct probability distributions is called grid approximation.
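A sketch of that product, reusing the grid and the same assumed prior as before:

```r
p_grid <- seq(from = 0, to = 1, length.out = 100)
likelihood <- dbinom(8, size = 10, prob = p_grid)
prior <- dnorm(p_grid, mean = 0.5, sd = 0.1)  # assumed prior, as before

# Numerator of Bayes' theorem: likelihood times prior, over the whole grid
unstd_posterior <- likelihood * prior

# Overlay all three curves; the divisors are for display scaling only
plot(p_grid, likelihood, type = "l", xlab = "P(B)", ylab = "Density")
lines(p_grid, prior / 15, col = "red")
lines(p_grid, unstd_posterior / 2, col = "green")
```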
In short, we have successfully used the ten roulette draws (black) to update my prior (red) into the unstandardised posterior (green). Why am I calling it ‘unstandardised’? The answer comes with the denominator from the theorem.
Step 3. Make it sum up to one (standardising the posterior)
An important property of any probability density or mass function is that it integrates to one. This is the role of that ugly denominator we simply called the ‘average likelihood’. It standardises P(data | p) × P(p) into the actual posterior, with a total density of one. Since density is the sole difference, the posterior is always proportional to the unstandardised posterior:

P(p | data) ∝ P(data | p) × P(p)
That funny symbol, ∝, means ‘proportional to’. We will now finalise the roulette example by standardising the posterior computed above and comparing all pieces of the theorem.
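Finalising with the same assumed prior, the standardisation divides by the sum over the grid:

```r
p_grid <- seq(from = 0, to = 1, length.out = 100)
likelihood <- dbinom(8, size = 10, prob = p_grid)
prior <- dnorm(p_grid, mean = 0.5, sd = 0.1)  # assumed prior, as before
unstd_posterior <- likelihood * prior

# Standardise so the grid probabilities sum to exactly one; dividing by
# the sum plays the role of the 'average likelihood', P(data)
posterior <- unstd_posterior / sum(unstd_posterior)

# Shape is preserved: standardised and unstandardised posteriors peak together
plot(p_grid, posterior, type = "l", col = "green",
     xlab = "P(B)", ylab = "Probability")
```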
Note how shape is preserved between the unstandardised and the actual posterior distributions. In this instance we could use the unstandardised form for various things such as simulating draws. However, when additional parameters and competing models come into play you should stick to the actual posterior.
We have finally reached the final form of Bayes' theorem, P(p | data) = P(data | p) × P(p) / P(data). The posterior of P(B) can now be used to draw probability intervals or simulate new roulette draws. And there, we moved from a frequentist perspective to a fully-fledged Bayesian one.
Note that in this one example there was a single datum — the number of successes in a total of ten trials. We will see that with multiple data, the single-datum likelihoods and prior probabilities are all multiplied together. Moreover, when multiple parameters enter the model, the separate priors are all multiplied together as well. This might help with digesting the following example. In any case, remember it all goes into the numerator of the theorem.
Now it is time to step up to a more sophisticated analysis involving not one, but two parameters.
We will quickly cover all three steps in a simple simulation, demonstrating inference over the two parameters of a normal distribution, the mean μ and the standard deviation σ. The purpose of this example is two-fold: i) to make clear that the addition of more and more parameters makes posterior estimation increasingly inefficient using the grid approximation, and ii) to showcase the ability of Bayesian models to capture the true underlying parameters.
Take a sample of 100 observations from a normal distribution with fixed mean μ and standard deviation σ. Only you and I know the true parameter values. You can then use this sample to recover the original parameters using the following Bayesian pseudo-model,

x_i ~ Normal(μ, σ)
μ ~ Normal(0, s)
σ ~ Uniform(1, 3)

with the last two terms corresponding to the priors of μ and σ, respectively — a zero-centred normal for μ and a uniform over a plausible range for σ. All you need is
- To make all possible combinations of 200 values of μ spanning 0 to 10 with 200 values of σ spanning 1 to 3. These are the candidate estimates of μ and σ used to explain how the sample above was generated;
- Compute the likelihood for every one of the 200 × 200 combinations (grid approximation). This amounts to the product of P(x_j | μ_i, σ_i) over all data points, with i and j indexing parameter combinations and data points, respectively — or, in log-scale, the sum of log P(x_j | μ_i, σ_i), which turns out to be a lot easier in R;
- Compute the product between the likelihood (or sum, in log-scale) and the corresponding priors of μ and σ;
- Standardise the resulting product and recover the original units if using the log-scale. This standardisation, as you will note, divides the product of the prior and likelihood distributions by its maximum value, unlike the total density mentioned earlier. This is a more pragmatic way of obtaining probability values later used in posterior sampling.
We now have a joint posterior distribution of μ and σ that can be sampled from. How closely does a sample of size 1,000 match the true parameters?
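The steps above, plus the posterior sampling, can be sketched in base R. The true values μ = 5 and σ = 2 and the prior hyperparameters are my own illustrative assumptions, since the originals are not shown:

```r
set.seed(2019)
# Step 0: sample 100 observations with assumed true parameters
x <- rnorm(100, mean = 5, sd = 2)

# Step 1: all combinations of 200 candidate values for mu and sigma
grid <- expand.grid(mu = seq(0, 10, length.out = 200),
                    sigma = seq(1, 3, length.out = 200))

# Step 2: log-likelihood of the data for every combination
grid$loglik <- sapply(seq_len(nrow(grid)), function(i) {
  sum(dnorm(x, mean = grid$mu[i], sd = grid$sigma[i], log = TRUE))
})

# Step 3: add the log-priors -- a zero-centred normal for mu and a
# uniform for sigma (both hyperparameter choices are assumptions)
grid$logprod <- grid$loglik +
  dnorm(grid$mu, mean = 0, sd = 10, log = TRUE) +
  dunif(grid$sigma, min = 1, max = 3, log = TRUE)

# Step 4: standardise by the maximum and recover the natural scale
grid$prob <- exp(grid$logprod - max(grid$logprod))

# Finally, draw 1,000 parameter pairs from the joint posterior
idx <- sample(seq_len(nrow(grid)), size = 1000, replace = TRUE,
              prob = grid$prob)
plot(grid$mu[idx], grid$sigma[idx], pch = 16, col = rgb(0, 0, 0, .2),
     xlab = expression(mu), ylab = expression(sigma))
abline(v = 5, h = 2, col = "red", lty = 2)  # assumed true values
```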
The figure above displays a sample of size 1,000 from the joint posterior distribution of μ and σ. The true parameter values are highlighted by red dashed lines in the corresponding axes. Considering the use of a zero-centred prior for μ, it is satisfying to observe that the true value lands right in the centre of its marginal posterior. An equivalent observation can be made regarding σ. This is essentially the impact of the data on the inference. Try again with smaller sample sizes or more conservative, narrow priors. Can you anticipate the results?
Bayesian models & MCMC
Bayesian models are a departure from what we have seen above, in that explanatory variables are plugged in. As in traditional MLE-based models, each explanatory variable is associated with a coefficient, which for consistency we will call a parameter. Because the target outcome is also characterised by a prior and a likelihood, the model then approximates the posterior by finding a compromise between all sets of priors and corresponding likelihoods. This is in clear contrast to linear algebra techniques, such as the QR decomposition used in OLS. Finally, the introduction of link functions widens the range of problems that can be modelled, e.g. binary or multi-label classification, ordinal categorical regression, Poisson regression and binomial regression, to name a few. Such models are commonly called generalised linear models (GLMs).
One of the most attractive features of Bayesian models is that uncertainty with respect to the model parameters trickles down all the way to the target outcome level. No residuals. Even the uncertainty associated with outcome measurement error can be accounted for, if you suspect there is some.
So, why the current hype around Bayesian models? To a great extent, the major limitation of Bayesian inference has historically been posterior sampling. For most models, the analytical solution to the posterior distribution is intractable, if not impossible. The use of numerical methods, such as the grid approximation introduced above, might give a crude approximation. The inclusion of more parameters and different distribution families, though, has made the alternative Markov chain Monte Carlo (MCMC) sampling methods the method of choice.
However, that comes with a heavy computational burden. In high-dimensional settings, the heuristic MCMC methods chart the multivariate posterior by jumping from point to point. Those jumps are random in direction, but – and here is the catch – they make smaller jumps the larger the density at the current coordinates. As a consequence, they focus on the maxima of the joint posterior distribution, adding enough resolution to reconstruct it sufficiently well. The issue is that every single jump requires updating everything, and everything interacts with everything. Hence, posterior approximation has always been the main obstacle to scaling up Bayesian methods to larger dimensions. The revival of MCMC methods in recent years is largely due to the advent of more powerful machines and efficient frameworks we will soon explore.
Metropolis-Hastings, Gibbs and Hamiltonian Monte Carlo (HMC) are some of the most popular MCMC methods. From these we will be working with HMC, widely regarded as the most robust and efficient.
greta and rethinking are popular R packages for conducting Bayesian inference that complement each other. I find it unfair to pit the two against each other, and hope future releases will further enhance their compatibility. In any case, here is my impression of the pros and cons at the time of writing:
Missing value imputation is only available in
rethinking. This is an important and simple feature, as in Bayesian models imputation works just like parameter sampling. It goes without saying, it helps rescue additional information otherwise unavailable. As we will see, the cuckoo reproductive output data contain a large number of NAs.
The syntax in
rethinking and
greta is very different. Because
greta relies on TensorFlow, every model must be fully supported by tensors. This means that custom operations may require hard-coded functions built from TensorFlow operations. Moreover,
greta models are built bottom-up, whereas
rethinking models are built top-down. I find the top-down flow slightly more intuitive and compact.
When it comes to model types, the two packages offer different options. I noticed some models from
rethinking are currently unavailable in
greta and vice versa. Nick Golding, one of the maintainers of
greta, was kind enough to implement an ordinal categorical regression upon my forum inquiry. Also, and relevant to what we are doing next, zero-inflated Poisson regression is not available in
greta. Overall this does not mean
greta has less to choose from. If you are interested in reading more, refer to the corresponding CRAN documentation.
The two packages deploy HMC sampling supported by two of the most powerful libraries out there. Stan (also discussed in Richard’s book) is a statistical programming language famous for its MCMC framework. It has been around for a while and was eventually adapted to R via RStan, which is implemented in C++. TensorFlow, on the other hand, is far more recent. Its cousin, TensorFlow Probability, is a rich resource for Bayesian analysis. Both TensorFlow libraries efficiently distribute computation over your CPU and GPU, resulting in substantial speedups.
The two packages come with different visualisation tools. For posterior distributions, I preferred the
bayesplot support for
greta, whilst for simulation and counterfactual plots, I resorted to the more flexible
rethinking plotting functions.
Let’s get started with R
Time to put it all into practice using the
rethinking and
greta R packages. We will be looking into the data from a study by Riehl et al., 2019 on female reproductive output in Crotophaga major, also known as the greater ani cuckoo.
The females of this cuckoo species display both cooperative and parasitic nesting behaviour. At the start of the season, females are more likely to engage in cooperative nesting than either solitary nesting or parasitism. While cooperative nesting involves sharing nests between two or more females, parasitism involves laying eggs in other conspecific nests, to be cared for by the hosts. However, the parasitic behaviour can be favoured under certain conditions. This leads us to the three main hypotheses of why greater ani females undergo parasitism:
- The ‘super mother’ hypothesis, whereby females simply have too many eggs for their own nest, therefore parasitising other nests;
- The ‘specialised parasites’ hypothesis, whereby females engage in a lifelong parasitic behaviour;
- The ‘last resort’ hypothesis, whereby parasitic behaviour is elicited after own nest or egg losses, such as by nest predation.
This study found the strongest support for the third hypothesis. The authors fitted a mixed-effects logistic regression of parasitic behaviour, using both female and group identities as nested random effects. Among the key findings:
- Parasitic females laid more eggs than solely cooperative females;
- Parasitic eggs were significantly smaller than non-parasitic eggs;
- Loss rate was higher for parasitic eggs compared to non-parasitic ones, presumably due to host rejection;
- Exclusive cooperative behaviour and a mixed strategy between cooperative and parasitic behaviours yielded similar numbers in fledged offspring.
If you find the paper overly technical or inaccessible, you can read more about it in the corresponding News & Views, an open-access synthesis article from Nature.
Start off by loading the relevant packages and downloading the cuckoo reproductive output dataset. It is one of three tabs in an Excel spreadsheet comprising a total of 607 records and 12 variables.
The variables are the following:
Year, year the record was made (2007-2017)
Female_ID_coded, female unique identifier (n = 209)
Group_ID_coded, group unique identifier (n = 92)
Parasite, binary encoding for parasitism (1 / 0)
Min_age, minimum age (3-13 years)
Successful, binary encoding for reproductive success / fledging (1 / 0)
Group_size, number of individuals per group (2-8)
Mean_eggsize, mean egg size (24-37g)
Eggs_laid, number of eggs laid (1-12)
Eggs_incu, number of eggs incubated (0-9)
Eggs_hatch, number of eggs hatched (0-5)
Eggs_fledged, number of eggs fledged (0-5)
For modelling purposes, some of these variables will be mean-centered and scaled to unit variance. These will be subsequently identified using the Z suffix.
From the whole dataset, only 57% of the records are complete. Missing values are present in
Group_size (4.1%) and
Successful (1.8%). Needless to say, the laid, incubated, hatched and fledged egg counts, in this order, cover successive reproductive stages and are all inter-dependent. Naturally, there is a carry-over of egg losses that impacts counts in later stages.
The following models will incorporate main intercepts, varying intercepts (also known as random effects), multiple effects, interaction terms and imputation of missing values. Hopefully the definitions are sufficiently clear.
Zero-inflated Poisson regression of fledged egg counts
I found modelling the number of eggs fledged an interesting problem. As a presumable count variable,
Eggs_fledged could be considered Poisson-distributed. However, if you summarise the counts you will note there is an excessively large number of zeros for a Poisson variable. This is unsurprising, since a lot of eggs do not make it through, as detailed above. Fortunately, the zero-inflated Poisson regression (ZIPoisson) available from
rethinking accommodates an additional probability parameter from a binomial distribution, which relocates part of the zero counts out of a Poisson component.
Load the relevant tab from the spreadsheet (“Female Reproductive Output”) and discard records with missing counts in
Eggs_fledged. You should have a total of 575 records. The remaining missing values will be imputed by the model. Then, re-encode
Year. This is because grouping factors must be numbered in order to incorporate varying intercepts with
rethinking. This will help us rule out systematic differences among females, groups and years from the main effects. Finally, add the standardised versions of
Min_age, Group_size and Mean_eggsize to the dataset.
The log-link is a convenient way to restrict λ, i.e. the rate of a Poisson regression, to non-negative values. The logit-link, on the other hand, can be used to restrict model probabilities to values between zero and one; it is reverted by the logistic function. My proposed model of fledged egg counts is a ZIPoisson with a lone intercept on the logit-scale probability of excess zeros, and a log-scale Poisson rate carrying a main intercept, varying intercepts for female, group and year, and the effects of parasitism, standardised age, their interaction, standardised group size and standardised mean egg size.
In terms of code, this is how it looks. The HMC will be run using 5,000 iterations, 1,000 of which for warmup, with four independent chains, each on its own CPU core. Finally, the
precis call shows the 95% highest-density probability interval (HPDI) of all marginal posterior distributions, which you can also visualise with the associated plot method.
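Since the original code listing was lost, here is a hedged sketch of how such a ZIPoisson model could be written with rethinking's map2stan; the parameter names, priors and the model object `m_fledged` are my own guesses and may differ from the original:

```r
library(rethinking)

# Hypothetical sketch; `d` is the pre-processed data frame with the
# standardised covariates (Z suffix) and re-encoded grouping factors
m_fledged <- map2stan(
  alist(
    Eggs_fledged ~ dzipois(p, lambda),
    logit(p) <- ap,                              # probability of excess zeros
    log(lambda) <- a + a_fem[Female_ID_coded] + a_year[Year] +
      a_group[Group_ID_coded] +
      bP * Parasite + bA * Min_ageZ + bPA * Parasite * Min_ageZ +
      bGS * Group_sizeZ + bES * Mean_eggsizeZ,
    Mean_eggsizeZ ~ dnorm(0, 1),                 # imputes missing egg sizes
    c(ap, a, bP, bA, bPA, bGS, bES) ~ dnorm(0, 1),
    a_fem[Female_ID_coded] ~ dnorm(0, sigma_fem),
    a_year[Year] ~ dnorm(0, sigma_year),
    a_group[Group_ID_coded] ~ dnorm(0, sigma_group),
    c(sigma_fem, sigma_year, sigma_group) ~ dcauchy(0, 1)
  ),
  data = d, iter = 5000, warmup = 1000, chains = 4, cores = 4
)
precis(m_fledged, prob = 0.95)
```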
There are no discernible effects on the number of fledged eggs, as zero can be found inside all reported 95% HPDIs. The effect of parasitism (
bP) is slightly negative in this log-scale, which suggests an overall modest reduction in the Poisson rate.
The following code will extract a sample of size 16,000 from the joint posterior of the model we have just created. The samples of the zero-inflation intercept, in particular, will be passed to the logistic function to recover the respective probabilities.
Then, the code produces a counterfactual plot from Poisson rate predictions. We will use counterfactual predictions to compare parasitic and non-parasitic females with varying age and everything else fixed. It will tell us, in the eyes of our model, how many fledged eggs on average some hypothetical females produce. For convenience, we now consider the ‘average’ female, with average mean egg size and average group size, parasitic or not, and with varying standardised age. The calculations from the marginal parameter posteriors are straightforward. Finally, a simple exponentiation returns the predicted Poisson rates.
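A sketch of these calculations, assuming hypothetical parameter names (a, ap, bP, bA, bPA) and model object name, which may differ from the original:

```r
library(rethinking)

# Extract 16,000 samples from the joint posterior of the fitted model
# (`m_fledged` is a hypothetical model object name)
post <- extract.samples(m_fledged, n = 16000)

# Probabilities of excess zeros, via the logistic function
p_zero <- logistic(post$ap)

# Counterfactual Poisson rates for the 'average' female (standardised
# covariates fixed at zero), parasitic or not, over standardised ages
age_seq <- seq(-2, 3, length.out = 50)
lambda_nonpar <- sapply(age_seq, function(a) exp(post$a + post$bA * a))
lambda_par <- sapply(age_seq, function(a)
  exp(post$a + post$bP + (post$bA + post$bPA) * a))

# Mean and 95% HPDI of the parasitic predictions over the age range
lambda_par_mu <- apply(lambda_par, 2, mean)
lambda_par_hpdi <- apply(lambda_par, 2, HPDI, prob = 0.95)
```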
In summary, from the joint posterior sample of size 16,000 we i) took the marginal posterior of the zero-inflation intercept and passed it to the logistic function to return the corresponding probabilities, and ii) predicted λ from the marginal posteriors of its constituent parameters by plugging in hand-picked values.
We can now examine the distribution of the sampled probabilities and predicted Poisson rates. In the case of the latter, both the mean and the 95% HPDI over a range of standardised age need to be calculated.
The left panel shows the posterior probability distribution of p, the parameter that goes into the binomial component of the model. It spans the interval between 0.20 and 0.50. Take it as the probability of producing a zero instead of following the Poisson distribution in any single Bernoulli trial.
We now turn to the counterfactual plot in the right panel. If for a moment we distinguish predictions made assuming non-parasitic or parasitic behaviour as λ_NP and λ_P, respectively, then it shows the mean λ_NP as a full black line, with the dark grey shading representing the 95% HPDI of λ_NP, and the mean λ_P as a dashed red line, with the light red shading representing the 95% HPDI of λ_P.
It seems that the age of a non-parasitic ‘average’ female is not associated with major changes in the number of fledged eggs, whereas the parasitic ‘average’ female does seem to show a modest increase the older she is. Note how uncertainty increases towards more extreme ages, on both sides of the panel. From my perspective, parasitic and non-parasitic C. major females are indistinguishable with respect to fledged egg counts over most of their reproductive life.
Poisson regression of laid egg counts
The following model, also based on
rethinking, aims at predicting laid egg counts instead. Unlike
Eggs_fledged,
Eggs_laid has a minimum value of one, with no zero-inflation of the kind we met before. The problem, however, is that we need to further filter the records to clear the missing values left in
Eggs_laid. You should be left with 514 records in total. For consistency, re-standardise the variables standardised in the previous exercise.
Except for the target outcome, the model is identical to the Poisson component of the previous ZIPoisson regression.
And here is the corresponding implementation, with the same settings for HMC sampling as before.
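A sketch mirroring the ZIPoisson implementation minus the binomial component; names and priors are again my own guesses:

```r
library(rethinking)

# Hypothetical sketch; `d2` is the further-filtered data frame (514 records)
m_laid <- map2stan(
  alist(
    Eggs_laid ~ dpois(lambda),
    log(lambda) <- a + a_fem[Female_ID_coded] + a_year[Year] +
      a_group[Group_ID_coded] +
      bP * Parasite + bA * Min_ageZ + bPA * Parasite * Min_ageZ +
      bGS * Group_sizeZ + bES * Mean_eggsizeZ,
    Mean_eggsizeZ ~ dnorm(0, 1),                 # imputes missing egg sizes
    c(a, bP, bA, bPA, bGS, bES) ~ dnorm(0, 1),
    a_fem[Female_ID_coded] ~ dnorm(0, sigma_fem),
    a_year[Year] ~ dnorm(0, sigma_year),
    a_group[Group_ID_coded] ~ dnorm(0, sigma_group),
    c(sigma_fem, sigma_year, sigma_group) ~ dcauchy(0, 1)
  ),
  data = d2, iter = 5000, warmup = 1000, chains = 4, cores = 4
)
precis(m_laid, prob = 0.95)
```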
Here, you will note that the 95% HPDI of the
bP posterior is to the left of zero, suggesting an overall negative effect of parasitism on the number of eggs laid. The interaction term
bPA, too, displays a strong negative effect.
Now apply the same recipe as above: produce a sample of size 16,000 from the joint posterior; predict the Poisson rates of the ‘average’ female, parasitic or not, with varying standardised age; exponentiate the calculations to retrieve the predicted λ; and compute the mean and 95% HPDI of the predicted rates over a range of standardised ages.
The colour scheme is the same: the mean prediction for non-parasitic females is shown as a full black line with its 95% HPDI in dark grey shading, and the mean for parasitic females as a dashed red line with its 95% HPDI in light red shading. Compared to the previous one, this counterfactual plot displays a starker contrast between parasitising and non-parasitising females. The young ‘average’ parasitic female lays more eggs than the young ‘average’ non-parasitic female, and this difference seems to revert with age, i.e. the old ‘average’ parasitic female lays fewer eggs than the old ‘average’ non-parasitic female. Essentially, whilst strictly cooperative females keep a constant clutch size over their reproductive life, parasitic behaviour leads to a steady decline the older the female bird is.
This time you will go one step further and simulate laid egg counts from the predicted Poisson rates, with varying ages. In a nutshell, you established above the mean predicted λ and the corresponding 95% HPDI, and now those rate predictions will be used to sample counts from the respective Poisson distributions. To make interpretation easier, plot the mean and 95% HPDI over the same range as above. Then, simply overlay the 95% HPDI of the resulting sampled laid egg counts.
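The simulation step itself is plain base R: feed each sampled rate into rpois. A minimal, self-contained illustration for a single hypothetical age, with made-up rates standing in for one column of the counterfactual predictions:

```r
set.seed(1)

# Hypothetical posterior sample of predicted Poisson rates at one age
lambda_pred <- rnorm(16000, mean = 4.2, sd = 0.3)
lambda_pred <- pmax(lambda_pred, 0)  # rates must be non-negative

# Sample one laid egg count per sampled rate
counts <- rpois(16000, lambda = lambda_pred)

# A simple 95% percentile interval of the simulated counts (rethinking's
# HPDI function would give the highest-density interval instead)
quantile(counts, probs = c(0.025, 0.975))
```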
The mean predicted rate is shown as a full line, the dark grey shading represents its 95% HPDI, and the light grey shading represents the 95% HPDI of the resulting count samples.
Logistic regression of female reproductive success
We will now turn to a logistic regression of female reproductive success, using
greta. Because
greta limits the input to complete cases, we need to select complete records, which gives a total of 346. Also, this being a different model, I used a different set of explanatory variables. The pre-processing, as you will note, is very much in line with that of the previous models.
To be consistent, I have again re-encoded
Year as done with the
rethinking models above. The logistic regression will be set up along the same lines, with a logit link mapping the linear predictor onto the probability of reproductive success.
And finally, the model implementation. We will once again produce a posterior sample of size 16,000, separated into four chains and using up to ten CPU cores, with 1,000 warmup rounds per chain. Note the difference in the incorporation of
Year as varying intercepts.
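Since the original listing was lost, here is a hedged sketch of what the greta implementation could look like; the data frame `d3`, the covariate set and the priors are my own assumptions:

```r
library(greta)

# Data (hypothetical object names): complete cases only
age      <- as_data(d3$Min_ageZ)
parasite <- as_data(d3$Parasite)
success  <- as_data(d3$Successful)
year     <- as.factor(d3$Year)

# Priors
a      <- normal(0, 1)
bP     <- normal(0, 1)
bA     <- normal(0, 1)
bPA    <- normal(0, 1)
a_year <- normal(0, 1, dim = nlevels(year))   # varying intercepts for Year

# Linear predictor on the logit scale, mapped to a probability
eta <- a + a_year[as.numeric(year)] +
  bP * parasite + bA * age + bPA * parasite * age
p <- ilogit(eta)

# Likelihood
distribution(success) <- bernoulli(p)

# Sample 16,000 draws: four chains of 4,000 samples, 1,000 warmup each
m_success <- model(a, bP, bA, bPA, a_year)
draws <- mcmc(m_success, n_samples = 4000, warmup = 1000, chains = 4)
```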
We can visualise the marginal posteriors via
bayesplot::mcmc_intervals or alternatively
bayesplot::mcmc_areas. The figure above reflects a case similar to the ZIPoisson model. While parasitism displays a clear negative effect on reproductive success, note how strongly it interacts with age to improve reproductive success.
Proceeding to the usual counterfactual plot, note again that the above estimates are on the logit scale, so we need the logistic function once more to recover probability values.
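The recovery itself is one line of base R; plogis is the built-in logistic (inverse-logit) function. The example values are hypothetical:

```r
# Map logit-scale estimates back to probabilities with the logistic function
logit_estimates <- c(-1.5, 0, 1.5)   # hypothetical logit-scale values
probs <- plogis(logit_estimates)     # equivalently 1 / (1 + exp(-x))
round(probs, 3)
```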
As with the previous predicted Poisson rates, the mean prediction for non-parasitic females is shown as a full black line with its 95% HPDI in dark grey shading, and the mean for parasitic females as a dashed red line with its 95% HPDI in light red shading. This new counterfactual plot shows how parasitic females tend to be more successful the older they are, compared to non-parasitic females. Nonetheless, one could argue the increase in uncertainty makes this a weak case.
That was it. You should now have a basic idea of Bayesian models and the inherent probabilistic inference that helps prevent the misuse of hypothesis testing, commonly referred to as P-hacking in many scientific areas. Altogether, the models above suggest that:
- Parasitism in older C. major females has a modest, positive contribution to the number of fledged eggs. The minor negative effect of parasitism is counterbalanced by its interaction with age, which displays a small positive effect on the number of fledged eggs. Non-parasitic females, on the other hand, show no discernible change in fledged egg counts with varying age. These observations are supported by the corresponding counterfactual plot;
- In the case of laid eggs, there seems to be a strong negative effect exerted by both parasitism and its interaction with age. The counterfactual plot shows that whereas young parasitic females lay more eggs than young non-parasitic females, there is a turning point in age where parasitism leads to comparatively fewer eggs. The additional simulation of laid egg counts further supports this last observation;
- Notably, reproductive success seems to be also affected by the interaction between age and parasitism status. The results are similar to that from the ZIPoisson model, showing an increase in success probability with increasing age that outpaces that for non-parasitic females.
Thus, relative to non-parasitic females, older parasitic females lay fewer eggs, yet fledge more of them. Reproductive success also seems to be boosted by parasitism in older females. There are many plausible explanations for this set of observations, and causation is nowhere implied. It could well reflect masking effects from unknown factors. Nonetheless, I find it interesting that older parasitising females could render as many or more fledglings from smaller clutches, compared to non-parasitising females. Could older parasitic females simply be more experienced? Could they lay fewer eggs due to nest destruction? How to explain the similar rate of reproductive success?
Much more could be done, and I am listing some additional considerations for a more rigorous analysis:
- We haven’t formally addressed model comparison. In most cases, models can be easily compared on the basis of information criteria, such as the deviance information criterion (DIC) and the widely applicable information criterion (WAIC), which assess the compromise between model fit and the effective number of parameters;
- We haven’t looked into the MCMC chains. During model sampling, you probably read some warnings about divergent transitions and failures to converge. The
rethinking package helps you tame the chains by reporting the Gelman-Rubin convergence statistic
Rhat and the number of effective samples
n_eff whenever you invoke
precis. If
Rhat is not approximately one or
n_eff is very small for any particular parameter, it might warrant careful reparameterisation. User-provided initial values to seed the HMC sampling can also help;
- We haven’t looked at measurement error or over-dispersed outcomes. These come in handy when the target outcome has a very large variance or exhibits deviations from theoretical distributions;
- We haven’t considered mixed versus exclusive cooperative or parasitic behaviours, so any comparison with the original study is unfounded.
Finally, a word of appreciation to Christina Riehl, for clarifying some aspects of the dataset, and to Nick Golding, for his restless support in the greta forum.
This is me writing up the introduction to this post in Santorini, Greece. My next project will be about analysing Twitter data, so stay tuned. Please leave a comment, a correction or a suggestion!
Riehl, Christina and Strong, Meghan J. (2019). Social parasitism as an alternative reproductive tactic in a cooperatively breeding cuckoo. Nature, 567(7746), 96-99.