--- title: "Estimating sample means" author: "Luce" date: "19 September, 2023" output: html_document: toc: yes pdf_document: toc: yes editor_options: chunk_output_type: console --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ## Loops Computers are very good at dealing with repetitive tasks. Often we would like to perform a task for some number of objects, or a certain number of times. While we could just repeat the same command over and over again, changing one parameter each time, a better and less error-prone way is to write a loop. Here we will look at using a `for` loop, which takes some number of elements and runs a series of commands for each element. The `for` loop starts by assigning the first element to a user-defined variable, and executing the statements in the body of the code. Then, if there are more elements, it goes back to step 1, assigns the next element to the variable, and so on, until there are no more elements, at which point it exits the `for` loop and goes on to the next statement. ### My first `for` loop ```{r first_loop} for (i in 1:5) { remainder <- i %% 2 squared <- i * i if (remainder == 0) { print(i) print("Even") } else { print(i) print("Odd") print(squared) } } ``` While it is very common to loop over numeric vectors, a `for` loop can iterate over almost any kind of R object. ```{r character_vector_loop} for (name in c("Penelope", "Peter", "Paul")) { print(nchar(name)) } ``` ## Efficient loops We saw an example of a loop in the last class, when we used numerical simulation to determine how often the true mean was found outside of the 95% confidence interval of the mean. ```{r count_bad_means} set.seed(0) d <- data.frame(lower = numeric(0), mean = numeric(0), upper = numeric(0)) for (i in 1:100) { x <- rnorm(10000) d[i, ] <- mean(x) + c(-1.96, 0, +1.96) * sd(x) / sqrt(length(x)) } (bad_estimate_count <- length(which(d$lower > 0 | d$upper < 0))) ``` We will go into more detail on code quality in a later lecture, and while the most important thing is always to get your code running first, and then work on efficiency, there are a few things that we could do to improve this piece of code. Firstly, for loops in R often run more slowly than you might expect because, as in the code above, when we add a new row, we have to re-assign the data frame to a new (larger) object (which involves copying all of the existing rows). Much more efficient is to assign the expected size of the data frame first, and then fill it in as you go through the loop. When we are doing 100 iterations, this doesn't make much difference, but can become very significant at scale. ```{r assigned_dataframe} set.seed(0) num_iterations <- 100 d <- data.frame(lower = numeric(num_iterations), mean = numeric(num_iterations), upper = numeric(num_iterations)) for (i in 1:num_iterations) { x <- rnorm(10000) d[i, ] <- mean(x) + c(-1.96, 0, +1.96) * sd(x) / sqrt(length(x)) } (bad_estimate_count <- length(which(d$lower > 0 | d$upper < 0))) ``` In R Markdown, we can even populate our text with values that are generated on-the-fly, as the document is knitted: Out of `r num_iterations` iterations, we found `r length(which(d$lower > 0 | d$upper < 0))` 95% CIs which did not contain the true mean (we can do the calculation in the text!). ### Making a function But hardcoding the number of iterations is clunky. Wouldn't it be nicer if the user could say how often they wanted to run the simulation? We can do this by making it into a function, which takes as an argument the number of iterations, and returns the number of "wrong" means. ```{r means_function} set.seed(0) runSimulation <- function(num_iterations = 100) { d <- data.frame(lower = numeric(num_iterations), mean = numeric(num_iterations), upper = numeric(num_iterations)) for (i in 1:num_iterations) { x <- rnorm(10000) d[i, ] <- mean(x) + c(-1.96, 0, +1.96) * sd(x) / sqrt(length(x)) } length(which(d$lower > 0 | d$upper < 0)) } runSimulation(100) ``` **Bonus question**: Why do we not want to `set.seed()` inside the function? ## Central Limit Theorem In the last lecture, we stated that, even if the distribution that we were sampling from was not normal, the distribution of the estimated means would be! *How might we go about testing this statement?* If we could get our hands on the means that are generated by the loop, we could examine the raw data directly. Instead of doing the comparison of sample mean versus true mean within the function, we could return the data frame instead... ```{r return_estimated_means} runSimulation <- function(num_iterations = 100) { d <- data.frame(lower = numeric(num_iterations), mean = numeric(num_iterations), upper = numeric(num_iterations)) for (i in 1:num_iterations) { x <- rnorm(10000) d[i, ] <- mean(x) + c(-1.96, 0, +1.96) * sd(x) / sqrt(length(x)) } return(d) } ``` ... and now we can look at the distribution of the estimated means. ```{r plot_means} est_means <- runSimulation(1000) str(est_means) summary(est_means) plot(density(est_means$mean)) ``` ### Normal versus uniform underlying distributions Now we are in a position to test our hypothesis. We would like to compare the above distribution of estimated means of samples from a normal distribution, to estimated means of samples from a uniform distribution. First let's take a quick look at what the density plot for a uniform distribution looks like: ```{r uniform_distribution_eg} plot(density(runif(10000))) ``` We create a new function, this time sampling from a uniform distribution (we will learn more about other distributions in the next lecture). The only change we make in our code is to replace the `rnorm()` function with `runif()`. ```{r uniform_distribution} runUnifSimulation <- function(num_iterations = 100) { d <- data.frame(lower = numeric(num_iterations), mean = numeric(num_iterations), upper = numeric(num_iterations)) for (i in 1:num_iterations) { x <- runif(10000) d[i, ] <- mean(x) + c(-1.96, 0, +1.96) * sd(x) / sqrt(length(x)) } return(d) } est_unif_means <- runUnifSimulation(1000) plot(density(est_unif_means$mean)) ``` We might want to compare the two plots side by side, maybe even adding a stripchart (which gets less informative as the number of points increases!) ```{r, echo = FALSE} par(mfrow = c(1,2)) plot(density(est_means$mean), main = "Sampling\nNormal distribution") rug(est_means$mean) plot(density(est_unif_means$mean), main = "Sampling\nUniform distribution") rug(est_unif_means$mean) par(mfrow = c(1,1)) ```