Loops

Computers are very good at dealing with repetitive tasks. Often we would like to perform a task for some number of objects, or a certain number of times. While we could just repeat the same command over and over again, changing one parameter each time, a better and less error-prone way is to write a loop.

Here we will look at using a for loop, which takes some number of elements and runs a series of commands for each element. The for loop starts by assigning the first element to a user-defined variable, and executing the statements in the body of the code. Then, if there are more elements, it goes back to step 1, assigns the next element to the variable, and so on, until there are no more elements, at which point it exits the for loop and goes on to the next statement.

My first for loop

for (i in 1:5) {
  remainder <- i %% 2
  squared <- i * i
  if (remainder == 0) {
    print(i)
    print("Even")
  } else {
    print(i)
    print("Odd")
    print(squared)
  }
}
## [1] 1
## [1] "Odd"
## [1] 1
## [1] 2
## [1] "Even"
## [1] 3
## [1] "Odd"
## [1] 9
## [1] 4
## [1] "Even"
## [1] 5
## [1] "Odd"
## [1] 25

While it is very common to loop over numeric vectors, a for loop can iterate over almost any kind of R object.

for (name in c("Penelope", "Peter", "Paul")) {
  print(nchar(name))
}
## [1] 8
## [1] 5
## [1] 4

Efficient loops

We saw an example of a loop in the last class, when we used numerical simulation to determine how often the true mean was found outside of the 95% confidence interval of the mean.

set.seed(0)
d <- data.frame(lower = numeric(0), mean = numeric(0), upper = numeric(0))
for (i in 1:100) {
  x <- rnorm(10000)
  d[i, ] <- mean(x) + c(-1.96, 0, +1.96) * sd(x) / sqrt(length(x))
}
(bad_estimate_count <- length(which(d$lower > 0 | d$upper < 0)))
## [1] 7

We will go into more detail on code quality in a later lecture, and while the most important thing is always to get your code running first, and then work on efficiency, there are a few things that we could do to improve this piece of code. Firstly, for loops in R often run more slowly than you might expect because, as in the code above, when we add a new row, we have to re-assign the data frame to a new (larger) object (which involves copying all of the existing rows). Much more efficient is to assign the expected size of the data frame first, and then fill it in as you go through the loop. When we are doing 100 iterations, this doesn’t make much difference, but can become very significant at scale.

set.seed(0)
num_iterations <- 100
d <- data.frame(lower = numeric(num_iterations), 
                mean  = numeric(num_iterations), 
                upper = numeric(num_iterations))
for (i in 1:num_iterations) {
  x <- rnorm(10000)
  d[i, ] <- mean(x) + c(-1.96, 0, +1.96) * sd(x) / sqrt(length(x))
}
(bad_estimate_count <- length(which(d$lower > 0 | d$upper < 0)))
## [1] 7

In R Markdown, we can even populate our text with values that are generated on-the-fly, as the document is knitted: Out of 100 iterations, we found 7 95% CIs which did not contain the true mean (we can do the calculation in the text!).

Making a function

But hardcoding the number of iterations is clunky. Wouldn’t it be nicer if the user could say how often they wanted to run the simulation? We can do this by making it into a function, which takes as an argument the number of iterations, and returns the number of “wrong” means.

set.seed(0)
runSimulation <- function(num_iterations = 100) {
  d <- data.frame(lower = numeric(num_iterations), 
                  mean  = numeric(num_iterations), 
                  upper = numeric(num_iterations))
  for (i in 1:num_iterations) {
    x <- rnorm(10000)
    d[i, ] <- mean(x) + c(-1.96, 0, +1.96) * sd(x) / sqrt(length(x))
  }
  length(which(d$lower > 0 | d$upper < 0))
}

runSimulation(100)
## [1] 7

Bonus question: Why do we not want to set.seed() inside the function?

Central Limit Theorem

In the last lecture, we stated that, even if the distribution that we were sampling from was not normal, the distribution of the estimated means would be!

How might we go about testing this statement?

If we could get our hands on the means that are generated by the loop, we could examine the raw data directly. Instead of doing the comparison of sample mean versus true mean within the function, we could return the data frame instead…

runSimulation <- function(num_iterations = 100) {
  d <- data.frame(lower = numeric(num_iterations), 
                  mean  = numeric(num_iterations), 
                  upper = numeric(num_iterations))
  for (i in 1:num_iterations) {
    x <- rnorm(10000)
    d[i, ] <- mean(x) + c(-1.96, 0, +1.96) * sd(x) / sqrt(length(x))
  }
  return(d)
}

… and now we can look at the distribution of the estimated means.

est_means <- runSimulation(1000)
str(est_means)
## 'data.frame':    1000 obs. of  3 variables:
##  $ lower: num  -0.02347 -0.01687 -0.00424 -0.03402 -0.0248 ...
##  $ mean : num  -0.00375 0.00285 0.01559 -0.01433 -0.00488 ...
##  $ upper: num  0.01596 0.02257 0.03542 0.00535 0.01503 ...
summary(est_means)
##      lower               mean                upper         
##  Min.   :-0.05052   Min.   :-0.0310289   Min.   :-0.01154  
##  1st Qu.:-0.02694   1st Qu.:-0.0072949   1st Qu.: 0.01234  
##  Median :-0.02006   Median :-0.0004742   Median : 0.01912  
##  Mean   :-0.02015   Mean   :-0.0005451   Mean   : 0.01906  
##  3rd Qu.:-0.01322   3rd Qu.: 0.0063474   3rd Qu.: 0.02602  
##  Max.   : 0.01802   Max.   : 0.0374115   Max.   : 0.05680
plot(density(est_means$mean))

Normal versus uniform underlying distributions

Now we are in a position to test our hypothesis. We would like to compare the above distribution of estimated means of samples from a normal distribution, to estimated means of samples from a uniform distribution.

First let’s take a quick look at what the density plot for a uniform distribution looks like:

plot(density(runif(10000)))

We create a new function, this time sampling from a uniform distribution (we will learn more about other distributions in the next lecture). The only change we make in our code is to replace the rnorm() function with runif().

runUnifSimulation <- function(num_iterations = 100) {
  d <- data.frame(lower = numeric(num_iterations), 
                  mean  = numeric(num_iterations), 
                  upper = numeric(num_iterations))
  for (i in 1:num_iterations) {
    x <- runif(10000)
    d[i, ] <- mean(x) + c(-1.96, 0, +1.96) * sd(x) / sqrt(length(x))
  }
  return(d)
}

est_unif_means <- runUnifSimulation(1000)
plot(density(est_unif_means$mean))

We might want to compare the two plots side by side, maybe even adding a stripchart (which gets less informative as the number of points increases!)