Random Variables And Distributions

Common Formulas

rnorm, dnorm, pnorm, qnorm (and related r, d, p, q prefixes, e.g. runif), set.seed, integrate, hist, curve, polygon

Overview

Built into R are many commonly named distributions. For example, a normal distribution is abbreviated norm, a uniform distribution is abbreviated unif, a beta distribution is abbreviated beta, an exponential distribution is abbreviated exp, etc.

R offers four main functions to work with these named distributions. The syntax of these four functions prefixes a single letter (r, d, p, or q) to the abbreviations described in the previous paragraph. We will discuss what each of these prefixes does in the sections to follow. For the sake of consistent illustration, we will stick to the normal distribution for our examples (so we will explore rnorm, dnorm, pnorm, and qnorm). In R, the parameters of a normal distribution are the mean and the standard deviation (not the variance!). If no parameters are specified, R assumes a standard normal distribution (mean 0, standard deviation 1).
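As a minimal sketch of the naming convention, the snippet below applies the same four prefixes to the uniform and exponential distributions and shows one way to hand a variance to rnorm. The seed, the rate of 2, and the variance of 4 are arbitrary values chosen for illustration; they do not come from the text above.

```r
# The same r/d/p/q prefixes work for other built-in distributions.
set.seed(1)                       # arbitrary seed, only so the draws repeat
u_draws  <- runif(3)              # r: three draws from Uniform(0, 1)
u_height <- dunif(0.5)            # d: density height at 0.5, which is 1
u_cdf    <- punif(0.25)           # p: P(U <= 0.25), which is 0.25
e_median <- qexp(0.5, rate = 2)   # q: median of Exponential(rate = 2), i.e. log(2)/2

# rnorm expects a standard deviation, so take the square root of a variance.
my_variance <- 4                  # illustrative value
v_draws <- rnorm(5, mean = 10, sd = sqrt(my_variance))
```

Passing the variance directly as sd would silently simulate from a much wider distribution, which is why the conversion is written out explicitly here.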

Random Number Generation (r)

We start with R's ability to generate random draws from a specific family of distributions with specific parameters. As mentioned previously, R assumes a standard normal by default (i.e. if the parameters aren’t specified, the mean is taken to be 0 and the standard deviation 1). We can generate n draws from a standard normal with rnorm(n). Since these are meant to simulate random draws, running the same code at different times will produce different output. To make your code exactly reproducible, use set.seed(k), where k is some integer. The picture below illustrates this point. Notice how running rnorm(5) twice returns different values, but after setting the same “seed”, the calls return the exact same 5 elements. Notice also that set.seed() must be called again before each draw you want to reproduce, as the second-to-last command shows.
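The seeding behavior can be sketched in a few lines. The variable names are hypothetical, and the seed value 502 is simply the one used in the code listing at the end of this section.

```r
set.seed(502)
first_draw  <- rnorm(5)   # five standard normal draws
second_draw <- rnorm(5)   # the generator has advanced, so these differ

set.seed(502)             # reset the generator to the same starting state
third_draw  <- rnorm(5)   # repeats first_draw exactly

identical(first_draw, third_draw)    # TRUE
identical(first_draw, second_draw)   # FALSE
```

The seed fixes the starting point of the random number stream, not the value of any single call, which is why it has to be reset before each draw that should be repeated.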

Density Function (d)

If we are interested in the density function of a normal, we can use dnorm(x). The function takes a value x of the random variable and returns the height of the probability density function at x. (For a discrete distribution, such as the binomial, the corresponding d function returns the probability that the random variable equals x.) Below, we see that the maximum height of the probability density function for a standard normal random variable is approximately 0.4 (the distribution is unimodal and symmetric, so the maximum occurs at the mean, 0).
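To make the distinction between a density height and a probability concrete, here is a small sketch. The narrow standard deviation of 0.1 and the Binomial(10, 0.5) example are assumptions picked for illustration.

```r
# Continuous case: dnorm returns the height of the density curve.
peak         <- dnorm(0)            # about 0.3989
peak_formula <- 1 / sqrt(2 * pi)    # closed form for the standard normal peak
all.equal(peak, peak_formula)       # TRUE

# A density height is not a probability, so it can exceed 1.
narrow_peak <- dnorm(0, mean = 0, sd = 0.1)   # about 3.99

# Discrete case: dbinom returns actual probabilities, which sum to 1.
binom_probs <- dbinom(0:10, size = 10, prob = 0.5)
sum(binom_probs)                    # 1, up to floating point
```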

Cumulative Distribution Function (p)

Where dnorm(x) returns the height of the density at a single value (or, for a discrete distribution, the probability of that value), pnorm(x) returns the cumulative probability of the random variable taking on values up to x. As such, we can mimic the behavior of pnorm by integrating dnorm with R’s built-in integrate(f, lower, upper). For example, in the image below, we see that the probability a random draw from a standard normal is no more than 1 is about 0.84 (i.e. about 84% of the distribution lies below one standard deviation above the mean). There is also an argument to request the upper tail of the distribution instead (so pnorm(x, lower.tail=FALSE) is the same thing as 1-pnorm(x)).
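The relationship between the p and d functions can be checked numerically. This is a sketch for the standard normal only; the interval from -1 to 1 is an added illustration rather than something taken from the image.

```r
cdf_at_1  <- pnorm(1)                                     # about 0.8413
area_to_1 <- integrate(dnorm, lower = -Inf, upper = 1)$value
abs(cdf_at_1 - area_to_1) < 1e-4                          # TRUE: same quantity

# The upper tail is requested with lower.tail = FALSE.
upper_tail <- pnorm(1, lower.tail = FALSE)                # about 0.1587
all.equal(upper_tail, 1 - cdf_at_1)                       # TRUE

# Probability of landing in an interval is a difference of two CDF values.
within_one_sd <- pnorm(1) - pnorm(-1)                     # about 0.6827
```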

Quantile Function (q)

qnorm() inverts pnorm(). So while pnorm(x) answers the question ‘what is the probability of getting a value no larger than x?’, qnorm(y) answers the question ‘what value of the random variable has a proportion y of the distribution lying to its left?’. Since probabilities lie between 0 and 1, arguments to qnorm(y) outside that range return NaN with a warning.
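A short sketch of the quantile function and its edge cases follows. The probability 0.975 is an illustrative choice, and suppressWarnings is used here only to keep the out-of-range call quiet.

```r
z_975      <- qnorm(0.975)           # about 1.96
round_trip <- pnorm(qnorm(0.9))      # recovers 0.9, up to floating point

# The endpoints of the probability range map to the ends of the real line.
lowest  <- qnorm(0)                  # -Inf
highest <- qnorm(1)                  # Inf

# Outside [0, 1] there is no valid quantile: the result is NaN with a warning.
out_of_range <- suppressWarnings(qnorm(2))
is.nan(out_of_range)                 # TRUE
```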

Example/Putting It All Together

We can use the four functions described above to build an example. First, we generate 10,000 independent samples from a normal distribution with mean 100 and standard deviation 15 using rnorm. We can plot these random values with hist(). By default, the histogram function plots the frequency of events (e.g. the number of our 10,000 random samples that fall in the bin around 100). We switch this default to show the density with the argument freq=FALSE, which rescales the bars so that their areas sum to 1 (each bar's height is the proportion of samples in its bin divided by the bin width). In addition, we control the number of bins with the breaks argument, and add colors and axis titles with additional arguments to hist(). One could regard this visual as our “empirical” distribution (i.e. it represents “real-life” data being collected).
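The effect of switching from frequency to density can be inspected without drawing anything, since hist() returns both scales. This is a sketch assuming the same seed and parameters as the code listing at the end of the section; plot=FALSE suppresses the chart.

```r
set.seed(502)
draws <- rnorm(10000, mean = 100, sd = 15)

# plot = FALSE returns the binning without drawing the histogram.
h <- hist(draws, breaks = 50, plot = FALSE)

total_count <- sum(h$counts)                    # frequency scale: all 10000 draws
total_area  <- sum(h$density * diff(h$breaks))  # density scale: bar areas sum to 1
```

Because the bar areas sum to 1 on the density scale, a histogram drawn with freq=FALSE sits on the same vertical scale as the dnorm curve.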

We can also overlay the “theoretical” distribution with curve(). The “true” density of a normal distribution with mean 100 and standard deviation 15 is given by dnorm, so our main argument to the curve() function is a call to dnorm(). Additional arguments to the function allow us to overlay the curve on top of our existing histogram (add=TRUE), and specify the color and line width of the curve.

We’ve used rnorm to get the empirical distribution and dnorm to get the theoretical distribution, and we now use qnorm to find a quantile of interest, here the cutoff for the top 10% of the distribution. We shade the area of the distribution representing the right-most 10% of mass in green using polygon. There are two preliminaries to using polygon: specifying the “x” and “y” ranges of the region you want to highlight. The “x” range is from the value returned from qnorm to the right-most part of the graph. The “y” range is the height of the curve (hence dnorm) for each value in the “x” range. We then use these two ranges in the polygon function by specifying the start, intermediate, and end values in both the “x” and “y” directions. So c(mytopten, xshade, myrange[2]) starts at the value returned from qnorm, stretches through the sequence of “x” values we created with xshade, and ends at the right-most point on the graph. These are the x-values. The y-values from c(0, yshade, 0) start and end at y=0, and use the range we created from yshade to get our intermediate values. The result is the chart below.
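As a numerical companion to the shaded region, the sketch below checks that roughly 10% of the simulated values fall above the theoretical cutoff. The variable names are hypothetical, and the call to quantile() is an addition that is not part of the original example.

```r
set.seed(502)
draws <- rnorm(10000, mean = 100, sd = 15)

# Theoretical cutoff for the top 10%, about 119.2.
cutoff <- qnorm(0.1, mean = 100, sd = 15, lower.tail = FALSE)

# Share of simulated values above the cutoff; should be close to 0.1.
share_above <- mean(draws > cutoff)

# The empirical 90th percentile should land near the theoretical cutoff.
empirical_cutoff <- quantile(draws, probs = 0.9)
```

The two numbers will not match exactly, since the sample is finite, but with 10,000 draws the gap is expected to be small.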

Distribution Code

A copyable version of the code shown in the image is below.

###1. Recall r,d,q,p +norm (or binom or exp or beta or...)###
#1a. rnorm is used to generate random values from a normal#
rnorm(100)                                      #100 independent samples from N(0,1)#
set.seed(502)                                   #since random, can make reproducible by using same seed#
rnorm(10, mean=100, sd=2)           #Can specify parameters if not standard normal#



#1b. dnorm is used to get the height of the density (d functions give probabilities for discrete distributions)#
dnorm(0)                                        #Max of standard normal is ~0.4#
dnorm(1, mean=0.75, sd=0.5)        #Tighter distribution means larger values near mean#



#1c. pnorm is used to get cumulative probability (the CDF)#
val=1
pnorm(val)                                                             #The left tail of the distribution#
integrate(f=function(x) {dnorm(x)}, -Inf, val)         #CDF is integral of density; pnorm is integral of dnorm#
pnorm(val, lower.tail=FALSE)                                #The right tail of the distribution, default is TRUE#
pnorm(val, mean=0.5, sd=2, lower.tail=FALSE)    #Can specify parameters if it isn't standard normal#



#1d. qnorm is used to get the inverse of the CDF (the Quantile function)#
qnorm(0.75)                                                           #so... 75% of the time, values will be less than about 0.67#
qnorm(0.99, mean=0, sd=0.5)                               #about 1.16, i.e. a 2.33 sigma event when sd=0.5#
qnorm(2)                                                               #Doesn't make sense because has to be between 0 and 1; returns NaN with a warning#
qnorm(pnorm(2))                                                  #inverses of each other#
pnorm(qnorm(.9))                                                 #inverses of each other#



#1e. putting everything together#
mymean=100                                                                  #First parameter of distribution#
mysd=15                                                                         #Second parameter of distribution#
n=10000                                                                         #look at sample of size 10000#
mypercent=0.1                                                               #interested in top 10% of distribution#

set.seed(502)                                                                  #to make exactly reproducible#
myrandom=rnorm(n, mean=mymean, sd=mysd)         #generate n=10000 random values from distribution#
summary(myrandom)                                                     #empirical is very close to theoretical#
sd(myrandom)                                                                #very close to theoretical sd as well#

hist(myrandom,                                                               #The empirical distribution, use "r"#
     freq=FALSE,                                                               #instead of frequency, plot density#       
     main=paste0("n=", n, " Independent Samples From N(", mymean, ",", mysd, ")"),
     xlab="Observation", ylab="Density",
     col="blue",
     breaks=50)

curve(dnorm(x, mean=mymean, sd=mysd),                    #The theoretical distribution (x is a dummy variable), use "d"#
      add=TRUE, 
      col="red", 
      lwd=2)  

mytopten=qnorm(mypercent,                                        #the top 'mypercent' of the distribution, use "q"#
             mean=mymean, 
             sd=mysd, 
             lower.tail=FALSE)       
myrange=c(50, 150)    

xshade=seq(mytopten, myrange[2], length.out=100)
yshade=dnorm(xshade, mean=mymean, sd=mysd)
polygon(c(mytopten, xshade, myrange[2]), c(0, yshade, 0), 
        col=adjustcolor("green", alpha.f=0.75),   #alpha.f is the full argument name; alpha only works by partial matching#
        border=NA)

legend("topleft", 
       legend= c("Empirical Distribution", "Theoretical Distribution", "Top 10%"),
       col=c("blue", "red", "green"),
       lwd=2)