3. Probability, distribution, parameter estimates and likelihood

Random variable and probability distribution

Imagine tossing a coin. Before you make a toss, you do not know the result and you cannot affect the outcome. The set of possible future outcomes generated by such a process is called a random variable. Randomness does not mean that you know nothing about the possible outcomes of the process. You know the two outcomes that can be produced and also the expectation of getting one or the other (assuming that the coin is “fair”). A random variable can thus be described by its properties. This description of the process generating the random variable is then indicative of the expectations of individual future observations, i.e. probabilities.

We are not limited to a single observation but can consider a series of them. Then it makes sense to ask, e.g., what is the probability of getting fewer than 40 eagles in 100 tosses. If we do not fix the value at 40 but instead study the probabilities of all possible values (here from 0 to 100), we can assign a probability to each value:

p_i = P(X = x_i)

where p_i is the probability of observing the value x_i. From these probabilities we can construct the probability distribution function, defined as:

f(X) = \sum_{x_i < X} p_i

In human (non-mathematical) language, this translates as: take the probabilities of all values lower than X, compute their sum, and you get the value of the probability distribution function for the value X (Fig. 3.1a).

Another way to explore the distribution of values is to take a sample of the random variable and examine the properties of that sample. Once you take such a sample (or make a measurement), i.e. record events generated by a random variable, the corresponding values cease to be a random variable and become data. The data values may be plotted as a histogram of frequencies (Fig. 3.1b; see also chapter 2). The frequency histogram can be converted to a probability density histogram (Fig. 3.1c) by scaling the area of the histogram to 1. The density diagram has the great advantage that the probability of observing a value within a given interval can be read directly as the area of the corresponding column.

The histograms shown in Fig. 3.1 represent the sampling probability distribution or density based on the data. By contrast, the red lines indicate the theoretical probability distribution or density, i.e. how the values should look if they followed the theoretical binomial distribution, which describes the coin-tossing process. As you can see, the sampling and theoretical distributions do not match exactly, but there does not seem to be any systematic bias. The theoretical probability density can thus be viewed as an idealized density histogram.

There are many types of theoretical distributions, describing many different processes that generate random variables. Each type can further take many shapes, depending on the parameters of the probability distribution function. For example, the shape of the binomial distribution, which describes our coin-tossing problem, is defined by the parameter p, the probability of observing one of the outcomes, and size, the number of trials (tosses in our case). Coin tossing produced discrete values to which probabilities could be assigned directly, because there is a limited number of possible outcomes. This is not possible with continuous variables, as the number of possible values is infinite.
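To make the coin-tossing example concrete, here is a minimal sketch in R (see “How to do in R” at the end of this chapter). It uses the standard functions dbinom, pbinom and rbinom for the binomial distribution, which are not covered in the R overview below but work analogously to the normal-distribution functions described there:

# probability of getting fewer than 40 eagles in 100 tosses of a fair coin
sum(dbinom(0:39, size = 100, prob = 0.5))   # sum of individual probabilities p_i for x_i < 40
pbinom(39, size = 100, prob = 0.5)          # the same value from the probability distribution function

# a sample of 100 repetitions of the 100-toss experiment, shown as a density histogram
tosses <- rbinom(n = 100, size = 100, prob = 0.5)
hist(tosses, freq = FALSE)                  # compare with the theoretical density (cf. Fig. 3.1c)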
However, if you look back at the definition of the probability distribution function, this is not a problem, because for any value you can always find an interval of lower values.

Fig. 3.1. Probability (a), frequency (b) and density (c) distribution of coin tosses (n = 100, size = 100, p = 0.5). Grey histograms represent sampling statistics (prob., freq., dens.). Red lines in (a) and (c) represent the theoretical binomial probability distribution and density, respectively. (d) Standard 10-crown coin of the Austro-Hungarian Empire used for the tossing, depicted here to illustrate why we call the coin sides the Head and the Eagle instead of Brno and Lion as on the current 10 CZK coin.

Normal distribution

Among the many theoretical distribution types, we will focus on the normal (Gaussian) distribution. This distribution describes a process producing values symmetrically distributed around the center of the distribution. The normal distribution can be used to describe (or approximate) the distribution of variables measured on ratio and interval scales. It has two parameters, which define its shape (Fig. 3.2a): the central tendency (expected value), called the mean:

\mu = \frac{\sum_{i=1}^{N} X_i}{N}

i.e. the sum of all values of the variable divided by the number of objects, and the variance, which defines the spread of the probability density:

\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}

i.e. the mean squared difference of the individual values from the mean. Variance is given in squared units of the variable itself (e.g. in m² for length). Therefore, the standard deviation (σ, SD), which is simply the square root of the variance, is frequently used. The common notation for a normal distribution with mean μ and variance σ² is N(μ, σ²).

The normal distribution has non-zero probability density over the entire scale of real numbers. This implies that it may not always be suitable to approximate the distribution of some variables, e.g. physical variables such as lengths or masses, because these cannot be lower than zero. However, the normal density becomes close to zero once one moves several standard deviations (SD units) away from the mean (Fig. 3.2b). This means that the normal distribution may be used for always-positive variables (like length, mass, etc.) only if the mean is reasonably far from zero (measured in SD units). At the same time, this implies that outlying values are not expected, and normal approximation of variables containing them may be problematic.

Any normal distribution can be converted to the standard normal distribution (with mean = 0 and SD = 1) by subtracting the mean of the original distribution and dividing the values by its SD, i.e. z = (X − μ)/σ. This procedure is called standardization.

The central limit theorem is an important statement relevant to the use of the normal distribution. It states that in many situations, when independent random variables are added, their sum tends to converge to a normal distribution even if the original variables were not normally distributed. For instance, biomass production in grasslands is affected by many processes (e.g. water use by plants, photosynthesis, …), the sum of which can often be reasonably approximated by a normal distribution.

Probability computation

Knowing the probability distribution of a variable allows probabilities associated with given intervals of that variable to be computed. For instance, a producer of clothes may design T-shirt sizes to cover 95% of the population of customers if they know that body size follows a certain probability distribution, e.g. a normal distribution described by its mean and variance.
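A minimal sketch of such a computation in R uses pnorm and qnorm (described in “How to do in R” below); the body-height parameters here are made up purely for illustration:

# hypothetical body height distribution: N(mu = 175 cm, sigma^2 = 7^2); values are illustrative only
qnorm(c(0.025, 0.975), mean = 175, sd = 7)   # height interval covering the central 95% of customers
pnorm(190, mean = 175, sd = 7)               # proportion of customers shorter than 190 cm

# probability of falling within one SD of the mean, using the standard normal N(0, 1)
pnorm(1) - pnorm(-1)                         # roughly 0.68 (cf. the SD-unit intervals in Fig. 3.2b)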
Two functions are used for the conversion between values of the variable and probabilities. The probability distribution function computes the probability of observing values lower (lower tail) or higher (upper tail) than a given threshold. The quantile function is the inverse of the probability distribution function and allows computing quantiles, i.e. threshold values of the original variable associated with a given probability.

Fig. 3.2. Normal distribution: shapes of the probability density of normal distributions differing in their μ and σ² parameters (a). Illustration of SD-unit intervals and their importance for probability quantiles (note that these are quantiles of probability corresponding to the area under the density line, not quantiles produced by the quantile function) (b). Standard normal distribution with μ = 0 and σ² = 1 (c).

Parameter estimates, statistical sampling and likelihood

Probability computation can be a very informative analysis, but it requires prior knowledge of the theoretical distribution and its parameters. This is usually not the case. In most situations we have just the data, i.e. the statistical sample. This sample can be imagined as a subset of the statistical population, i.e. the possibly infinite set of all values contained in the random variable. It is then a logical step to estimate the population parameters from those of the sample. Recall the story of the prisoners in the cave from chapter one. Like them, we have information only on a fraction of reality (the sample), from which we estimate what reality (the population) looks like. Such a process of statistical inference is possible under certain conditions:

1. The type of the theoretical distribution of the population values must be known or at least assumed (the latter is what happens in reality). It cannot be derived from the data. However, it is possible to compare the sampling distribution of the data (illustrated e.g. by a histogram) with a theoretical distribution (e.g. Fig. 3.1c).

2. The data must be generated by random sampling from the population. If the sampling is not random, the parameter estimates become biased.

Population parameters are assumed to be fixed (as opposed to random) in classical statistics (sometimes called frequentist statistics). This corresponds to the fact that there is only one true value of a population parameter; no alternative truths are allowed. We cannot assign probabilities either to population parameters or to completed estimates, because probabilities can only be assigned to future outcomes of a random variable. However, we can assign a likelihood to the estimates. For continuous variables, the likelihood of a parameter value given the observed data is the product of the probability densities of the observed values, computed from the density function with that parameter value plugged in. For practical reasons, we work with log-likelihoods, where the product turns into a sum. Maximum likelihood estimation then consists of searching for the parameter values with the highest log-likelihood (Fig. 3.3).

Practically, the population parameters are estimated by computing estimators. The maximum-likelihood estimator of μ is the arithmetic mean:

\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}

The uncertainty of the estimate of the population mean can be characterized by the error associated with x̄, called the standard error of the mean (SE, s_x̄):

s_{\bar{x}} = \frac{s}{\sqrt{n}}

As you can see, the uncertainty about the population mean decreases with the square root of the number of observations. The more observations, the more precise the inference! The estimator of the population variance is the sample variance:

s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}

Note the difference in the denominator between the formulae for the sample and population variances: the strict maximum-likelihood estimator divides by n, while dividing by n − 1 corrects its slight downward bias. The sample standard deviation is s = \sqrt{s^2}. A small worked sketch of maximum likelihood estimation follows below.
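The following sketch ties the estimators to the likelihood definition above. It simulates a sample similar to the one in Fig. 3.3 (so all numbers are illustrative only) and evaluates the normal log-likelihood over a grid of candidate means, which peaks at the arithmetic mean:

set.seed(1)                                 # only to make the illustrative sample reproducible
x <- rnorm(50, mean = 10, sd = 2)           # sample of n = 50 from N(10, 4), as in Fig. 3.3

# log-likelihood of a candidate mean m (SD held at its sample estimate):
# the sum of the log densities of the observed values
loglik <- function(m) sum(dnorm(x, mean = m, sd = sd(x), log = TRUE))

candidates <- seq(8, 12, by = 0.01)         # grid of candidate values for the mean
ll <- sapply(candidates, loglik)
candidates[which.max(ll)]                   # candidate with the highest log-likelihood...
mean(x)                                     # ...matches the arithmetic mean (up to grid resolution)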
Fig. 3.3. Maximum likelihood estimation of normal distribution parameters. A sample (n = 50) was drawn from a normally distributed population with μ = 10 and σ² = 4. Maximum likelihood estimation was then performed on the sample, aiming at reconstruction of the population parameters. The mean was estimated as x̄ = 9.57 and the variance as s² = 3.37. The corresponding probability density function was plotted over the sampling density histogram (a). Log-likelihoods of a series of possible mean and variance values are plotted together with the estimated and population parameters (b, c). Note that in real-life statistical inference, the population parameters are not known.

I guess you may now think I am completely crazy. It took no less than six pages to explain all the complicated principles of probability calculation, likelihood and parameter estimation, only to end up with the simple calculation of the arithmetic mean and variance! However, you will see that it was worth it. In the following classes we will discuss other probability distributions, which are less intuitive than the normal one, so it makes sense to take a first look at what is rather intuitive and familiar. It may also seem possible to rely on the simple calculation of mean and variance without bothering about the underlying principles. But then you run the risk of misusing these statistics, for example by using the arithmetic mean to determine final grades at school (school grades do not follow the normal distribution, and the arithmetic mean is a very poor estimator of the central tendency of their distribution). Note also that the principles of statistical inference described here (e.g. the distinction between sample and population) have universal importance and represent the core of statistical theory.

How to do in R

Normal distribution probability: pnorm
The parameter q in this function refers to quantiles, i.e. values of the original variable. The parameter lower.tail, with possible values T (the default) or F, indicates whether the probability of observing a value lower or higher than the given threshold is to be computed, respectively.

Normal distribution quantile function: qnorm
The parameter p in this function refers to the probability(ies), i.e. the values of the normal probability distribution function for which the corresponding quantiles (values of the original variable) should be computed.

The function rnorm can be used to generate a sample (a series of values) from a normal distribution (it was employed e.g. for Fig. 3.3).

Functions for parameter estimates:
arithmetic mean: mean
standard error of the mean: there is no dedicated function in the default packages. The function se can be found in the package sciplot. Alternatively, it is possible to create a custom function for this:
se <- function(x) sd(x)/sqrt(length(x))
variance: var
standard deviation: sd
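Putting these functions together, a minimal worked example might look as follows (the sample is simulated, so the particular numbers are illustrative only):

x <- rnorm(50, mean = 10, sd = 2)                  # simulated sample from N(10, 4)
mean(x)                                            # estimate of the population mean
var(x)                                             # sample variance
sd(x)                                              # sample standard deviation
se <- function(x) sd(x)/sqrt(length(x))            # custom standard error function from above
se(x)                                              # standard error of the mean
pnorm(12, mean = 10, sd = 2)                       # probability of a value below 12
pnorm(12, mean = 10, sd = 2, lower.tail = FALSE)   # probability of a value above 12
qnorm(c(0.025, 0.975), mean = 10, sd = 2)          # quantiles enclosing the central 95% of values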